manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0

Replication Works unstable and we are unable to join a new node to cluster #806

Closed: pavelnemirovsky closed this issue 1 year ago

pavelnemirovsky commented 2 years ago

Describe the bug
I can't add new nodes to a cluster that currently has a single node.

To Reproduce
Steps to reproduce the behavior:

  1. I have a few indices of about 200GB each and 10-20 other indices of smaller sizes, 5-6GB each.
  2. When I try to join a new node to the cluster, it transfers a tiny portion of the data, gets stuck for some time, and fails after 20 minutes with the following error on the side of the node that tried to join the cluster (this process worked fine on 4.0.2; a minimal join sequence is sketched below these steps):
    mysql> JOIN CLUSTER DMETRICS_FTS_1 AT '10.0.82.48:9312';
    ERROR 1064 (42000): '10.0.82.123:9312': error when sending data: Broken pipe
    mysql>
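
For reference, a minimal sketch of the join sequence; the CREATE CLUSTER / ALTER CLUSTER part on the donor is assumed and the index name is a placeholder:

    # on the donor (10.0.82.48)
    mysql> CREATE CLUSTER DMETRICS_FTS_1;
    mysql> ALTER CLUSTER DMETRICS_FTS_1 ADD my_big_index;
    # on the node being joined
    mysql> JOIN CLUSTER DMETRICS_FTS_1 AT '10.0.82.48:9312';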

Expected behavior
I need to be able to recover replication and join a new node to the cluster.


tomatolog commented 2 years ago

You need to restart the donor and joiner nodes with --logreplication, then upload searchd.log from both nodes so the issue can be investigated further.
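
For example, a sketch of restarting a node with replication logging enabled (the config path is an assumption; adjust it to your installation):

    searchd --config /etc/manticoresearch/manticore.conf --stopwait
    searchd --config /etc/manticoresearch/manticore.conf --logreplication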

pavelnemirovsky commented 2 years ago

@tomatolog got you.

pavelnemirovsky commented 2 years ago

I'll create a reproducible use case.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Feel free to re-open the issue if it becomes relevant again.

pavelnemirovsky commented 2 years ago

Guys, I was able to build a pilot that reproduces the issue reported in the subject of this ticket; see the logs below. This use case represents the scenario where I have two nodes in the cluster and I am trying to bootstrap the second node from the first one to recover the cluster state, but somehow it gets stuck with the following error:

[Tue Aug 30 08:35:29.509 2022] [19369] WARNING: Could not find peer: 89a27af0-1ecb-11ed-9d1f-4e28fa9e4e27
[Tue Aug 30 08:35:29.509 2022] [19369] WARNING: 1.0 (daemon_19352_DMETRICS_FTS_1): State transfer to -1.-1 (left the group) failed: -125 (Operation canceled)
[Tue Aug 30 08:35:29.520 2022] [19369] WARNING: Protocol violation. JOIN message sender 1.0 (daemon_19352_DMETRICS_FTS_1) is not in state transfer (SYNCED). Message ignored.
[Tue Aug 30 08:52:56.900 2022] [19370] WARNING: '10.0.82.150:9312': remote error: invalid length (code=18, len=91050704)
[Tue Aug 30 08:52:56.910 2022] [19370] WARNING: last message repeated 1 times

10.0.82.16 (Master Node with safe_to_bootstrap: 1)

# Ansible managed
# https://github.com/manticoresoftware/manticoresearch/blob/master/manual/Server_settings/Searchd.md#node_address
common {
    # https://manual.manticoresearch.com/Server_settings/Common#lemmatizer_base
    lemmatizer_base = /usr/share/manticore/nlp/

    # https://manual.manticoresearch.com/Server_settings/Common#progressive_merge
    # progressive_merge =

    # https://manual.manticoresearch.com/Server_settings/Common#json_autoconv_keynames
    # json_autoconv_keynames =

    # https://manual.manticoresearch.com/Server_settings/Common#json_autoconv_numbers
    # json_autoconv_numbers = 0

    # https://manual.manticoresearch.com/Server_settings/Common#on_json_attr_error
    # on_json_attr_error = ignore_attr

    # plugin_dir =
}

searchd {
  server_id = manticore-01.dmetrics.internal
  listen = 127.0.0.1:9306:mysql
  listen = 127.0.0.1:9308:http
  listen = 10.0.82.16:9312
  listen = 10.0.82.16:9306:mysql
  listen = 10.0.82.16:9308:http
  listen = 10.0.82.16:9360-9370:replication
  pid_file = /var/run/manticore/searchd.pid
  max_packet_size = 32M
  binlog_flush = 1 # ultimate safety, low speed
  query_log_format = sphinxql
  mysql_version_string = 5.0.37
  data_dir = /var/lib/data/manticore
  binlog_path = /var/lib/data/manticore
  log = /var/log/manticore/searchd.log
  query_log = /var/log/manticore/query.log
  attr_flush_period = 60
  # Flushing RT RAM chunks each 5 min
  rt_flush_period = 300
  preopen_indexes = 0
  # Available starting from 4.2.x
  # pseudo_sharding = 1
}

Debug logs with --logreplication and SHOW STATUS output: manticore-master-node.log

10.0.82.101 (Node that wants to join to the cluster)

# Ansible managed
# https://github.com/manticoresoftware/manticoresearch/blob/master/manual/Server_settings/Searchd.md#node_address
common {
    # https://manual.manticoresearch.com/Server_settings/Common#lemmatizer_base
    lemmatizer_base = /usr/share/manticore/nlp/

    # https://manual.manticoresearch.com/Server_settings/Common#progressive_merge
    # progressive_merge =

    # https://manual.manticoresearch.com/Server_settings/Common#json_autoconv_keynames
    # json_autoconv_keynames =

    # https://manual.manticoresearch.com/Server_settings/Common#json_autoconv_numbers
    # json_autoconv_numbers = 0

    # https://manual.manticoresearch.com/Server_settings/Common#on_json_attr_error
    # on_json_attr_error = ignore_attr

    # plugin_dir =
}

searchd {
  server_id = manticore-02.dmetrics.internal
  listen = 127.0.0.1:9306:mysql
  listen = 127.0.0.1:9308:http
  listen = 10.0.82.101:9312
  listen = 10.0.82.101:9306:mysql
  listen = 10.0.82.101:9308:http
  listen = 10.0.82.101:9360-9370:replication
  pid_file = /var/run/manticore/searchd.pid
  max_packet_size = 32M
  binlog_flush = 1 # ultimate safety, low speed
  query_log_format = sphinxql
  mysql_version_string = 5.0.37
  data_dir = /var/lib/data/manticore
  binlog_path = /var/lib/data/manticore
  log = /var/log/manticore/searchd.log
  query_log = /var/log/manticore/query.log
  attr_flush_period = 60
  # Flushing RT RAM chunks each 5 min
  rt_flush_period = 300
  preopen_indexes = 0
  # Available starting from 4.2.x
  # pseudo_sharding = 1
}

Debug logs with --logreplication and SHOW STATUS output: manticore-slave-node.log

pavelnemirovsky commented 2 years ago

@tomatolog and @sanikolaev guys can you please re-open the ticket?

pavelnemirovsky commented 2 years ago

Guys, pay attention: this graph represents monitoring data for the master node that serves as the donor during cluster recovery. The process is interrupted after exactly 25 minutes, and during that time there is no data transfer over the network, just very heavy I/O on the host itself. (image attached)

tomatolog commented 2 years ago

I/O prior to SST means that the daemon runs an rsync-like process: it reads the index files and calculates hashes of file parts, so that only the parts of the index files that differ are transferred.

pavelnemirovsky commented 2 years ago

@tomatolog yes, I read the Galera documentation. The problem is that replication transfers small indices of up to 10GB, but we can't transfer any big ones; it crashes in the middle, while calculating the hashes of the file parts ...

pavelnemirovsky commented 2 years ago

@tomatolog did you check the reported logs?

tomatolog commented 2 years ago

I see no crashes in the log files you provided.

On the slave node I see the message

WARNING: ill-formed client request (length=91050704 out of bounds)

at 14:07. That warning means the slave received a packet from the donor node that exceeds the max_packet_size searchd config option (len=91050704 is about 87MB, while the configs above set max_packet_size = 32M).

Could you check that you have the same settings in the searchd section of the config for timeouts and max_packet_size?

If you are referring to a different crash, please provide the time of the event.

pavelnemirovsky commented 2 years ago

@tomatolog thanks. The configs appear above: max_packet_size = 32M and no explicit timeouts. I'll try increasing it to 128M and report back, but I doubt it is the issue because I don't see any network I/O going back and forth.
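
A sketch of the change being tested, assuming it is applied identically in the searchd section on both the donor and the joiner, followed by a restart of both daemons:

searchd {
    # ... other settings unchanged ...
    # raised from 32M while investigating the SST failure
    max_packet_size = 128M
}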

tomatolog commented 2 years ago

It is the only error I see in the log that breaks network communication between the nodes; after that, SST reports an error and exits.

If you have a different scenario where replication breaks, upload the daemon logs from the donor and joiner nodes so I can check.

pavelnemirovsky commented 2 years ago

@tomatolog since there is no network I/O while huge disk I/O is running on the donor node, it sounds to me like the socket goes idle while the process in charge of building checksums of the file parts is running, and that idle is what stops the replication process. Did you test cluster recovery on big indices, like 300GB?

pavelnemirovsky commented 2 years ago

After changing max_packet_size = 128M it looks like replication has started ... I'll report more soon. (image attached)

pavelnemirovsky commented 2 years ago

The cluster has been synced completely, @tomatolog thank you

tomatolog commented 2 years ago

I added this ticket to the backlog ticket about max_packet_size during SST and will write here when we push a fix for the original issue.

sanikolaev commented 1 year ago

binlog_flush = 1 # ultimate safety, low speed

FYI: it turned out this has been broken since Sphinx and was fixed just yesterday.
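
For context, a sketch of the binlog_flush modes as documented (worth double-checking against the current manual):

# binlog_flush = 0 -- flush and sync the binlog once per second
# binlog_flush = 1 -- flush and sync after every transaction (safest, slowest)
# binlog_flush = 2 -- flush after every transaction, sync once per second (default)
binlog_flush = 1 # the setting used in the configs above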

pavelnemirovsky commented 1 year ago

@sanikolaev thx