manticoresoftware / manticoresearch-helm

Helm chart for Manticore Search
Apache License 2.0

Manticore cluster breakdown #59

Closed: setor closed this issue 12 months ago

setor commented 1 year ago

Description

A Manticore cluster of 3 nodes with master-slave replication breaks down every time after the first worker node restarts.

The cluster is deployed by the Helm chart on Kubernetes v1.25.6 with CoreDNS 1.9.3.

The main error is: WARNING: cluster 'weox_cluster': invalid nodes '10.233.121.99:9315,10.233.86.243:9315'(10.233.65.7:9312,10.233.86.243:9312,10.233.121.99:9312), replication is disabled, error: no AF_INET address found for: manticore-worker-0.manticore-worker-svc

The DNS name manticore-worker-0.manticore-worker-svc resolves:

nslookup manticore-worker-0.manticore-worker-svc    
Server:         169.254.25.10    
Address:        169.254.25.10#53    

Name:   manticore-worker-0.manticore-worker-svc.manticore-dev.svc.cluster.local    
Address: 10.233.65.7    
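
Since the daemon complains specifically about AF_INET (IPv4), it may also be worth checking IPv4-only resolution from inside the worker pod itself. A quick sketch, using the manticore-dev namespace from the FQDN above:

```bash
# getent goes through the same NSS resolution path as the daemon;
# "ahostsv4" restricts the lookup to AF_INET records
kubectl -n manticore-dev exec manticore-worker-0 -- \
  getent ahostsv4 manticore-worker-0.manticore-worker-svc
```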

How to reproduce

Restart the first worker pod manticore-worker-0.
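
For example, assuming the default pod naming and the manticore-dev namespace from the nslookup output above:

```bash
kubectl -n manticore-dev delete pod manticore-worker-0
# then watch the pod come back and the replication warnings appear on the workers
kubectl -n manticore-dev get pods -w
```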

searchd -c /etc/manticoresearch/manticore.conf --nodetach --logreplication    
Manticore 6.0.4 1a3a4ea82@230314 (columnar 2.0.4 5a49bd7@230306) (secondary 2.0.4 5a49bd7@230306)    
Copyright (c) 2001-2016, Andrew Aksyonoff    
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)    
Copyright (c) 2017-2023, Manticore Software LTD (https://manticoresearch.com)    

[36:21.369] [915] using config file '/etc/manticoresearch/manticore.conf' (414 chars)...    
[36:21.371] [915] DEBUG: config loaded, tables 0, clusters 1    
[36:21.371] [915] DEBUG: 'read_timeout' - nothing specified, using default value 5000000    
[36:21.371] [915] DEBUG: 'network_timeout' - nothing specified, using default value 5000000    
[36:21.371] [915] DEBUG: 'sphinxql_timeout' - nothing specified, using default value 900000000    
[36:21.371] [915] DEBUG: 'client_timeout' - nothing specified, using default value 300000000    
[36:21.371] [915] DEBUG: SetMaxChildrenThreads to 16    
[36:21.371] [915] DEBUG: 'read_unhinted' - nothing specified, using default value 32768    
[36:21.371] [915] DEBUG: 'read_buffer' - nothing specified, using default value 262144    
[36:21.371] [915] DEBUG: 'read_buffer_docs' - nothing specified, using default value 262144    
[36:21.371] [915] DEBUG: 'read_buffer_hits' - nothing specified, using default value 262144    
[36:21.371] [915] DEBUG: 'attr_flush_period' - nothing specified, using default value 0    
[36:21.371] [915] DEBUG: 'max_packet_size' - nothing specified, using default value 8388608    
[36:21.371] [915] DEBUG: 'rt_merge_maxiosize' - nothing specified, using default value 0    
[36:21.371] [915] DEBUG: 'ha_ping_interval' - nothing specified, using default value 1000000    
[36:21.371] [915] DEBUG: 'ha_period_karma' - nothing specified, using default value 60000000    
[36:21.371] [915] DEBUG: 'query_log_min_msec' - nothing specified, using default value 0    
[36:21.371] [915] DEBUG: 'agent_connect_timeout' - nothing specified, using default value 1000000    
[36:21.371] [915] DEBUG: 'agent_query_timeout' - nothing specified, using default value 3000000    
[36:21.371] [915] DEBUG: 'agent_retry_delay' - nothing specified, using default value 500000    
[36:21.371] [915] DEBUG: 'net_wait_tm' - nothing specified, using default value -1    
[36:21.371] [915] DEBUG: 'docstore_cache_size' - nothing specified, using default value 16777216    
[36:21.371] [915] DEBUG: 'skiplist_cache_size' - nothing specified, using default value 67108864    
[36:21.371] [915] DEBUG: 'qcache_max_bytes' - nothing specified, using default value 16777216    
[36:21.371] [915] DEBUG: 'qcache_thresh_msec' - nothing specified, using default value 3000000    
[36:21.372] [915] DEBUG: 'qcache_ttl_sec' - nothing specified, using default value 60000000    
[36:21.372] [915] DEBUG: current working directory changed to '/var/lib/manticore'    
[36:21.373] [915] DEBUG: StartGlobalWorkpool
[36:21.375] [915] starting daemon version '6.0.4 1a3a4ea82@230314 (columnar 2.0.4 5a49bd7@230306) (secondary 2.0.4 5a49bd7@230306)' ...
[36:21.375] [915] listening on UNIX socket /var/run/mysqld/mysqld.sock
[36:21.375] [915] listening on all interfaces for mysql, port=9306
[36:21.376] [915] listening on all interfaces for sphinx and http(s), port=9308
[36:21.376] [915] listening on all interfaces for VIP mysql, port=9301
[36:21.376] [915] listening on 10.233.65.7:9312 for sphinx and http(s)
[36:21.376] [915] DEBUG: 'rt_flush_period' - nothing specified, using default value 36000000000
[36:21.376] [919] RPL: 1 clusters loaded from config
[36:21.376] [919] DEBUG: no valid tables to serve
[36:21.378] [915] DEBUG: expression stack for creation is 16. Consider to add env MANTICORE_KNOWN_CREATE_SIZE=16 to store this value persistent for this binary
[36:21.382] [915] DEBUG: expression stack for eval/deletion is 32. Consider to add env MANTICORE_KNOWN_EXPR_SIZE=32 to store this value persistent for this binary
[36:21.397] [915] DEBUG: filter stack delta is 224. Consider to add env MANTICORE_KNOWN_FILTER_SIZE=224 to store this value persistent for this binary
[36:21.397] [915] DEBUG: 'binlog_max_log_size' - nothing specified, using default value 268435456
[36:21.397] [915] DEBUG: MAC address 62:79:0c:2b:6e:e8 for uuid-short server_id
[36:21.398] [915] DEBUG: uid-short server_id 98, started 128457381, seed 7063799372944769024
[36:21.398] [921] binlog: replaying log /var/lib/manticore/data/binlog.001
[36:21.398] [921] binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
[36:21.398] [921] binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
[36:21.398] [921] binlog: replaying log /var/lib/manticore/data/binlog.001
[36:21.398] [921] binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
[36:21.398] [921] binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
[36:21.398] [921] binlog: replaying log /var/lib/manticore/data/binlog.001
[36:21.398] [921] binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
[36:21.398] [921] binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
[36:21.398] [921] binlog: finished replaying total 3 in 0.000 sec
[36:21.400] [921] DEBUG: SaveMeta: Done (/var/lib/manticore/data/binlog.meta.new)
[36:21.400] [923] prereading 0 tables
[36:21.400] [923] preread 0 tables in 0.000 sec
[36:21.408] [927] WARNING: cluster 'weox_cluster': invalid nodes '10.233.121.99:9315,10.233.86.243:9315'(10.233.65.7:9312,10.233.86.243:9312,10.233.121.99:9312), replication is disabled, error: no AF_INET address found for: manticore-worker-0.manticore-worker-svc
[36:21.408] [927] DEBUG: cluster (null) wait to finish
wsrep loader: [WARN] wsrep_unload(): null pointer.
[36:21.408] [927] DEBUG: cluster (null) finished, cluster deleted lib (nil) unloaded
[36:21.421] [915] DEBUG: dlopen(libcurl.so.4)=0x55e54cdcc250
[36:21.424] [915] accepting connections
[36:21.581] [924] DEBUG: dlopen(libzstd.so.1)=0x7f76a0000f80
[36:22.101] [920] [BUDDY] started '/usr/share/manticore/modules/manticore-buddy --listen=http://0.0.0.0:9308  --threads=16' at http://127.0.0.1:34955
command terminated with exit code 137    

And the logs from replica.php: the same error messages repeat in an endless loop (thousands of log lines per minute):

localhost - 2023-05-26 18:50:27 - 3 - Error until query processing. Query: JOIN CLUSTER weox_cluster at 'manticore-worker-0.manticore-worker-svc:9312'    
. Error: cluster 'weox_cluster', no nodes available(manticore-worker-0.manticore-worker-svc:9312), error: 'manticore-worker-0.manticore-worker-svc:9312': remote error: unknown cluster 'weox_cluster'    
localhost - 2023-05-26 18:50:28 - 3 - Query: JOIN CLUSTER weox_cluster at 'manticore-worker-0.manticore-worker-svc:9312'    
[Fri May 26 18:50:28.262 2023] [58] FATAL: unknown cluster 'weox_cluster'    
FATAL: unknown cluster 'weox_cluster'    
[Fri May 26 18:50:28.263 2023] [61] WARNING: 'manticore-worker-0.manticore-worker-svc:9312': remote error: unknown cluster 'weox_cluster'    
WARNING: 'manticore-worker-0.manticore-worker-svc:9312': remote error: unknown cluster 'weox_cluster'    
wsrep loader: [WARN] wsrep_unload(): null pointer.    
localhost - 2023-05-26 18:50:28 - 3 - Exception until query processing. Query: JOIN CLUSTER weox_cluster at 'manticore-worker-0.manticore-worker-svc:9312'    
. Error: mysqli_sql_exception: cluster 'weox_cluster', no nodes available(manticore-worker-0.manticore-worker-svc:9312), error: 'manticore-worker-0.manticore-worker-svc:9312': remote error: unknown cluster 'weox_cluster' in /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php:218    
Stack trace:    
#0 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(218): mysqli->query()    
#1 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(237): Core\Manticore\ManticoreConnector->query()    
#2 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(237): Core\Manticore\ManticoreConnector->query()    
#3 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(237): Core\Manticore\ManticoreConnector->query()    
#4 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(237): Core\Manticore\ManticoreConnector->query()    
#5 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(237): Core\Manticore\ManticoreConnector->query()    
#6 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(237): Core\Manticore\ManticoreConnector->query()    
#7 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(237): Core\Manticore\ManticoreConnector->query()    
...    
#827 {main}    

I can provide more details if needed.
It's very important for us to fix this issue, because we lose our Manticore cluster every time after any failure or maintenance.

setor commented 1 year ago

Screenshot showing the volume of logs from Manticore after the node restart (image attached).

setor commented 1 year ago

Logs from pod manticore-worker-0 after restart (lines with ManticoreConnector.php removed).

I'm interested in the line "Pods count 2": there are 3 pods in total, but one is not ready yet, because the current node is starting up and the k8s health check has not yet marked it as ready; that is probably where the issue lies...
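
From the Kubernetes side this would match how headless services behave: only pods that pass their readiness check are published as endpoints/DNS records, so a peer cannot resolve a node that is still starting. A hedged way to observe this during a restart (service and namespace names taken from the logs above):

```bash
# the restarting pod should be absent from this list until it becomes Ready
kubectl -n manticore-dev get endpoints manticore-worker-svc -o wide
```

Setting publishNotReadyAddresses: true on the headless service would make the pod's DNS record exist before readiness, though whether that is the right fix here is up to the chart maintainers.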


May 26, 2023 @ 21:36:35.629#3 {main}
May 26, 2023 @ 21:36:35.629localhost - 2023-05-26 18:36:35 - 3 - Error until query processing. Query: JOIN CLUSTER weox_cluster at 'manticore-worker-0.manticore-worker-svc:9312'
May 26, 2023 @ 21:36:35.629. Error: cluster 'weox_cluster', no nodes available(manticore-worker-0.manticore-worker-svc:9312), error: manticore-worker-0.manticore-worker-svc:9312 invalid node, no AF_INET address found for: manticore-worker-0.manticore-worker-svc
May 26, 2023 @ 21:36:35.427wsrep loader: [WARN] wsrep_unload(): null pointer.
May 26, 2023 @ 21:36:35.422localhost - 2023-05-26 18:36:35 - 3 - Query: JOIN CLUSTER weox_cluster at 'manticore-worker-0.manticore-worker-svc:9312'
May 26, 2023 @ 21:36:35.421localhost - 2023-05-26 18:36:35 - 3 - Query: show status
May 26, 2023 @ 21:36:35.421localhost - 2023-05-26 18:36:35 - 3 - Join to manticore-worker-0
May 26, 2023 @ 21:36:35.375[Fri May 26 18:36:35.374 2023] [47] [BUDDY] started '/usr/share/manticore/modules/manticore-buddy --listen=http://0.0.0.0:9308  --threads=16' at http://127.0.0.1:40821
May 26, 2023 @ 21:36:35.375[BUDDY] started '/usr/share/manticore/modules/manticore-buddy --listen=http://0.0.0.0:9308  --threads=16' at http://127.0.0.1:40821
May 26, 2023 @ 21:36:35.3752023-05-26 18:36:35,375 INFO success: searchd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
May 26, 2023 @ 21:36:34.5192023-05-26 18:36:34,518 INFO success: searchd_replica entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
May 26, 2023 @ 21:36:34.5182023-05-26 18:36:34,518 INFO success: quorum_recover entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
May 26, 2023 @ 21:36:34.518localhost - 2023-05-26 18:36:34 - 3 - Query: show status
May 26, 2023 @ 21:36:34.194[Fri May 26 18:36:34.192 2023] [43] accepting connections
May 26, 2023 @ 21:36:34.194accepting connections
May 26, 2023 @ 21:36:34.165wsrep loader: [WARN] wsrep_unload(): null pointer.
May 26, 2023 @ 21:36:34.164[Fri May 26 18:36:34.163 2023] [55] WARNING: cluster 'weox_cluster': invalid nodes '10.233.121.99:9315,10.233.86.243:9315'(10.233.86.243:9312,10.233.121.99:9312), replication is disabled, error: no AF_INET address found for: manticore-worker-0.manticore-worker-svc
May 26, 2023 @ 21:36:34.164WARNING: cluster 'weox_cluster': invalid nodes '10.233.121.99:9315,10.233.86.243:9315'(10.233.86.243:9312,10.233.121.99:9312), replication is disabled, error: no AF_INET address found for: manticore-worker-0.manticore-worker-svc
May 26, 2023 @ 21:36:34.147[Fri May 26 18:36:34.146 2023] [50] prereading 0 tables
May 26, 2023 @ 21:36:34.147prereading 0 tables
May 26, 2023 @ 21:36:34.147[Fri May 26 18:36:34.146 2023] [50] preread 0 tables in 0.000 sec
May 26, 2023 @ 21:36:34.147preread 0 tables in 0.000 sec
May 26, 2023 @ 21:36:34.143binlog: replaying log /var/lib/manticore/data/binlog.001
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: replaying log /var/lib/manticore/data/binlog.001
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
May 26, 2023 @ 21:36:34.143binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
May 26, 2023 @ 21:36:34.143binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: replaying log /var/lib/manticore/data/binlog.001
May 26, 2023 @ 21:36:34.143binlog: replaying log /var/lib/manticore/data/binlog.001
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
May 26, 2023 @ 21:36:34.143binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
May 26, 2023 @ 21:36:34.143binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: replaying log /var/lib/manticore/data/binlog.001
May 26, 2023 @ 21:36:34.143binlog: replaying log /var/lib/manticore/data/binlog.001
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
May 26, 2023 @ 21:36:34.143binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
May 26, 2023 @ 21:36:34.143binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: replaying log /var/lib/manticore/data/binlog.001
May 26, 2023 @ 21:36:34.143binlog: replaying log /var/lib/manticore/data/binlog.001
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
May 26, 2023 @ 21:36:34.143binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
May 26, 2023 @ 21:36:34.143binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.143 2023] [49] binlog: finished replaying total 4 in 0.000 sec
May 26, 2023 @ 21:36:34.143binlog: finished replaying total 4 in 0.000 sec
May 26, 2023 @ 21:36:34.108[Fri May 26 18:36:34.107 2023] [43] listening on 10.233.65.7:9312 for sphinx and http(s)
May 26, 2023 @ 21:36:34.108[Fri May 26 18:36:34.107 2023] [43] listening on all interfaces for sphinx and http(s), port=9308
May 26, 2023 @ 21:36:34.108listening on all interfaces for sphinx and http(s), port=9308
May 26, 2023 @ 21:36:34.108[Fri May 26 18:36:34.107 2023] [43] listening on all interfaces for VIP mysql, port=9301
May 26, 2023 @ 21:36:34.108listening on all interfaces for VIP mysql, port=9301
May 26, 2023 @ 21:36:34.108listening on 10.233.65.7:9312 for sphinx and http(s)
May 26, 2023 @ 21:36:34.107listening on all interfaces for mysql, port=9306
May 26, 2023 @ 21:36:34.107[Fri May 26 18:36:34.106 2023] [43] starting daemon version '6.0.4 1a3a4ea82@230314 (columnar 2.0.4 5a49bd7@230306) (secondary 2.0.4 5a49bd7@230306)' ...
May 26, 2023 @ 21:36:34.107starting daemon version '6.0.4 1a3a4ea82@230314 (columnar 2.0.4 5a49bd7@230306) (secondary 2.0.4 5a49bd7@230306)' ...
May 26, 2023 @ 21:36:34.107[Fri May 26 18:36:34.107 2023] [43] listening on UNIX socket /var/run/mysqld/mysqld.sock
May 26, 2023 @ 21:36:34.107listening on UNIX socket /var/run/mysqld/mysqld.sock
May 26, 2023 @ 21:36:34.107[Fri May 26 18:36:34.107 2023] [43] listening on all interfaces for mysql, port=9306
May 26, 2023 @ 21:36:34.098[Fri May 26 18:36:34.096 2023] [43] using config file '/etc/manticoresearch/manticore.conf' (414 chars)...
May 26, 2023 @ 21:36:34.0502023-05-26 18:36:34,049 INFO spawned: 'searchd' with pid 43
May 26, 2023 @ 21:36:33.598localhost - 2023-05-26 18:36:33 - 3 - Update nodes list ["10.233.86.243:9312","10.233.121.99:9312"]
May 26, 2023 @ 21:36:33.598localhost - 2023-05-26 18:36:33 - 3 - Save manticore.json {"clusters":{"weox_cluster":{"nodes":"10.233.86.243:9312,10.233.121.99:9312","options":"","indexes":[]}},"indexes":[]}
May 26, 2023 @ 21:36:33.596localhost - 2023-05-26 18:36:33 - 3 - Query: show status
May 26, 2023 @ 21:36:33.589localhost - 2023-05-26 18:36:33 - 3 - Query: show status
May 26, 2023 @ 21:36:33.584localhost - 2023-05-26 18:36:33 - 3 - Pods count 2
May 26, 2023 @ 21:36:33.584localhost - 2023-05-26 18:36:33 - 3 - Non empty conf
May 26, 2023 @ 21:36:33.520localhost - 2023-05-26 18:36:33 - 7 - Manticore json content: {
May 26, 2023 @ 21:36:33.520 "clusters": {
May 26, 2023 @ 21:36:33.520     "weox_cluster": {
May 26, 2023 @ 21:36:33.520         "nodes":    "10.233.65.7:9312,10.233.86.243:9312,10.233.121.99:9312",
May 26, 2023 @ 21:36:33.520         "options":  "",
May 26, 2023 @ 21:36:33.520         "indexes":  []
May 26, 2023 @ 21:36:33.520     }
May 26, 2023 @ 21:36:33.520 },
May 26, 2023 @ 21:36:33.520 "indexes":  {
May 26, 2023 @ 21:36:33.520 }
May 26, 2023 @ 21:36:33.520}
May 26, 2023 @ 21:36:33.516localhost - 2023-05-26 18:36:33 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:33.512localhost - 2023-05-26 18:36:33 - 3 - Replication mode: master-slave
May 26, 2023 @ 21:36:33.3742023-05-26 18:36:33,372 INFO spawned: 'searchd_replica' with pid 37
May 26, 2023 @ 21:36:33.3632023-05-26 18:36:33,362 INFO spawned: 'quorum_recover' with pid 36
May 26, 2023 @ 21:36:32.3522023-05-26 18:36:32,351 INFO supervisord started with pid 34
May 26, 2023 @ 21:36:32.3512023-05-26 18:36:32,351 INFO RPC interface 'supervisor' initialized
May 26, 2023 @ 21:36:32.3512023-05-26 18:36:32,351 CRIT Server 'unix_http_server' running without any HTTP authentication checking
May 26, 2023 @ 21:36:32.3432023-05-26 18:36:32,343 CRIT Supervisor is running as root.  Privileges were not dropped because no user is specified in the config file.  If you intend to run as root, you can set user=root in the config file to avoid this message.
May 26, 2023 @ 21:36:31.924Mount success
May 26, 2023 @ 21:36:31.891Manticore 6.0.4 1a3a4ea82@230314 (columnar 2.0.4 5a49bd7@230306) (secondary 2.0.4 5a49bd7@230306)
May 26, 2023 @ 21:36:21.605localhost - 2023-05-26 18:36:21 - 3 - Query: show status
May 26, 2023 @ 21:36:21.583localhost - 2023-05-26 18:36:21 - 3 - Query: show status
May 26, 2023 @ 21:36:20.604localhost - 2023-05-26 18:36:20 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:20.579localhost - 2023-05-26 18:36:20 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:19.603localhost - 2023-05-26 18:36:19 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:19.578localhost - 2023-05-26 18:36:19 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:18.601localhost - 2023-05-26 18:36:18 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:18.576localhost - 2023-05-26 18:36:18 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:17.574localhost - 2023-05-26 18:36:17 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:16.574localhost - 2023-05-26 18:36:16 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:15.573localhost - 2023-05-26 18:36:15 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:14.571localhost - 2023-05-26 18:36:14 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:13.570localhost - 2023-05-26 18:36:13 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:12.568localhost - 2023-05-26 18:36:12 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:11.567localhost - 2023-05-26 18:36:11 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:10.566localhost - 2023-05-26 18:36:10 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:09.564localhost - 2023-05-26 18:36:09 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:08.563localhost - 2023-05-26 18:36:08 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:07.562localhost - 2023-05-26 18:36:07 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:06.561localhost - 2023-05-26 18:36:06 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:05.559localhost - 2023-05-26 18:36:05 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:04.558localhost - 2023-05-26 18:36:04 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:03.556localhost - 2023-05-26 18:36:03 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:35:54.0352023-05-26 18:35:54,035 INFO stopped: searchd (exit status 0)
May 26, 2023 @ 21:35:54.029Manticore 6.0.4 1a3a4ea82@230314 (columnar 2.0.4 5a49bd7@230306) (secondary 2.0.4 5a49bd7@230306)
May 26, 2023 @ 21:35:54.029Copyright (c) 2001-2016, Andrew Aksyonoff
May 26, 2023 @ 21:35:54.029Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
May 26, 2023 @ 21:35:54.029Copyright (c) 2017-2023, Manticore Software LTD (https://manticoresearch.com)
May 26, 2023 @ 21:35:54.029
May 26, 2023 @ 21:35:54.022[Fri May 26 18:35:54.021 2023] [817] shutdown daemon version '6.0.4 1a3a4ea82@230314 (columnar 2.0.4 5a49bd7@230306) (secondary 2.0.4 5a49bd7@230306)' ...
May 26, 2023 @ 21:35:54.022shutdown daemon version '6.0.4 1a3a4ea82@230314 (columnar 2.0.4 5a49bd7@230306) (secondary 2.0.4 5a49bd7@230306)' ...
May 26, 2023 @ 21:35:54.022[Fri May 26 18:35:54.021 2023] [817] shutdown complete
May 26, 2023 @ 21:35:54.022shutdown complete
May 26, 2023 @ 21:35:53.796WARNING: ShutdownAllAlones will kill 1 threads
May 26, 2023 @ 21:35:53.796[Fri May 26 18:35:53.795 2023] [817] Kill thread 'weox_cluster_repl_0' with id 843, try 1
May 26, 2023 @ 21:35:53.796Kill thread 'weox_cluster_repl_0' with id 843, try 1
May 26, 2023 @ 21:35:53.796[Fri May 26 18:35:53.795 2023] [817] WARNING: ShutdownAllAlones will kill 1 threads
May 26, 2023 @ 21:35:53.778[Fri May 26 18:35:53.776 2023] [817] caught SIGTERM, shutting down
May 26, 2023 @ 21:35:53.778caught SIGTERM, shutting down
May 26, 2023 @ 21:35:53.7772023-05-26 18:35:53,776 INFO waiting for searchd to stop
May 26, 2023 @ 21:35:48.581localhost - 2023-05-26 18:35:48 - 3 - Query: show status
May 26, 2023 @ 21:35:33.588localhost - 2023-05-26 18:35:33 - 3 - Query: show status
May 26, 2023 @ 21:35:18.527localhost - 2023-05-26 18:35:18 - 3 - Query: show status
githubmanticore commented 1 year ago

➤ Klim Todrik commented:

Can you try the fixes from branch 6.0.5.1?

Make sure that no IP records remain in /var/lib/manticore/manticore.json in each worker pod. If they are still there, scale the workers' StatefulSet to 0 replicas and then restore it to the last replicas value.
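
A minimal sketch of that check and reset, assuming the StatefulSet follows the manticore-worker naming seen in this thread (adjust the namespace and replica count to your setup):

```bash
# any raw IPs left in the cluster nodes list indicate stale state
for i in 0 1 2; do
  kubectl -n manticore-dev exec manticore-worker-$i -- \
    grep '"nodes"' /var/lib/manticore/manticore.json
done

# if IPs are still there: recreate all workers in one go
kubectl -n manticore-dev scale statefulset manticore-worker --replicas=0
kubectl -n manticore-dev scale statefulset manticore-worker --replicas=3
```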

setor commented 1 year ago

I have tried the fixes from branch 6.0.5.1, but the cluster does not bootstrap.

manticore-worker-0.log manticore-worker-1.log manticore-worker-2.log

And on the second attempt, the same problem: secondrun-manticore-worker-0.log secondrun-manticore-worker-1.log secondrun-manticore-worker-2.log

Both times the cluster started from scratch.

djklim87 commented 1 year ago

Sorry for the late answer. We are still investigating and fixing your issue. We are now waiting for a crash fix from our core team. Once it is done, we plan to create a dev build that you can test again.

ezhan commented 1 year ago

Is bug https://github.com/manticoresoftware/manticoresearch-helm/pull/62 related to this bug? Should we test the new 6.2 version?

tomatolog commented 1 year ago

There was a fix in the main repository, 4d01e2d4fc2df3baf13cd345ae59bab4e41900d2, that was released in version 6.2.0.

It fixed a daemon shutdown on DNS resolution failure; that issue was discovered during Helm chart scale-up. You could try the 6.2.0 release to make sure this issue is also fixed.
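
A hedged upgrade sketch; the release name, repo alias, and chart reference below are placeholders, so check helm list -A and helm search repo for your actual values:

```bash
helm repo update
# upgrade the (hypothetical) "manticore" release to a chart shipping searchd 6.2.0
helm upgrade manticore manticoresearch/manticoresearch -n manticore-dev
```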

djklim87 commented 1 year ago

Is bug #62 related to this bug? Should we test the new 6.2 version?

No, this was another fix

djklim87 commented 1 year ago

is blocked by https://github.com/manticoresoftware/dev/issues/530

ezhan commented 1 year ago

is blocked by https://github.com/manticoresoftware/dev/issues/530

Wrong URL...

tomatolog commented 1 year ago

That is a private project that aggregates the issues of all other projects, which is why you do not have access to it. However, I will inform you when the daemon fix is pushed into the main repository.

tomatolog commented 1 year ago

should be fixed at https://github.com/manticoresoftware/manticoresearch/commit/fde612abe64280ace77f6823ffbaf74a04c40e59

Now the daemon should start fine even with some pods missing, and should also join the cluster fine while some pods are unavailable. You can check out any package built after that commit to get the issue fixed.

djklim87 commented 1 year ago

@ezhan Please check the new fixes in the 6.0.5.1 branch. I believe they fix your problem now.

setor commented 1 year ago

@djklim87, thank you. We tested the 6.0.5.1 branch for 5 days in production.

The cluster breakdown after removing/restarting pods is fixed.

Overall, everything is working OK.

Please notify us when the fix is released in 6.2.

Findings:

1) Sometimes there are warnings in the logs on all worker nodes; they do not affect operation:

[Wed Aug 23 17:04:20.820 2023] [92] WARNING: conn 10.20.20.23:55258(2607902), sock=419: bailing on failed MySQL header, AsyncNetInputBuffer_c::AppendData: error 11 (Resource temporarily unavailable) return -1
[Wed Aug 23 17:04:20.820 2023] [92] WARNING: send() failed: 32: Broken pipe, sock=419

2) A single node restart takes 5 minutes; replication is restored automatically (11 clustered tables, 40MB of binlogs, 5GB of data).

We have not tried restarting more than one node simultaneously, nor adding new nodes to the existing cluster.

setor commented 1 year ago

Log from the worker node after pod recreation:
manticore-worker-0.log (3MB) manticore-worker-0.zip (52KB)

sanikolaev commented 1 year ago

bailing on failed MySQL header, AsyncNetInputBuffer_c::AppendData: error 11 (Resource temporarily unavailable) return -1

This warning was moved to the debug level in release 6.2.12.

djklim87 commented 1 year ago

Please try the latest fixes from the 6.0.5.1 branch again; they include an update to the 6.2.12 Manticore Search daemon.

setor commented 1 year ago

@djklim87, I have tested the latest changes from branch 6.0.5.1. It's working, but we need more time to test in production. One pod restarted last night and produced 3MB of logs (PHP stack traces) in 3 minutes; at that rate it will kill our Elasticsearch in a few hours... Please reduce the number of lines in the PHP stack traces if possible. Log file attached (logs from the pod after restart; logs from before the restart are missing because Elasticsearch ran out of space).

One of the Manticore pods produced >30GB of logs in a few hours :( But I'm not sure whether that came from the latest deployment or from a previous one, because I'm running both versions in parallel in separate namespaces.

Errors from the client library on the website during the pod restart (the client connects through the Manticore balancer):

Sep 1, 2023 @ 23:16:21.567"index products_4: agent 10.233.86.166:9312: connect and query timed out"
Sep 1, 2023 @ 23:16:21.152"index products_1: agent 10.233.86.166:9312: connect and query timed out"
Sep 1, 2023 @ 23:16:20.681"index products_1: agent 10.233.86.166:9312: receiving failure (errno=104, msg=Connection reset by peer)"
Sep 1, 2023 @ 23:16:20.681"index products_1: agent 10.233.86.166:9312: receiving failure (errno=104, msg=Connection reset by peer)"
Sep 1, 2023 @ 23:16:20.681"index products_4: agent 10.233.86.166:9312: receiving failure (errno=104, msg=Connection reset by peer)"
Sep 1, 2023 @ 23:16:20.304"index products_4: agent 10.233.86.166:9312: connect and query timed out"
Sep 1, 2023 @ 23:16:20.241"index products_4: agent 10.233.86.166:9312: connect and query timed out"

Truncated log from the pod after restart (it restarted by itself last night); the full log is also attached below:


...

#181 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(273): Core\Manticore\ManticoreConnector->query()
#182 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(63): Core\Manticore\ManticoreConnector->fetch()
#183 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(92): Core\Manticore\ManticoreConnector->getStatus()
#184 /etc/manticoresearch/replica.php(100): Core\Manticore\ManticoreConnector->checkClusterName()
#185 {main}
localhost - 2023-09-01 04:40:28 - 3 - Error until query processing. Query: show status
. Error: MySQL server has gone away
localhost - 2023-09-01 04:40:29 - 3 - Query: show status
localhost - 2023-09-01 04:40:29 - 3 - Exception until query processing. Query: show status
. Error: mysqli_sql_exception: MySQL server has gone away in /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php:218
Stack trace:
#0 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(218): mysqli->query()
#1 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(237): Core\Manticore\ManticoreConnector->query()
#2 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(237): Core\Manticore\ManticoreConnector->query()

....

#182 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(273): Core\Manticore\ManticoreConnector->query()
#183 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(63): Core\Manticore\ManticoreConnector->fetch()
#184 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(92): Core\Manticore\ManticoreConnector->getStatus()
#185 /etc/manticoresearch/replica.php(100): Core\Manticore\ManticoreConnector->checkClusterName()
#186 {main}
PHP Fatal error:  Uncaught RuntimeException: Can't process query show status in /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php:223
Stack trace:
#0 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(237): Core\Manticore\ManticoreConnector->query()

...

#181 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(273): Core\Manticore\ManticoreConnector->query()
#182 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(63): Core\Manticore\ManticoreConnector->fetch()
#183 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(92): Core\Manticore\ManticoreConnector->getStatus()
#184 /etc/manticoresearch/replica.php(100): Core\Manticore\ManticoreConnector->checkClusterName()
#185 {main}
  thrown in /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php on line 223
2023-09-01 04:40:30,145 INFO exited: searchd_replica (exit status 255; not expected)
[Fri Sep  1 06:14:46.195 2023] [369] WARNING: last inactive check more than PT1.5S (3*evs.inactive_check_period) ago (PT1.82897S), skipping check
WARNING: last inactive check more than PT1.5S (3*evs.inactive_check_period) ago (PT1.82897S), skipping check
[Fri Sep  1 07:44:49.903 2023] [359] WARNING: secondary library not loaded; secondary index(es) disabled
WARNING: secondary library not loaded; secondary index(es) disabled
[Fri Sep  1 07:44:49.915 2023] [359] rt: table products_4: diskchunk 7(8), segments 32  saved in 8.825355 (8.839145) sec, RAM saved/new 127399195/2151830 ratio 0.950000 (soft limit 127506841, conf limit 134217728)
rt: table products_4: diskchunk 7(8), segments 32  saved in 8.825355 (8.839145) sec, RAM saved/new 127399195/2151830 ratio 0.950000 (soft limit 127506841, conf limit 134217728)
[Fri Sep  1 07:47:18.291 2023] [363] WARNING: secondary library not loaded; secondary index(es) disabled
WARNING: secondary library not loaded; secondary index(es) disabled

Full log: worker.zip (53 KB)

setor commented 1 year ago

And the full log after a manual restart of worker-0 (latest version from branch 6.0.5.1):

Columnar version mismatch
--2023-09-01 20:16:56--  https://repo.manticoresearch.com/repository/manticoresearch_jammy/dists/jammy/main/binary-amd64/manticore-columnar-lib_2.2.4-230822-5aec342_amd64.deb
Resolving repo.manticoresearch.com (repo.manticoresearch.com)... 49.12.119.254
Connecting to repo.manticoresearch.com (repo.manticoresearch.com)|49.12.119.254|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1641468 (1.6M) [application/octet-stream]
Saving to: ‘/tmp/manticore-columnar-lib_2.2.4-230822-5aec342_amd64.deb’

     0K .......... .......... .......... .......... ..........  3% 1.06M 1s
    50K .......... .......... .......... .......... ..........  6% 1.10M 1s
   100K .......... .......... .......... .......... ..........  9% 14.0M 1s
   150K .......... .......... .......... .......... .......... 12% 1.17M 1s
   200K .......... .......... .......... .......... .......... 15% 15.4M 1s
   250K .......... .......... .......... .......... .......... 18% 15.6M 1s
   300K .......... .......... .......... .......... .......... 21% 16.0M 1s
   350K .......... .......... .......... .......... .......... 24% 1.23M 1s
   400K .......... .......... .......... .......... .......... 28% 15.6M 0s
   450K .......... .......... .......... .......... .......... 31% 16.6M 0s
   500K .......... .......... .......... .......... .......... 34% 16.1M 0s
   550K .......... .......... .......... .......... .......... 37% 16.4M 0s
   600K .......... .......... .......... .......... .......... 40% 16.9M 0s
   650K .......... .......... .......... .......... .......... 43% 16.4M 0s
   700K .......... .......... .......... .......... .......... 46% 1.99M 0s
   750K .......... .......... .......... .......... .......... 49% 11.9M 0s
   800K .......... .......... .......... .......... .......... 53% 15.5M 0s
   850K .......... .......... .......... .......... .......... 56% 15.3M 0s
   900K .......... .......... .......... .......... .......... 59% 15.6M 0s
   950K .......... .......... .......... .......... .......... 62% 14.7M 0s
  1000K .......... .......... .......... .......... .......... 65% 15.6M 0s
  1050K .......... .......... .......... .......... .......... 68% 15.5M 0s
  1100K .......... .......... .......... .......... .......... 71% 15.6M 0s
  1150K .......... .......... .......... .......... .......... 74% 12.3M 0s
  1200K .......... .......... .......... .......... .......... 77% 15.7M 0s
  1250K .......... .......... .......... .......... .......... 81% 15.6M 0s
  1300K .......... .......... .......... .......... .......... 84% 6.89M 0s
  1350K .......... .......... .......... .......... .......... 87% 14.1M 0s
  1400K .......... .......... .......... .......... .......... 90% 15.8M 0s
  1450K .......... .......... .......... .......... .......... 93% 13.5M 0s
  1500K .......... .......... .......... .......... .......... 96% 15.6M 0s
  1550K .......... .......... .......... .......... .......... 99% 14.3M 0s
  1600K ..                                                    100% 5.58T=0.3s

2023-09-01 20:17:01 (5.45 MB/s) - ‘/tmp/manticore-columnar-lib_2.2.4-230822-5aec342_amd64.deb’ saved [1641468/1641468]

Mount success
2023-09-01 20:17:02,848 CRIT Supervisor is running as root.  Privileges were not dropped because no user is specified in the config file.  If you intend to run as root, you can set user=root in the config file to avoid this message.
2023-09-01 20:17:02,858 INFO RPC interface 'supervisor' initialized
2023-09-01 20:17:02,858 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2023-09-01 20:17:02,859 INFO supervisord started with pid 50
2023-09-01 20:17:03,865 INFO spawned: 'quorum_recover' with pid 51
2023-09-01 20:17:03,870 INFO spawned: 'searchd_replica' with pid 52
localhost - 2023-09-01 20:17:04 - 3 - Replication mode: multi-master
localhost - 2023-09-01 20:17:04 - 7 - Manticore json content: {
    "clusters": {
        "weox_cluster": {
            "nodes":    "manticore-worker-1.manticore-worker-svc.manticore.svc.cluster.local:9312,manticore-worker-2.manticore-worker-svc.manticore.svc.cluster.local:9312",
            "options":  "",
            "indexes":  ["products_1", "products_10", "products_11", "products_2", "products_3", "products_4", "products_5", "products_6", "products_7", "products_8", "products_9"]
        }
    },
    "indexes":  {
        "products_1":   {
            "type": "rt",
            "path": "products_1"
        },
        "products_2":   {
            "type": "rt",
            "path": "products_2"
        },
        "products_3":   {
            "type": "rt",
            "path": "products_3"
        },
        "products_4":   {
            "type": "rt",
            "path": "products_4"
        },
        "products_5":   {
            "type": "rt",
            "path": "products_5"
        },
        "products_6":   {
            "type": "rt",
            "path": "products_6"
        },
        "products_7":   {
            "type": "rt",
            "path": "products_7"
        },
        "products_8":   {
            "type": "rt",
            "path": "products_8"
        },
        "products_9":   {
            "type": "rt",
            "path": "products_9"
        },
        "products_10":  {
            "type": "rt",
            "path": "products_10"
        },
        "products_11":  {
            "type": "rt",
            "path": "products_11"
        }
    }
}
localhost - 2023-09-01 20:17:04 - 3 - Pods count 2
localhost - 2023-09-01 20:17:04 - 3 - Non empty conf
localhost - 2023-09-01 20:17:04 - 3 - Query: show status
localhost - 2023-09-01 20:17:04 - 3 - Query: show status
localhost - 2023-09-01 20:17:04 - 3 - Update nodes list ["manticore-worker-1.manticore-worker-svc.manticore.svc.cluster.local:9312","manticore-worker-2.manticore-worker-svc.manticore.svc.cluster.local:9312"]
2023-09-01 20:17:04,821 INFO spawned: 'searchd' with pid 57
[Fri Sep  1 20:17:04.856 2023] [57] WARNING: Error initializing columnar storage: MCL requires a CPU that supports SSE 4.2
[Fri Sep  1 20:17:04.856 2023] [57] WARNING: Error initializing secondary index: MCL requires a CPU that supports SSE 4.2
[Fri Sep  1 20:17:04.868 2023] [57] using config file '/etc/manticoresearch/manticore.conf' (606 chars)...
2023-09-01 20:17:04,869 INFO success: quorum_recover entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-09-01 20:17:04,869 INFO success: searchd_replica entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
[Fri Sep  1 20:17:04.873 2023] [57] WARNING: secondary_indexes set but failed to initialize secondary library: MCL requires a CPU that supports SSE 4.2
[Fri Sep  1 20:17:04.894 2023] [57] starting daemon version '6.2.12 dc5144d35@230822' ...
starting daemon version '6.2.12 dc5144d35@230822' ...
[Fri Sep  1 20:17:04.894 2023] [57] listening on UNIX socket /var/run/mysqld/mysqld.sock
listening on UNIX socket /var/run/mysqld/mysqld.sock
[Fri Sep  1 20:17:04.895 2023] [57] listening on all interfaces for mysql, port=9306
listening on all interfaces for mysql, port=9306
[Fri Sep  1 20:17:04.895 2023] [57] listening on all interfaces for sphinx and http(s), port=9308
listening on all interfaces for sphinx and http(s), port=9308
[Fri Sep  1 20:17:04.895 2023] [57] listening on all interfaces for VIP mysql, port=9301
listening on all interfaces for VIP mysql, port=9301
[Fri Sep  1 20:17:04.895 2023] [57] listening on 10.233.86.91:9312 for sphinx and http(s)
listening on 10.233.86.91:9312 for sphinx and http(s)
Manticore 6.2.12 dc5144d35@230822
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2023, Manticore Software LTD (https://manticoresearch.com)

precaching table 'products_1'
[Fri Sep  1 20:17:04.993 2023] [61] WARNING: secondary library not loaded; secondary index(es) disabled
WARNING: secondary library not loaded; secondary index(es) disabled
2023-09-01 20:17:05,996 INFO success: searchd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
localhost - 2023-09-01 20:17:06 - 3 - Manticore connect exception (localhost:9306) Connection refused
precaching table 'products_2'
localhost - 2023-09-01 20:17:07 - 3 - Manticore connect exception (localhost:9306) Connection refused
precaching table 'products_3'
localhost - 2023-09-01 20:17:08 - 3 - Manticore connect exception (localhost:9306) Connection refused
precaching table 'products_4'
[Fri Sep  1 20:17:08.537 2023] [63] WARNING: last message repeated 12 times
WARNING: last message repeated 12 times
[Fri Sep  1 20:17:08.537 2023] [63] WARNING: secondary library not loaded; secondary index(es) disabled
WARNING: secondary library not loaded; secondary index(es) disabled
localhost - 2023-09-01 20:17:09 - 3 - Manticore connect exception (localhost:9306) Connection refused
precaching table 'products_5'
[Fri Sep  1 20:17:10.042 2023] [63] WARNING: last message repeated 11 times
WARNING: last message repeated 11 times
[Fri Sep  1 20:17:10.042 2023] [63] WARNING: secondary library not loaded; secondary index(es) disabled
WARNING: secondary library not loaded; secondary index(es) disabled
localhost - 2023-09-01 20:17:10 - 3 - Manticore connect exception (localhost:9306) Connection refused
precaching table 'products_6'
localhost - 2023-09-01 20:17:11 - 3 - Manticore connect exception (localhost:9306) Connection refused
localhost - 2023-09-01 20:17:12 - 3 - Manticore connect exception (localhost:9306) Connection refused
precaching table 'products_7'
precaching table 'products_8'
[Fri Sep  1 20:17:12.805 2023] [66] WARNING: last message repeated 9 times
WARNING: last message repeated 9 times
[Fri Sep  1 20:17:12.805 2023] [66] WARNING: secondary library not loaded; secondary index(es) disabled
WARNING: secondary library not loaded; secondary index(es) disabled
localhost - 2023-09-01 20:17:13 - 3 - Manticore connect exception (localhost:9306) Connection refused
precaching table 'products_9'
precaching table 'products_10'
precaching table 'products_11'
[Fri Sep  1 20:17:13.860 2023] [72] WARNING: secondary library not loaded; secondary index(es) disabled
WARNING: secondary library not loaded; secondary index(es) disabled
localhost - 2023-09-01 20:17:14 - 3 - Manticore connect exception (localhost:9306) Connection refused
[Fri Sep  1 20:17:14.563 2023] [68] prereading 11 tables
prereading 11 tables
localhost - 2023-09-01 20:17:15 - 3 - Wait until manticore-worker-0 came alive
[Fri Sep  1 20:17:15.677 2023] [57] accepting connections
accepting connections
[Fri Sep  1 20:17:15.966 2023] [72] [BUDDY] started v1.0.18 '/usr/share/manticore/modules/manticore-buddy/bin/manticore-buddy --listen=http://0.0.0.0:9308  --threads=16' at http://127.0.0.1:45699
[BUDDY] started v1.0.18 '/usr/share/manticore/modules/manticore-buddy/bin/manticore-buddy --listen=http://0.0.0.0:9308  --threads=16' at http://127.0.0.1:45699
localhost - 2023-09-01 20:17:16 - 3 - Query: show status
localhost - 2023-09-01 20:17:16 - 3 - Cluster exist
localhost - 2023-09-01 20:17:16 - 3 - Query: show tables
localhost - 2023-09-01 20:17:16 - 3 - Call balancer 10.233.86.48:8080. Response: localhost - 2023-09-01 20:17:16 - 3 - Another process observer is already running

localhost - 2023-09-01 20:17:16 - 3 - Call balancer 10.233.65.63:8080. Response: 
localhost - 2023-09-01 20:17:16 - 3 - Call balancer 10.233.86.23:8080. Response: localhost - 2023-09-01 20:17:16 - 3 - Another process observer is already running

[Fri Sep  1 20:17:16.340 2023] [64] [BUDDY] Loaded plugins:
[BUDDY] Loaded plugins:
[Fri Sep  1 20:17:16.340 2023] [64] [BUDDY]   core: empty-string, backup, emulate-elastic, insert, select, show, cli-table, plugin, test, insert-mva
[BUDDY]   core: empty-string, backup, emulate-elastic, insert, select, show, cli-table, plugin, test, insert-mva
[Fri Sep  1 20:17:16.340 2023] [64] [BUDDY]   local: 
[BUDDY]   local: 
[Fri Sep  1 20:17:16.340 2023] [64] [BUDDY]   extra: 
[BUDDY]   extra: 
2023-09-01 20:17:16,340 INFO exited: searchd_replica (exit status 0; expected)
[Fri Sep  1 20:17:33.515 2023] [68] preread 11 tables in 18.952 sec
preread 11 tables in 18.952 sec

setor commented 1 year ago

Full log after a manual restart of worker-1 (latest version from branch 6.0.5.1).

Client logs:

Sep 2, 2023 @ 00:19:55.893"index products_4: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:55.614"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:54.988"index products_4: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:53.327"index products_8: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:53.143"index products_8: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:53.019"index products_5: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:52.198"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:51.748"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:51.038"index products_4: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:49.537"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:48.351"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:47.195"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:47.168"index products_4: agent 10.233.121.76:9312: agent closed connection"
Sep 2, 2023 @ 00:19:29.283"index products_4: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:28.452"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:28.396"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:27.670"index products_4: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:27.646"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:27.641"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:27.587"index products_4: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:26.023"index products_4: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:18:50.298"index products_4: agent 10.233.121.149:9312: receiving failure (errno=104, msg=Connection reset by peer)"
Sep 2, 2023 @ 00:18:49.896"index products_4: agent 10.233.121.149:9312: receiving failure (errno=104, msg=Connection reset by peer)"
Sep 2, 2023 @ 00:18:49.895"index products_4: agent 10.233.121.149:9312: receiving failure (errno=104, msg=Connection reset by peer)"
Sep 2, 2023 @ 00:18:49.894"index products_1: agent 10.233.121.149:9312: receiving failure (errno=104, msg=Connection reset by peer)"
Sep 2, 2023 @ 00:18:49.721"index products_11: agent 10.233.121.149:9312: query timed out"

The pod logged 3MB of PHP stack traces in 3 minutes :(

worker-1-manual-restart.zip 50KB archive

Mokto commented 1 year ago

I also tried the 6.0.5.1 branch, both on a fresh cluster and on an existing one.

I can't get any worker node to start.

worker.txt

djklim87 commented 1 year ago

I also tried the 6.0.5.1 branch, both on a fresh cluster and on an existing one.

I can't get any worker node to start.

worker.txt

[Tue Sep  5 13:13:16.298 2023] [85] FATAL: no AF_INET address found for: $ip, error -2: Name or service not known

It's strange to see this line; can you show the content of your worker's ConfigMap?

Mokto commented 1 year ago

My bad, I didn't see the change in the ConfigMap on that branch:

          listen = $hostname:9312
          listen = $hostname:9315-9415:replication
          node_address = $hostname

It works now. I will check whether the cluster breakdown issue still happens.

Thanks!

Mokto commented 1 year ago

Seems like nodes can't join the cluster:

localhost - 2023-09-06 10:13:27 - 3 - Exception until query processing. Query: JOIN CLUSTER opengeocoding_cluster at 'manticoresearch-worker-0.manticoresearch-worker-svc:9312'
. Error: mysqli_sql_exception: cluster 'opengeocoding_cluster', no nodes available(manticoresearch-worker-0.manticoresearch-worker-svc:9312), error: unknown listen protocol type 'rep' in /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php:218

Full logs: worker-new.txt

Config:

        searchd
        {
          listen = /var/run/mysqld/mysqld.sock:mysql41
          listen = 9306:mysql41
          listen = 9308:http
          listen = 9301:mysql_vip
          listen = $hostname:9312
          listen = $hostname:9315-9415:replication
          node_address = $hostname
          binlog_path = /var/lib/manticore
          log = /dev/stdout
          query_log = /dev/stdout
          query_log_format = sphinxql
          pid_file = /var/run/manticore/searchd.pid
          data_dir = /var/lib/manticore
          shutdown_timeout = 25s
          auto_optimize = 1
          auto_schema = 0
          optimize_cutoff = 4
        }
        indexer
        {
                mem_limit = 2048M
                max_iops = 3000
                max_iosize = 200M
        }

Note: every pod is up and running. I can run CREATE TABLE queries, but queries against opengeocoding_cluster return an error saying the cluster doesn't exist.
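
A quick way to check a node's view of the cluster over the SQL interface on port 9306 (a sketch; the table name t is illustrative):

    SHOW STATUS LIKE 'cluster%';     # no rows means this node knows no cluster
    CREATE TABLE t (title text);     # plain DDL works without a cluster
    INSERT INTO opengeocoding_cluster:t (id, title) VALUES (1, 'x');  # fails with "unknown cluster" on such a node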

tomatolog commented 1 year ago

Could you also change your daemon start CLI by adding --logreplication, then provide searchd.log from all pods again?
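
(For anyone following along: one way to pass the flag without rebuilding the image is to override the worker container command. A sketch only, since the chart may wire the daemon start differently:)

    command: ["searchd"]
    args: ["--nodetach", "--logreplication", "-c", "/etc/manticoresearch/manticore.conf"]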

djklim87 commented 1 year ago

Could you also change your daemon start CLI by adding --logreplication, then provide searchd.log from all pods again?

I'll build it today

djklim87 commented 1 year ago

I'll build it today

Already done. Branch still 6.0.5.1.

Please don't forget to use ImagePullPolicy: Always during our tests.
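
(For reference, that is the standard Kubernetes container field; a minimal sketch, with the image tag taken from this thread and everything else illustrative:)

    containers:
      - name: worker
        image: manticoresearch/helm-worker:6.0.5.1
        imagePullPolicy: Always   # re-pull the tag on every pod start, so rebuilt test images are picked up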

Mokto commented 1 year ago

It seems like 6.2.12.1 was last updated 11 days ago https://hub.docker.com/r/manticoresearch/helm-worker/tags

Should we try with the 6.2.12.0 image you uploaded?

djklim87 commented 1 year ago

It seems like 6.2.12.1 was last updated 11 days ago https://hub.docker.com/r/manticoresearch/helm-worker/tags

Should we try with the 6.2.12.0 image you uploaded?

Ohh, sorry, fat fingers :) It's ready in 6.2.12.1.

Mokto commented 1 year ago

I think the quotes are not properly escaped:

Query: JOIN CLUSTER opengeocoding_cluster at 'manticoresearch-worker-0.manticoresearch-worker-svc:9312'
[Tue Sep 12 04:21:41.474 2023] [81] DEBUG: P01: syntax error, unexpected identifier near 'JOIN CLUSTER opengeocoding_cluster at 'manticoresearch-worker-0.manticoresearch-worker-svc:9312''

worker.txt

djklim87 commented 1 year ago

I think the quotes are not properly escaped:

Query: JOIN CLUSTER opengeocoding_cluster at 'manticoresearch-worker-0.manticoresearch-worker-svc:9312'
[Tue Sep 12 04:21:41.474 2023] [81] DEBUG: P01: syntax error, unexpected identifier near 'JOIN CLUSTER opengeocoding_cluster at 'manticoresearch-worker-0.manticoresearch-worker-svc:9312''

worker.txt

No, because I tested this build on minikube and on our cluster. Please also attach the logs from worker-0.

Mokto commented 1 year ago

It has a similar issue:

 Query: CREATE CLUSTER opengeocoding_cluster
[Wed Sep 13 13:02:26.082 2023] [81] DEBUG: P01: syntax error, unexpected CREATE near 'CREATE CLUSTER opengeocoding_cluster'
DEBUG: P01: syntax error, unexpected CREATE near 'CREATE CLUSTER opengeocoding_cluster'

worker-1.txt worker-0.txt

djklim87 commented 1 year ago

And please attach your worker configmap:

kubectl -n manticore-helm get cm manticore-helm-manticoresearch-worker-config -o yaml

tomatolog commented 1 year ago

Not sure this is the issue, but it seems the node name is too long, i.e. from the log you provided: manticoresearch-worker-0.manticoresearch-worker-svc.databases-production-manticoresearch-opengeocoding.svc.cluster.local:9312,manticoresearch-worker-0.manticoresearch-worker-svc.databases-production-manticoresearch-opengeocoding.svc.cluster.local:9315:replication

A single node name like this is OK, but when the cluster has multiple nodes, these are put into a single string, and that string could be cut at 1024 chars.

Could you check whether a short pod/node name avoids this issue?
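
For scale, a rough estimate from the names in this thread: the FQDN above is about 120 chars, and each node appears in the list twice (once with the data port, once with the replication port), i.e. roughly 260-270 chars per node including separators, so a cluster of only 4-5 nodes already overflows a 1024-char buffer.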

tomatolog commented 1 year ago

It seems the replication provider Galera also has its own restriction limiting an individual node name to 256 chars:

https://github.com/codership/wsrep-API/blob/e988d99c8619d91dd32fda3e66eb9b098eb0cfc8/wsrep_api.h#L235 https://github.com/codership/wsrep-API/blob/e988d99c8619d91dd32fda3e66eb9b098eb0cfc8/wsrep_api.h#L322

Mokto commented 1 year ago

Seems like it worked 👍

Mokto commented 1 year ago

Thanks! I'll now test if the issue still happens.

Mokto commented 1 year ago

It seems to work really well under load & after restart.

I think there are some improvements to be made to the health probes to mark a node as down when it's not part of the cluster (I had to restart the node manually), but I guess that's a different issue!

Thanks a lot!

tomatolog commented 1 year ago

Could you provide a diff of the changes in your config, so we can be sure what exactly helped you?

Mokto commented 1 year ago

I just reduced the length of my pod names.

Mokto commented 1 year ago

Seems like my cluster broke down after a few days. Out of 5 nodes, 2 are working but the third can't start.

Note: I have quorumRecovery = true & replicationMode: multi-master
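
(For reference, a sketch of where these settings live, assuming they are ordinary chart values in the worker section of values.yaml:)

    worker:
      replicationMode: multi-master   # every node accepts writes
      quorumRecovery: true            # attempt automatic quorum recovery after node loss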

worker-2.txt

It seems to complain about binlog files missing:

FATAL: binlog: log open error: failed to open /var/lib/manticore/binlog.009: No such file or directory

Mokto commented 1 year ago

My only way to fix the cluster was to delete the persistent volumes and restart the nodes.

Seems like they are getting the data back slowly now.
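
Roughly, that recovery amounts to the following (a sketch; PVC and pod names assume default StatefulSet naming and are placeholders):

    # drop the broken node's data volume and let the StatefulSet recreate the pod;
    # on restart it rejoins the cluster and resyncs from a donor node
    kubectl -n <namespace> delete pvc data-manticoresearch-worker-2
    kubectl -n <namespace> delete pod manticoresearch-worker-2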

Mokto commented 1 year ago

Sorry for the spam.

Now I realize that I'm getting the unknown cluster 'opengeocoding_cluster' issue again.

djklim87 commented 1 year ago

Sorry for the spam.

Now I realize that I'm getting the unknown cluster 'opengeocoding_cluster' issue again.

1) I need more logs. 2) Does the cluster already exist on worker-0?

Mokto commented 1 year ago

Yes, the cluster seems to exist on worker-0.

worker-0.txt worker-2.txt


Mokto commented 1 year ago

I did more debugging:

I tried running this command:

REPLACE INTO opengeocoding_cluster:openstreetdata_houses(id,street,number,unit,city,district,region,postcode,lat,long,country_code) VALUES (4347932550137613764,'Av. Sant Joan de Caselles','15','','Canillo','','Canillo','AD100',42.5662011,1.5987579, 'ad');

by connecting directly to each worker node.

Workers 0, 2, 3 and 4 return success; worker 1 returns:

Error 1064 (42000): table 'openstreetdata_houses' is not in any cluster, use just 'openstreetdata_houses'

worker-1.txt

I also tried restarting worker 1, but it's the same issue.

I assume that if I delete the corresponding PVC and restart worker 1, it'll start working again.
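
For what it's worth, Manticore also has an SQL-level way to re-attach a table to a cluster, which might be less drastic than recreating the PVC (a sketch, assuming worker 1 is still a cluster member and its copy of the table is consistent):

    ALTER CLUSTER opengeocoding_cluster ADD openstreetdata_houses;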

tomatolog commented 1 year ago

From the logs you provided I see that node1 executed

localhost - 2023-09-15 12:48:57 - 3 - Query: JOIN CLUSTER opengeocoding_cluster at 'manticoresearch-worker-3.manticoresearch-worker-svc:9312'

and its result is unknown, as there is no verbose logging on that node.

From the logs at node0 and node2 the result is also unknown, as the last messages at these nodes are earlier than those from node1.
At node0:

[Fri Sep 15 12:09:42.642 2023] [50] DEBUG: Preread successfully finished

and at node2:

[Fri Sep 15 12:11:17.279 2023] [47] DEBUG: Preread successfully finished

I need logs from all nodes with --logreplicaion enabled to check all replication events, and the logs should have synchronized timestamps so that events from one node can be lined up with events from all the other nodes.

Or you might provide a reproducible case that we could debug locally.

I also see in the log files that there are still long node names. I'm not sure whether this relates to the issue or not, as the root cause is still not clear.

djklim87 commented 1 year ago
  1. I already built an image with --logreplicaion, so you need to restart your pods with ImagePullPolicy: Always.
  2. Let's not change the conditions: first you used master-master mode, now master-slave.
  3. I don't know why the binlog was not found (I've run into this issue sometimes, and the solution is just to clear the binlog folder; see the sketch after this list).
  4. I guess the binlog now prevents the cluster from starting. 3 of 5 nodes are down, so there is no quorum, and the quorum fixer doesn't help you there.
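
A sketch of that binlog cleanup for item 3 (namespace and pod name are placeholders; the binlog path comes from the config shown earlier in this thread):

    # remove stale binlog files on the worker's persistent volume, then restart the pod
    # so searchd starts with a clean binlog; if the pod is crash-looping, do the same
    # from a debug pod that mounts the same PVC
    kubectl -n <namespace> exec manticoresearch-worker-2 -- sh -c 'rm -f /var/lib/manticore/binlog.*'
    kubectl -n <namespace> delete pod manticoresearch-worker-2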
Mokto commented 1 year ago
  1. Pretty sure it was started with ImagePullPolicy: Always
  2. I am using multi-master and always have

For now my cluster is working as intended, so I can't provide anything additional, but I will when/if this happens again.

Mokto commented 1 year ago

@djklim87 I'm getting this now: FATAL: malformed or unknown option near '--logreplicaion'

Seems like there is a typo; it should be --logreplication.