Screenshot of the volume of logs from Manticore after a node restart.
Logs from pod manticore-worker-0 after restart (lines mentioning ManticoreConnector.php removed).
I'm interested in the line "Pods count 2": there are 3 pods in total, but one is not ready yet because the current node is still starting up and the k8s health check has not activated it yet, and this is probably where the issue lies...
```
May 26, 2023 @ 21:36:35.629#3 {main}
May 26, 2023 @ 21:36:35.629localhost - 2023-05-26 18:36:35 - 3 - Error until query processing. Query: JOIN CLUSTER weox_cluster at 'manticore-worker-0.manticore-worker-svc:9312'
May 26, 2023 @ 21:36:35.629. Error: cluster 'weox_cluster', no nodes available(manticore-worker-0.manticore-worker-svc:9312), error: manticore-worker-0.manticore-worker-svc:9312 invalid node, no AF_INET address found for: manticore-worker-0.manticore-worker-svc
May 26, 2023 @ 21:36:35.427wsrep loader: [WARN] wsrep_unload(): null pointer.
May 26, 2023 @ 21:36:35.422localhost - 2023-05-26 18:36:35 - 3 - Query: JOIN CLUSTER weox_cluster at 'manticore-worker-0.manticore-worker-svc:9312'
May 26, 2023 @ 21:36:35.421localhost - 2023-05-26 18:36:35 - 3 - Query: show status
May 26, 2023 @ 21:36:35.421localhost - 2023-05-26 18:36:35 - 3 - Join to manticore-worker-0
May 26, 2023 @ 21:36:35.375[Fri May 26 18:36:35.374 2023] [47] [BUDDY] started '/usr/share/manticore/modules/manticore-buddy --listen=http://0.0.0.0:9308 --threads=16' at http://127.0.0.1:40821
May 26, 2023 @ 21:36:35.375[BUDDY] started '/usr/share/manticore/modules/manticore-buddy --listen=http://0.0.0.0:9308 --threads=16' at http://127.0.0.1:40821
May 26, 2023 @ 21:36:35.3752023-05-26 18:36:35,375 INFO success: searchd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
May 26, 2023 @ 21:36:34.5192023-05-26 18:36:34,518 INFO success: searchd_replica entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
May 26, 2023 @ 21:36:34.5182023-05-26 18:36:34,518 INFO success: quorum_recover entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
May 26, 2023 @ 21:36:34.518localhost - 2023-05-26 18:36:34 - 3 - Query: show status
May 26, 2023 @ 21:36:34.194[Fri May 26 18:36:34.192 2023] [43] accepting connections
May 26, 2023 @ 21:36:34.194accepting connections
May 26, 2023 @ 21:36:34.165wsrep loader: [WARN] wsrep_unload(): null pointer.
May 26, 2023 @ 21:36:34.164[Fri May 26 18:36:34.163 2023] [55] WARNING: cluster 'weox_cluster': invalid nodes '10.233.121.99:9315,10.233.86.243:9315'(10.233.86.243:9312,10.233.121.99:9312), replication is disabled, error: no AF_INET address found for: manticore-worker-0.manticore-worker-svc
May 26, 2023 @ 21:36:34.164WARNING: cluster 'weox_cluster': invalid nodes '10.233.121.99:9315,10.233.86.243:9315'(10.233.86.243:9312,10.233.121.99:9312), replication is disabled, error: no AF_INET address found for: manticore-worker-0.manticore-worker-svc
May 26, 2023 @ 21:36:34.147[Fri May 26 18:36:34.146 2023] [50] prereading 0 tables
May 26, 2023 @ 21:36:34.147prereading 0 tables
May 26, 2023 @ 21:36:34.147[Fri May 26 18:36:34.146 2023] [50] preread 0 tables in 0.000 sec
May 26, 2023 @ 21:36:34.147preread 0 tables in 0.000 sec
May 26, 2023 @ 21:36:34.143binlog: replaying log /var/lib/manticore/data/binlog.001
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: replaying log /var/lib/manticore/data/binlog.001
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
May 26, 2023 @ 21:36:34.143binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
May 26, 2023 @ 21:36:34.143binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: replaying log /var/lib/manticore/data/binlog.001
May 26, 2023 @ 21:36:34.143binlog: replaying log /var/lib/manticore/data/binlog.001
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
May 26, 2023 @ 21:36:34.143binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
May 26, 2023 @ 21:36:34.143binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: replaying log /var/lib/manticore/data/binlog.001
May 26, 2023 @ 21:36:34.143binlog: replaying log /var/lib/manticore/data/binlog.001
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
May 26, 2023 @ 21:36:34.143binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
May 26, 2023 @ 21:36:34.143binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: replaying log /var/lib/manticore/data/binlog.001
May 26, 2023 @ 21:36:34.143binlog: replaying log /var/lib/manticore/data/binlog.001
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
May 26, 2023 @ 21:36:34.143binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.142 2023] [49] binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
May 26, 2023 @ 21:36:34.143binlog: finished replaying /var/lib/manticore/data/binlog.001; 0.0 MB in 0.000 sec
May 26, 2023 @ 21:36:34.143[Fri May 26 18:36:34.143 2023] [49] binlog: finished replaying total 4 in 0.000 sec
May 26, 2023 @ 21:36:34.143binlog: finished replaying total 4 in 0.000 sec
May 26, 2023 @ 21:36:34.108[Fri May 26 18:36:34.107 2023] [43] listening on 10.233.65.7:9312 for sphinx and http(s)
May 26, 2023 @ 21:36:34.108[Fri May 26 18:36:34.107 2023] [43] listening on all interfaces for sphinx and http(s), port=9308
May 26, 2023 @ 21:36:34.108listening on all interfaces for sphinx and http(s), port=9308
May 26, 2023 @ 21:36:34.108[Fri May 26 18:36:34.107 2023] [43] listening on all interfaces for VIP mysql, port=9301
May 26, 2023 @ 21:36:34.108listening on all interfaces for VIP mysql, port=9301
May 26, 2023 @ 21:36:34.108listening on 10.233.65.7:9312 for sphinx and http(s)
May 26, 2023 @ 21:36:34.107listening on all interfaces for mysql, port=9306
May 26, 2023 @ 21:36:34.107[Fri May 26 18:36:34.106 2023] [43] starting daemon version '6.0.4 1a3a4ea82@230314 (columnar 2.0.4 5a49bd7@230306) (secondary 2.0.4 5a49bd7@230306)' ...
May 26, 2023 @ 21:36:34.107starting daemon version '6.0.4 1a3a4ea82@230314 (columnar 2.0.4 5a49bd7@230306) (secondary 2.0.4 5a49bd7@230306)' ...
May 26, 2023 @ 21:36:34.107[Fri May 26 18:36:34.107 2023] [43] listening on UNIX socket /var/run/mysqld/mysqld.sock
May 26, 2023 @ 21:36:34.107listening on UNIX socket /var/run/mysqld/mysqld.sock
May 26, 2023 @ 21:36:34.107[Fri May 26 18:36:34.107 2023] [43] listening on all interfaces for mysql, port=9306
May 26, 2023 @ 21:36:34.098[Fri May 26 18:36:34.096 2023] [43] using config file '/etc/manticoresearch/manticore.conf' (414 chars)...
May 26, 2023 @ 21:36:34.0502023-05-26 18:36:34,049 INFO spawned: 'searchd' with pid 43
May 26, 2023 @ 21:36:33.598localhost - 2023-05-26 18:36:33 - 3 - Update nodes list ["10.233.86.243:9312","10.233.121.99:9312"]
May 26, 2023 @ 21:36:33.598localhost - 2023-05-26 18:36:33 - 3 - Save manticore.json {"clusters":{"weox_cluster":{"nodes":"10.233.86.243:9312,10.233.121.99:9312","options":"","indexes":[]}},"indexes":[]}
May 26, 2023 @ 21:36:33.596localhost - 2023-05-26 18:36:33 - 3 - Query: show status
May 26, 2023 @ 21:36:33.589localhost - 2023-05-26 18:36:33 - 3 - Query: show status
May 26, 2023 @ 21:36:33.584localhost - 2023-05-26 18:36:33 - 3 - Pods count 2
May 26, 2023 @ 21:36:33.584localhost - 2023-05-26 18:36:33 - 3 - Non empty conf
May 26, 2023 @ 21:36:33.520localhost - 2023-05-26 18:36:33 - 7 - Manticore json content: {
May 26, 2023 @ 21:36:33.520 "clusters": {
May 26, 2023 @ 21:36:33.520 "weox_cluster": {
May 26, 2023 @ 21:36:33.520 "nodes": "10.233.65.7:9312,10.233.86.243:9312,10.233.121.99:9312",
May 26, 2023 @ 21:36:33.520 "options": "",
May 26, 2023 @ 21:36:33.520 "indexes": []
May 26, 2023 @ 21:36:33.520 }
May 26, 2023 @ 21:36:33.520 },
May 26, 2023 @ 21:36:33.520 "indexes": {
May 26, 2023 @ 21:36:33.520 }
May 26, 2023 @ 21:36:33.520}
May 26, 2023 @ 21:36:33.516localhost - 2023-05-26 18:36:33 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:33.512localhost - 2023-05-26 18:36:33 - 3 - Replication mode: master-slave
May 26, 2023 @ 21:36:33.3742023-05-26 18:36:33,372 INFO spawned: 'searchd_replica' with pid 37
May 26, 2023 @ 21:36:33.3632023-05-26 18:36:33,362 INFO spawned: 'quorum_recover' with pid 36
May 26, 2023 @ 21:36:32.3522023-05-26 18:36:32,351 INFO supervisord started with pid 34
May 26, 2023 @ 21:36:32.3512023-05-26 18:36:32,351 INFO RPC interface 'supervisor' initialized
May 26, 2023 @ 21:36:32.3512023-05-26 18:36:32,351 CRIT Server 'unix_http_server' running without any HTTP authentication checking
May 26, 2023 @ 21:36:32.3432023-05-26 18:36:32,343 CRIT Supervisor is running as root. Privileges were not dropped because no user is specified in the config file. If you intend to run as root, you can set user=root in the config file to avoid this message.
May 26, 2023 @ 21:36:31.924Mount success
May 26, 2023 @ 21:36:31.891Manticore 6.0.4 1a3a4ea82@230314 (columnar 2.0.4 5a49bd7@230306) (secondary 2.0.4 5a49bd7@230306)
May 26, 2023 @ 21:36:21.605localhost - 2023-05-26 18:36:21 - 3 - Query: show status
May 26, 2023 @ 21:36:21.583localhost - 2023-05-26 18:36:21 - 3 - Query: show status
May 26, 2023 @ 21:36:20.604localhost - 2023-05-26 18:36:20 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:20.579localhost - 2023-05-26 18:36:20 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:19.603localhost - 2023-05-26 18:36:19 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:19.578localhost - 2023-05-26 18:36:19 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:18.601localhost - 2023-05-26 18:36:18 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:18.576localhost - 2023-05-26 18:36:18 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:17.574localhost - 2023-05-26 18:36:17 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:16.574localhost - 2023-05-26 18:36:16 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:15.573localhost - 2023-05-26 18:36:15 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:14.571localhost - 2023-05-26 18:36:14 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:13.570localhost - 2023-05-26 18:36:13 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:12.568localhost - 2023-05-26 18:36:12 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:11.567localhost - 2023-05-26 18:36:11 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:10.566localhost - 2023-05-26 18:36:10 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:09.564localhost - 2023-05-26 18:36:09 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:08.563localhost - 2023-05-26 18:36:08 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:07.562localhost - 2023-05-26 18:36:07 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:06.561localhost - 2023-05-26 18:36:06 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:05.559localhost - 2023-05-26 18:36:05 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:04.558localhost - 2023-05-26 18:36:04 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:36:03.556localhost - 2023-05-26 18:36:03 - 3 - Manticore connect exception (localhost:9306) Connection refused
May 26, 2023 @ 21:35:54.0352023-05-26 18:35:54,035 INFO stopped: searchd (exit status 0)
May 26, 2023 @ 21:35:54.029Manticore 6.0.4 1a3a4ea82@230314 (columnar 2.0.4 5a49bd7@230306) (secondary 2.0.4 5a49bd7@230306)
May 26, 2023 @ 21:35:54.029Copyright (c) 2001-2016, Andrew Aksyonoff
May 26, 2023 @ 21:35:54.029Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
May 26, 2023 @ 21:35:54.029Copyright (c) 2017-2023, Manticore Software LTD (https://manticoresearch.com)
May 26, 2023 @ 21:35:54.029
May 26, 2023 @ 21:35:54.022[Fri May 26 18:35:54.021 2023] [817] shutdown daemon version '6.0.4 1a3a4ea82@230314 (columnar 2.0.4 5a49bd7@230306) (secondary 2.0.4 5a49bd7@230306)' ...
May 26, 2023 @ 21:35:54.022shutdown daemon version '6.0.4 1a3a4ea82@230314 (columnar 2.0.4 5a49bd7@230306) (secondary 2.0.4 5a49bd7@230306)' ...
May 26, 2023 @ 21:35:54.022[Fri May 26 18:35:54.021 2023] [817] shutdown complete
May 26, 2023 @ 21:35:54.022shutdown complete
May 26, 2023 @ 21:35:53.796WARNING: ShutdownAllAlones will kill 1 threads
May 26, 2023 @ 21:35:53.796[Fri May 26 18:35:53.795 2023] [817] Kill thread 'weox_cluster_repl_0' with id 843, try 1
May 26, 2023 @ 21:35:53.796Kill thread 'weox_cluster_repl_0' with id 843, try 1
May 26, 2023 @ 21:35:53.796[Fri May 26 18:35:53.795 2023] [817] WARNING: ShutdownAllAlones will kill 1 threads
May 26, 2023 @ 21:35:53.778[Fri May 26 18:35:53.776 2023] [817] caught SIGTERM, shutting down
May 26, 2023 @ 21:35:53.778caught SIGTERM, shutting down
May 26, 2023 @ 21:35:53.7772023-05-26 18:35:53,776 INFO waiting for searchd to stop
May 26, 2023 @ 21:35:48.581localhost - 2023-05-26 18:35:48 - 3 - Query: show status
May 26, 2023 @ 21:35:33.588localhost - 2023-05-26 18:35:33 - 3 - Query: show status
May 26, 2023 @ 21:35:18.527localhost - 2023-05-26 18:35:18 - 3 - Query: show status
```
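The root error above, `no AF_INET address found for: manticore-worker-0.manticore-worker-svc`, suggests the pod's headless-service DNS record did not exist yet when the daemon tried to `JOIN CLUSTER`. Roughly, the resolution step the daemon performs amounts to something like this (a sketch only; `has_inet_address` is a hypothetical helper, not Manticore's actual code):

```python
import socket

def has_inet_address(host: str, port: int = 9312) -> bool:
    """Return True if `host` resolves to at least one IPv4 (AF_INET) address."""
    try:
        infos = socket.getaddrinfo(host, port, family=socket.AF_INET)
    except socket.gaierror:
        # NXDOMAIN / resolver failure: this is the "no AF_INET address found" case
        return False
    return len(infos) > 0

# A node name should only be passed to JOIN CLUSTER once its DNS record exists:
print(has_inet_address("localhost"))  # resolvable
print(has_inet_address("no-such-host.invalid"))  # never resolvable
```

In a StatefulSet, the per-pod DNS record for `manticore-worker-0.manticore-worker-svc` only appears once the pod passes its readiness probe, which matches the "one pod not yet ready" observation above.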
➤ Klim Todrik commented:
Can you try the fixes from branch 6.0.5.1?
Make sure there are no IP records left in /var/lib/manticore/manticore.json in each worker pod. If they are still there, scale the workers' StatefulSet to 0 replicas and then restore it to the last replicas value.
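For reference, a quick way to spot leftover raw-IP records in that file (a standalone sketch, not part of the chart; the sample below is the manticore.json content logged earlier in this thread):

```python
import json
import re

IP_RE = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")

def ip_nodes(manticore_json: str) -> list:
    """Return cluster node entries that are raw IPs rather than DNS names."""
    conf = json.loads(manticore_json)
    bad = []
    for cluster in conf.get("clusters", {}).values():
        for node in filter(None, cluster.get("nodes", "").split(",")):
            host = node.rsplit(":", 1)[0]  # strip the :9312 port suffix
            if IP_RE.match(host):
                bad.append(node)
    return bad

sample = ('{"clusters":{"weox_cluster":{"nodes":'
          '"10.233.86.243:9312,10.233.121.99:9312",'
          '"options":"","indexes":[]}},"indexes":[]}')
print(ip_nodes(sample))  # ['10.233.86.243:9312', '10.233.121.99:9312']
```

Pod IPs change on every restart, so any raw IP stored there is stale by the time the pod comes back; DNS names of the headless service are stable.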
We have tried the fixes from branch 6.0.5.1, but the cluster did not bootstrap:
manticore-worker-0.log manticore-worker-1.log manticore-worker-2.log
A second attempt hit the same problem: secondrun-manticore-worker-0.log secondrun-manticore-worker-1.log secondrun-manticore-worker-2.log
Both times the cluster started from scratch.
Sorry for the late answer. We are still investigating and fixing your issue. We are currently waiting for a crash fix from our core team. Once it is done, we plan to create a dev build that you can test again.
Is bug https://github.com/manticoresoftware/manticoresearch-helm/pull/62 related to this bug? Should we test the new 6.2 version?
There was a fix in the main repository, 4d01e2d4fc2df3baf13cd345ae59bab4e41900d2, released in version 6.2.0.
It fixed a daemon shutdown on DNS resolution failure; that issue was discovered during helm chart scale-up. You could try the 6.2.0 release to make sure this issue is also fixed.
Is bug #62 related to this bug? Should we test the new 6.2 version?
No, this was another fix
is blocked by https://github.com/manticoresoftware/dev/issues/530
Wrong URL...
That is a private project that aggregates issues from all our other projects, which is why you do not have access to it. However, I will inform you when the daemon fix is pushed to the main repository.
This should be fixed by https://github.com/manticoresoftware/manticoresearch/commit/fde612abe64280ace77f6823ffbaf74a04c40e59
Now the daemon should start properly even with some pods missing, and should also join the cluster correctly while some pods are unavailable. You can check out any package built after that commit to get the fix.
@ezhan Please check the new fixes in the 6.0.5.1 branch. I guess they fix your problem now.
@djklim87, thank you. We have tested the 6.0.5.1 branch for 5 days in production.
The cluster breakdown after removing/restarting pods is fixed.
Overall, everything is working OK.
Please notify us when the fix is released in 6.2.
Findings:
1) Sometimes there are warnings in the logs on all worker nodes; they do not affect operation:
[Wed Aug 23 17:04:20.820 2023] [92] WARNING: conn 10.20.20.23:55258(2607902), sock=419: bailing on failed MySQL header, AsyncNetInputBuffer_c::AppendData: error 11 (Resource temporarily unavailable) return -1
WARNING: conn 10.20.20.23:55258(2607902), sock=419: bailing on failed MySQL header, AsyncNetInputBuffer_c::AppendData: error 11 (Resource temporarily unavailable) return -1
[Wed Aug 23 17:04:20.820 2023] [92] WARNING: send() failed: 32: Broken pipe, sock=419
WARNING: send() failed: 32: Broken pipe, sock=419
2) A restart of one node takes 5 minutes; replication is restored automatically (11 clustered tables, 40 MB of binlogs, 5 GB of data).
We have not tried restarting more than one node simultaneously, nor adding new nodes to the existing cluster.
Log from a worker node after pod recreation:
manticore-worker-0.log (3MB)
manticore-worker-0.zip (52KB)
bailing on failed MySQL header, AsyncNetInputBuffer_c::AppendData: error 11 (Resource temporarily unavailable) return -1
This warning was moved to the debug level in release 6.2.12.
Please try the latest fixes from the 6.0.5.1 branch again; they include the update to the 6.2.12 MS daemon.
@djklim87, we have tested the latest changes from branch 6.0.5.1. They are working, but we need more time to test in production. One pod restarted last night and produced 3 MB of logs (PHP stack traces) in 3 minutes; that will kill our Elasticsearch in a few hours. Please reduce the number of lines in the PHP stack traces if possible. The log file is attached (logs from the pod after restart; logs from before the restart are missing because Elasticsearch ran out of space).
One of the Manticore pods produced >30 GB of logs in a few hours :( but I'm not sure whether that was the latest deployment or a previous one, because I'm running both versions in parallel in separate namespaces.
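To illustrate what we mean by reducing the stack-trace lines: a filter like the one below would keep the useful head and tail frames and collapse the ~180 repeated middle frames of the recursive `query()` calls (a sketch only; `truncate_trace` is a hypothetical helper, not an existing function in the chart):

```python
def truncate_trace(lines, head=3, tail=3):
    """Keep the first `head` and last `tail` frames of a PHP-style stack
    trace, replacing the middle with a single summary line."""
    if len(lines) <= head + tail + 1:
        return list(lines)
    omitted = len(lines) - head - tail
    return lines[:head] + [f"... {omitted} frames omitted ..."] + lines[-tail:]

# 186 near-identical ManticoreConnector.php frames, as in the attached log:
frames = [f"#{i} ManticoreConnector.php(237): query()" for i in range(186)]
print(len(truncate_trace(frames)))  # 7 lines instead of 186
```

That alone would cut each of these stack traces by more than 95%.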
Errors from the client library on the website during the pod restart (the client is connected to the Manticore balancer):
Sep 1, 2023 @ 23:16:21.567"index products_4: agent 10.233.86.166:9312: connect and query timed out"
Sep 1, 2023 @ 23:16:21.152"index products_1: agent 10.233.86.166:9312: connect and query timed out"
Sep 1, 2023 @ 23:16:20.681"index products_1: agent 10.233.86.166:9312: receiving failure (errno=104, msg=Connection reset by peer)"
Sep 1, 2023 @ 23:16:20.681"index products_1: agent 10.233.86.166:9312: receiving failure (errno=104, msg=Connection reset by peer)"
Sep 1, 2023 @ 23:16:20.681"index products_4: agent 10.233.86.166:9312: receiving failure (errno=104, msg=Connection reset by peer)"
Sep 1, 2023 @ 23:16:20.304"index products_4: agent 10.233.86.166:9312: connect and query timed out"
Sep 1, 2023 @ 23:16:20.241"index products_4: agent 10.233.86.166:9312: connect and query timed out"
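As a workaround on our side, transient agent errors like these can be retried from the client while the pod comes back (a sketch with illustrative names; `run_query` stands in for the real client call):

```python
import time

def query_with_retry(run_query, attempts=3, base_delay=0.5):
    """Retry transient agent failures (timeouts, connection resets) with
    exponential backoff; re-raise once all attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return run_query()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Simulate an agent that fails twice during the restart, then recovers:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionResetError("errno=104, Connection reset by peer")
    return "ok"

result = query_with_retry(flaky, base_delay=0.05)
print(result)  # "ok" after two transient failures
```

This does not fix the restart behavior itself, but it hides the brief window while the balancer still routes to the restarting agent.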
Truncated log from POD after restart (restarted by itself last night), full log also attached below:
...
#181 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(273): Core\Manticore\ManticoreConnector->query()
#182 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(63): Core\Manticore\ManticoreConnector->fetch()
#183 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(92): Core\Manticore\ManticoreConnector->getStatus()
#184 /etc/manticoresearch/replica.php(100): Core\Manticore\ManticoreConnector->checkClusterName()
#185 {main}
localhost - 2023-09-01 04:40:28 - 3 - Error until query processing. Query: show status
. Error: MySQL server has gone away
localhost - 2023-09-01 04:40:29 - 3 - Query: show status
localhost - 2023-09-01 04:40:29 - 3 - Exception until query processing. Query: show status
. Error: mysqli_sql_exception: MySQL server has gone away in /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php:218
Stack trace:
#0 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(218): mysqli->query()
#1 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(237): Core\Manticore\ManticoreConnector->query()
#2 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(237): Core\Manticore\ManticoreConnector->query()
....
#182 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(273): Core\Manticore\ManticoreConnector->query()
#183 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(63): Core\Manticore\ManticoreConnector->fetch()
#184 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(92): Core\Manticore\ManticoreConnector->getStatus()
#185 /etc/manticoresearch/replica.php(100): Core\Manticore\ManticoreConnector->checkClusterName()
#186 {main}
PHP Fatal error: Uncaught RuntimeException: Can't process query show status in /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php:223
Stack trace:
#0 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(237): Core\Manticore\ManticoreConnector->query()
...
#181 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(273): Core\Manticore\ManticoreConnector->query()
#182 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(63): Core\Manticore\ManticoreConnector->fetch()
#183 /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php(92): Core\Manticore\ManticoreConnector->getStatus()
#184 /etc/manticoresearch/replica.php(100): Core\Manticore\ManticoreConnector->checkClusterName()
#185 {main}
thrown in /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php on line 223
2023-09-01 04:40:30,145 INFO exited: searchd_replica (exit status 255; not expected)
[Fri Sep 1 06:14:46.195 2023] [369] WARNING: last inactive check more than PT1.5S (3*evs.inactive_check_period) ago (PT1.82897S), skipping check
WARNING: last inactive check more than PT1.5S (3*evs.inactive_check_period) ago (PT1.82897S), skipping check
[Fri Sep 1 07:44:49.903 2023] [359] WARNING: secondary library not loaded; secondary index(es) disabled
WARNING: secondary library not loaded; secondary index(es) disabled
[Fri Sep 1 07:44:49.915 2023] [359] rt: table products_4: diskchunk 7(8), segments 32 saved in 8.825355 (8.839145) sec, RAM saved/new 127399195/2151830 ratio 0.950000 (soft limit 127506841, conf limit 134217728)
rt: table products_4: diskchunk 7(8), segments 32 saved in 8.825355 (8.839145) sec, RAM saved/new 127399195/2151830 ratio 0.950000 (soft limit 127506841, conf limit 134217728)
[Fri Sep 1 07:47:18.291 2023] [363] WARNING: secondary library not loaded; secondary index(es) disabled
WARNING: secondary library not loaded; secondary index(es) disabled
Full log: worker.zip (53 KB).
And the full log after a manual restart of worker-0 (latest version from branch 6.0.5.1):
Columnar version mismatch
--2023-09-01 20:16:56-- https://repo.manticoresearch.com/repository/manticoresearch_jammy/dists/jammy/main/binary-amd64/manticore-columnar-lib_2.2.4-230822-5aec342_amd64.deb
Resolving repo.manticoresearch.com (repo.manticoresearch.com)... 49.12.119.254
Connecting to repo.manticoresearch.com (repo.manticoresearch.com)|49.12.119.254|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1641468 (1.6M) [application/octet-stream]
Saving to: ‘/tmp/manticore-columnar-lib_2.2.4-230822-5aec342_amd64.deb’
...
2023-09-01 20:17:01 (5.45 MB/s) - ‘/tmp/manticore-columnar-lib_2.2.4-230822-5aec342_amd64.deb’ saved [1641468/1641468]
Mount success
2023-09-01 20:17:02,848 CRIT Supervisor is running as root. Privileges were not dropped because no user is specified in the config file. If you intend to run as root, you can set user=root in the config file to avoid this message.
2023-09-01 20:17:02,858 INFO RPC interface 'supervisor' initialized
2023-09-01 20:17:02,858 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2023-09-01 20:17:02,859 INFO supervisord started with pid 50
2023-09-01 20:17:03,865 INFO spawned: 'quorum_recover' with pid 51
2023-09-01 20:17:03,870 INFO spawned: 'searchd_replica' with pid 52
localhost - 2023-09-01 20:17:04 - 3 - Replication mode: multi-master
localhost - 2023-09-01 20:17:04 - 7 - Manticore json content: {
"clusters": {
"weox_cluster": {
"nodes": "manticore-worker-1.manticore-worker-svc.manticore.svc.cluster.local:9312,manticore-worker-2.manticore-worker-svc.manticore.svc.cluster.local:9312",
"options": "",
"indexes": ["products_1", "products_10", "products_11", "products_2", "products_3", "products_4", "products_5", "products_6", "products_7", "products_8", "products_9"]
}
},
"indexes": {
"products_1": {
"type": "rt",
"path": "products_1"
},
"products_2": {
"type": "rt",
"path": "products_2"
},
"products_3": {
"type": "rt",
"path": "products_3"
},
"products_4": {
"type": "rt",
"path": "products_4"
},
"products_5": {
"type": "rt",
"path": "products_5"
},
"products_6": {
"type": "rt",
"path": "products_6"
},
"products_7": {
"type": "rt",
"path": "products_7"
},
"products_8": {
"type": "rt",
"path": "products_8"
},
"products_9": {
"type": "rt",
"path": "products_9"
},
"products_10": {
"type": "rt",
"path": "products_10"
},
"products_11": {
"type": "rt",
"path": "products_11"
}
}
}
localhost - 2023-09-01 20:17:04 - 3 - Pods count 2
localhost - 2023-09-01 20:17:04 - 3 - Non empty conf
localhost - 2023-09-01 20:17:04 - 3 - Query: show status
localhost - 2023-09-01 20:17:04 - 3 - Query: show status
localhost - 2023-09-01 20:17:04 - 3 - Update nodes list ["manticore-worker-1.manticore-worker-svc.manticore.svc.cluster.local:9312","manticore-worker-2.manticore-worker-svc.manticore.svc.cluster.local:9312"]
2023-09-01 20:17:04,821 INFO spawned: 'searchd' with pid 57
[Fri Sep 1 20:17:04.856 2023] [57] WARNING: Error initializing columnar storage: MCL requires a CPU that supports SSE 4.2
[Fri Sep 1 20:17:04.856 2023] [57] WARNING: Error initializing secondary index: MCL requires a CPU that supports SSE 4.2
[Fri Sep 1 20:17:04.868 2023] [57] using config file '/etc/manticoresearch/manticore.conf' (606 chars)...
2023-09-01 20:17:04,869 INFO success: quorum_recover entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-09-01 20:17:04,869 INFO success: searchd_replica entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
[Fri Sep 1 20:17:04.873 2023] [57] WARNING: secondary_indexes set but failed to initialize secondary library: MCL requires a CPU that supports SSE 4.2
[Fri Sep 1 20:17:04.894 2023] [57] starting daemon version '6.2.12 dc5144d35@230822' ...
starting daemon version '6.2.12 dc5144d35@230822' ...
[Fri Sep 1 20:17:04.894 2023] [57] listening on UNIX socket /var/run/mysqld/mysqld.sock
listening on UNIX socket /var/run/mysqld/mysqld.sock
[Fri Sep 1 20:17:04.895 2023] [57] listening on all interfaces for mysql, port=9306
listening on all interfaces for mysql, port=9306
[Fri Sep 1 20:17:04.895 2023] [57] listening on all interfaces for sphinx and http(s), port=9308
listening on all interfaces for sphinx and http(s), port=9308
[Fri Sep 1 20:17:04.895 2023] [57] listening on all interfaces for VIP mysql, port=9301
listening on all interfaces for VIP mysql, port=9301
[Fri Sep 1 20:17:04.895 2023] [57] listening on 10.233.86.91:9312 for sphinx and http(s)
listening on 10.233.86.91:9312 for sphinx and http(s)
Manticore 6.2.12 dc5144d35@230822
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2023, Manticore Software LTD (https://manticoresearch.com)
precaching table 'products_1'
[Fri Sep 1 20:17:04.993 2023] [61] WARNING: secondary library not loaded; secondary index(es) disabled
WARNING: secondary library not loaded; secondary index(es) disabled
2023-09-01 20:17:05,996 INFO success: searchd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
localhost - 2023-09-01 20:17:06 - 3 - Manticore connect exception (localhost:9306) Connection refused
precaching table 'products_2'
localhost - 2023-09-01 20:17:07 - 3 - Manticore connect exception (localhost:9306) Connection refused
precaching table 'products_3'
localhost - 2023-09-01 20:17:08 - 3 - Manticore connect exception (localhost:9306) Connection refused
precaching table 'products_4'
[Fri Sep 1 20:17:08.537 2023] [63] WARNING: last message repeated 12 times
WARNING: last message repeated 12 times
[Fri Sep 1 20:17:08.537 2023] [63] WARNING: secondary library not loaded; secondary index(es) disabled
WARNING: secondary library not loaded; secondary index(es) disabled
localhost - 2023-09-01 20:17:09 - 3 - Manticore connect exception (localhost:9306) Connection refused
precaching table 'products_5'
[Fri Sep 1 20:17:10.042 2023] [63] WARNING: last message repeated 11 times
WARNING: last message repeated 11 times
[Fri Sep 1 20:17:10.042 2023] [63] WARNING: secondary library not loaded; secondary index(es) disabled
WARNING: secondary library not loaded; secondary index(es) disabled
localhost - 2023-09-01 20:17:10 - 3 - Manticore connect exception (localhost:9306) Connection refused
precaching table 'products_6'
localhost - 2023-09-01 20:17:11 - 3 - Manticore connect exception (localhost:9306) Connection refused
localhost - 2023-09-01 20:17:12 - 3 - Manticore connect exception (localhost:9306) Connection refused
precaching table 'products_7'
precaching table 'products_8'
[Fri Sep 1 20:17:12.805 2023] [66] WARNING: last message repeated 9 times
WARNING: last message repeated 9 times
[Fri Sep 1 20:17:12.805 2023] [66] WARNING: secondary library not loaded; secondary index(es) disabled
WARNING: secondary library not loaded; secondary index(es) disabled
localhost - 2023-09-01 20:17:13 - 3 - Manticore connect exception (localhost:9306) Connection refused
precaching table 'products_9'
precaching table 'products_10'
precaching table 'products_11'
[Fri Sep 1 20:17:13.860 2023] [72] WARNING: secondary library not loaded; secondary index(es) disabled
WARNING: secondary library not loaded; secondary index(es) disabled
localhost - 2023-09-01 20:17:14 - 3 - Manticore connect exception (localhost:9306) Connection refused
[Fri Sep 1 20:17:14.563 2023] [68] prereading 11 tables
prereading 11 tables
localhost - 2023-09-01 20:17:15 - 3 - Wait until manticore-worker-0 came alive
[Fri Sep 1 20:17:15.677 2023] [57] accepting connections
accepting connections
[Fri Sep 1 20:17:15.966 2023] [72] [BUDDY] started v1.0.18 '/usr/share/manticore/modules/manticore-buddy/bin/manticore-buddy --listen=http://0.0.0.0:9308 --threads=16' at http://127.0.0.1:45699
[BUDDY] started v1.0.18 '/usr/share/manticore/modules/manticore-buddy/bin/manticore-buddy --listen=http://0.0.0.0:9308 --threads=16' at http://127.0.0.1:45699
localhost - 2023-09-01 20:17:16 - 3 - Query: show status
localhost - 2023-09-01 20:17:16 - 3 - Cluster exist
localhost - 2023-09-01 20:17:16 - 3 - Query: show tables
localhost - 2023-09-01 20:17:16 - 3 - Call balancer 10.233.86.48:8080. Response: localhost - 2023-09-01 20:17:16 - 3 - Another process observer is already running
localhost - 2023-09-01 20:17:16 - 3 - Call balancer 10.233.65.63:8080. Response:
localhost - 2023-09-01 20:17:16 - 3 - Call balancer 10.233.86.23:8080. Response: localhost - 2023-09-01 20:17:16 - 3 - Another process observer is already running
[Fri Sep 1 20:17:16.340 2023] [64] [BUDDY] Loaded plugins:
[BUDDY] Loaded plugins:
[Fri Sep 1 20:17:16.340 2023] [64] [BUDDY] core: empty-string, backup, emulate-elastic, insert, select, show, cli-table, plugin, test, insert-mva
[BUDDY] core: empty-string, backup, emulate-elastic, insert, select, show, cli-table, plugin, test, insert-mva
[Fri Sep 1 20:17:16.340 2023] [64] [BUDDY] local:
[BUDDY] local:
[Fri Sep 1 20:17:16.340 2023] [64] [BUDDY] extra:
[BUDDY] extra:
2023-09-01 20:17:16,340 INFO exited: searchd_replica (exit status 0; expected)
[Fri Sep 1 20:17:33.515 2023] [68] preread 11 tables in 18.952 sec
preread 11 tables in 18.952 sec
Full log after worker-1 manual restart (latest version from branch 6.0.5.1):
client logs:
Sep 2, 2023 @ 00:19:55.893"index products_4: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:55.614"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:54.988"index products_4: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:53.327"index products_8: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:53.143"index products_8: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:53.019"index products_5: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:52.198"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:51.748"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:51.038"index products_4: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:49.537"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:48.351"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:47.195"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:47.168"index products_4: agent 10.233.121.76:9312: agent closed connection"
Sep 2, 2023 @ 00:19:29.283"index products_4: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:28.452"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:28.396"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:27.670"index products_4: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:27.646"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:27.641"index products_1: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:27.587"index products_4: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:19:26.023"index products_4: agent 10.233.121.76:9312: receiving failure (errno=111, msg=Connection refused)"
Sep 2, 2023 @ 00:18:50.298"index products_4: agent 10.233.121.149:9312: receiving failure (errno=104, msg=Connection reset by peer)"
Sep 2, 2023 @ 00:18:49.896"index products_4: agent 10.233.121.149:9312: receiving failure (errno=104, msg=Connection reset by peer)"
Sep 2, 2023 @ 00:18:49.895"index products_4: agent 10.233.121.149:9312: receiving failure (errno=104, msg=Connection reset by peer)"
Sep 2, 2023 @ 00:18:49.894"index products_1: agent 10.233.121.149:9312: receiving failure (errno=104, msg=Connection reset by peer)"
Sep 2, 2023 @ 00:18:49.721"index products_11: agent 10.233.121.149:9312: query timed out"
POD logs: 3 MB of PHP stack traces in 3 minutes :(
Attachment: worker-1-manual-restart.zip (50 KB archive)
Also tried with the 6.0.5.1 branch, both on a fresh cluster and an existing one.
I can't get any worker node to start.
[Tue Sep 5 13:13:16.298 2023] [85] FATAL: no AF_INET address found for: $ip, error -2: Name or service not known
It's strange to see this line. Can you show the content of your worker's configmap?
My bad, I didn't see the change to the configmap on that branch:
listen = $hostname:9312
listen = $hostname:9315-9415:replication
node_address = $hostname
It works now. I will check whether the cluster breakdown issue still happens.
Thanks!
Seems like nodes can't join the cluster:
localhost - 2023-09-06 10:13:27 - 3 - Exception until query processing. Query: JOIN CLUSTER opengeocoding_cluster at 'manticoresearch-worker-0.manticoresearch-worker-svc:9312'
. Error: mysqli_sql_exception: cluster 'opengeocoding_cluster', no nodes available(manticoresearch-worker-0.manticoresearch-worker-svc:9312), error: unknown listen protocol type 'rep' in /etc/manticoresearch/vendor/manticoresoftware/manticoresearch-auto-replication/src/Manticore/ManticoreConnector.php:218
Full logs: worker-new.txt
Config:
searchd
{
listen = /var/run/mysqld/mysqld.sock:mysql41
listen = 9306:mysql41
listen = 9308:http
listen = 9301:mysql_vip
listen = $hostname:9312
listen = $hostname:9315-9415:replication
node_address = $hostname
binlog_path = /var/lib/manticore
log = /dev/stdout
query_log = /dev/stdout
query_log_format = sphinxql
pid_file = /var/run/manticore/searchd.pid
data_dir = /var/lib/manticore
shutdown_timeout = 25s
auto_optimize = 1
auto_schema = 0
optimize_cutoff = 4
}
indexer
{
mem_limit = 2048M
max_iops = 3000
max_iosize = 200M
}
Note: every pod is up and running. I can run CREATE TABLE queries, but queries to opengeocoding_cluster return an error saying the cluster doesn't exist.
Could you also change your daemon start CLI by adding --logreplication, then provide searchd.log from all pods again?
I'll build it today
Already done. The branch is still 6.0.5.1.
Please don't forget to use ImagePullPolicy: Always during our tests.
It seems like 6.2.12.1 was last updated 11 days ago: https://hub.docker.com/r/manticoresearch/helm-worker/tags
Should we try with the 6.2.12.0 image you uploaded?
Oh, sorry, fat fingers :) It's ready in 6.2.12.1.
I think the quotes are not properly escaped:
Query: JOIN CLUSTER opengeocoding_cluster at 'manticoresearch-worker-0.manticoresearch-worker-svc:9312'
[Tue Sep 12 04:21:41.474 2023] [81] DEBUG: P01: syntax error, unexpected identifier near 'JOIN CLUSTER opengeocoding_cluster at 'manticoresearch-worker-0.manticoresearch-worker-svc:9312''
No, because I tested this build on minikube and on our cluster. Please also attach logs from worker-0.
It has a similar issue:
Query: CREATE CLUSTER opengeocoding_cluster
[Wed Sep 13 13:02:26.082 2023] [81] DEBUG: P01: syntax error, unexpected CREATE near 'CREATE CLUSTER opengeocoding_cluster'
DEBUG: P01: syntax error, unexpected CREATE near 'CREATE CLUSTER opengeocoding_cluster'
And please attach your worker configmap:
kubectl -n manticore-helm get cm manticore-helm-manticoresearch-worker-config -o yaml
Not sure that is the issue, but it seems the node name is too long. From the log you provided:
manticoresearch-worker-0.manticoresearch-worker-svc.databases-production-manticoresearch-opengeocoding.svc.cluster.local:9312,manticoresearch-worker-0.manticoresearch-worker-svc.databases-production-manticoresearch-opengeocoding.svc.cluster.local:9315:replication
A single node name like this is OK, but when the cluster has multiple nodes, they are all put into one string, and that string could be cut at 1024 chars.
Could you check whether a short pod/node name avoids this issue?
It also seems that the replication provider, Galera, has its own limit of 256 chars on an individual node name:
https://github.com/codership/wsrep-API/blob/e988d99c8619d91dd32fda3e66eb9b098eb0cfc8/wsrep_api.h#L235 https://github.com/codership/wsrep-API/blob/e988d99c8619d91dd32fda3e66eb9b098eb0cfc8/wsrep_api.h#L322
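Both limits can be sanity-checked offline. A rough sketch in shell; the FQDN below is modeled on the one quoted from the logs and the 5-node count on this thread's cluster, both assumptions rather than values read from a live cluster:

```shell
#!/bin/sh
# Build one node entry the way searchd lists them:
# "<name>:9312,<name>:9315:replication"
mk_entry() {
  n="manticoresearch-worker-$1.manticoresearch-worker-svc.databases-production-manticoresearch-opengeocoding.svc.cluster.local"
  printf '%s:9312,%s:9315:replication' "$n" "$n"
}

entry=$(mk_entry 0)
echo "single entry: ${#entry} chars"   # Galera caps an individual node name at 256

# Concatenate 5 such entries into one comma-separated node list string
list=""
for i in 0 1 2 3 4; do
  list="${list:+$list,}$(mk_entry "$i")"
done
echo "5-node list: ${#list} chars"     # anything past 1024 could be cut off
```

With these example names the single entry comes out to 263 chars (over Galera's 256) and the 5-node list to 1319 chars (over the 1024 cut), which would be consistent with the truncated 'rep' protocol string in the earlier "unknown listen protocol type 'rep'" error.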
Seems like it worked 👍
Thanks! I'll now test if the issue still happens.
It seems to work really well under load & after restart.
I think there are some improvements to be made on the health probes to mark the node as down if it's not part of the cluster (I had to manually restart the node) but I guess it's a different issue!
Thanks a lot!
Could you provide a diff of the changes in your config so we can see what exactly helped?
I just reduced the length of my pod names.
Seems like my cluster broke down after a few days. Out of 5 nodes, 2 are working, but a third can't start.
Note: I have quorumRecovery = true and replicationMode: multi-master.
It seems to complain about binlog files missing:
FATAL: binlog: log open error: failed to open /var/lib/manticore/binlog.009: No such file or directory
My only way to fix the cluster was to delete the persistent volumes and restart the nodes.
It seems they are slowly getting the data back now.
Sorry for the spam. Now I realize that I'm getting the unknown cluster 'opengeocoding_cluster' issue again.
1) I need more logs. 2) Does the cluster already exist on worker-0?
Yes the cluster seems to exist on worker-0.
FYI:
I did more debugging:
Trying to run this command:
REPLACE INTO opengeocoding_cluster:openstreetdata_houses(id,street,number,unit,city,district,region,postcode,lat,long,country_code) VALUES (4347932550137613764,'Av. Sant Joan de Caselles','15','','Canillo','','Canillo','AD100',42.5662011,1.5987579, 'ad');
By connecting directly to each worker node:
Actually, workers 0, 2, 3 and 4 return success, but worker 1 returns:
Error 1064 (42000): table 'openstreetdata_houses' is not in any cluster, use just 'openstreetdata_houses'
I also tried to restart worker 1 but same issue.
I assume that if I delete the corresponding PVC and restart worker 1, it'll start working again.
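A possible repair, untested here and offered only as a sketch, would be to re-attach the table on the broken node with ALTER CLUSTER ... ADD, which is Manticore's statement for putting an existing local table under cluster control (the cluster and table names below are the ones from this thread):

```sql
-- On worker 1, once it has rejoined the cluster:
ALTER CLUSTER opengeocoding_cluster ADD openstreetdata_houses;

-- Then verify the cluster status variables list the table:
SHOW STATUS LIKE 'cluster_opengeocoding_cluster%';
```

Deleting the PVC and restarting would also force a full resync, but re-attaching the table avoids re-transferring all the data if the local copy is intact.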
From the logs you provided I can see that node1 executed
localhost - 2023-09-15 12:48:57 - 3 - Query: JOIN CLUSTER opengeocoding_cluster at 'manticoresearch-worker-3.manticoresearch-worker-svc:9312'
but its result is unknown, as there is no verbose logging on that node.
From the logs on node0 and node2 the result is also unknown, as the last messages on those nodes are earlier than the ones from node1.
On node0:
[Fri Sep 15 12:09:42.642 2023] [50] DEBUG: Preread successfully finished
and on node2:
[Fri Sep 15 12:11:17.279 2023] [47] DEBUG: Preread successfully finished
I need logs from all nodes with --logreplicaion enabled to check all replication events, and they should have the same timestamps so events from one node can be lined up with events from all other nodes.
Or you might provide reproducible case that we could debug here locally.
I also see in the log files that there are still long node names; I'm not sure whether this relates to the issue or not, as the root cause is still unclear.
--logreplicaion, so you need to restart your pods with ImagePullPolicy: Always.
master-master mode, now master-slave.
For now my cluster is working as intended, so I can't provide anything additional, but I will when/if this happens again.
@djklim87 I'm getting this now: FATAL: malformed or unknown option near '--logreplicaion';
Seems like there is a typo; it should be --logreplication.
Description
A Manticore cluster of 3 nodes with master-slave replication breaks down every time after the first worker node restarts.
The cluster is deployed by Helm chart on Kubernetes v1.25.6 with CoreDNS 1.9.3.
The main error is:
WARNING: cluster 'weox_cluster': invalid nodes '10.233.121.99:9315,10.233.86.243:9315'(10.233.65.7:9312,10.233.86.243:9312,10.233.121.99:9312), replication is disabled, error: no AF_INET address found for: manticore-worker-0.manticore-worker-svc
The DNS name manticore-worker-0.manticore-worker-svc does resolve.
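One way to double-check that claim from inside the cluster (a sketch; the service name is the one from the WARNING above, and the kubectl invocation is an assumption about your setup) is to exec into a neighbouring pod and ask the resolver directly, since searchd specifically needs an IPv4 (AF_INET) record:

```shell
#!/bin/sh
# Helper: does this name resolve to an IPv4 (A) record?
has_a_record() { getent ahostsv4 "$1" >/dev/null 2>&1; }

# Run inside a worker pod, e.g. via: kubectl exec -it manticore-worker-1 -- sh
name="manticore-worker-0.manticore-worker-svc"   # name from the WARNING above
if has_a_record "$name"; then
  echo "OK: $name has an AF_INET address"
else
  echo "FAIL: no AF_INET address for $name (the condition searchd reports)"
fi
```

If the name only resolves after the pod passes its readiness probe (the default for a non-headless or non-publishNotReadyAddresses service), the restarting node can see exactly this failure during startup.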
How to reproduce
Restart the first worker POD manticore-worker-0. Logs from replica.php then repeat the same error messages in an endless loop (thousands of log lines per minute).
I can provide more details if needed.
It's very important for us to fix this issue, because we lose our Manticore cluster after every failure or maintenance.