Closed tinto82 closed 5 years ago
@tinto82 thanks for reporting! So now we seem to have a second data point for https://github.com/erlio/vernemq/issues/715
I understand this happens completely randomly? (That is, you don't have any observation of how and when it happens, or what context could make it more likely, etc.)
@dergraf more ideas how to address this? (I marked it as a bug in #715)
Hi ioolkos,
yes, it happens randomly. Is there something I can do to help you understand the problem? Some data to retrieve or some test to run?
Thank you, Mauro
@tinto82 thanks for offering your help! Will let you know as soon as we have some more hypotheses/ideas.
@ioolkos ok, i'll wait for news. Thank you
Let me add more information to give better context. In 6 months of operation before July we didn't have any problems. The number of subscribers is growing slowly and steadily: we now gain roughly 400-450 new subscribers every month, and in July the total was around 2000. From our experience the problem seems to occur after some days of uptime. The liveness check we added with the last deploy is used to restart a pod when it gets stuck. We also have a test environment very similar to the production one (where we see the problem); in that environment the nodes are preemptible, so they restart within 24 hours. In the test environment we tried 28000 subscribers and publishing between 500 and 2000 messages per second, and similar problems did not occur.
Hi, in the last 14 days our 3 VerneMQ pods restarted 15 times, with pod lifetimes between 9 hours and almost 12 days. Hope that can help you.
@tinto82 Could be related, but the listener.vmq.default should be configured with a specific IP:PORT that is reachable by the other pods instead of using 0.0.0.0.
Can you exclude that the container went out of memory and terminated for that reason?
@dergraf on startup the configuration is updated and the 0.0.0.0 IP is replaced with the pod's own. I can rule out a memory problem: e.g. yesterday a pod restarted while its used memory was around 400 MB, and the pod memory limit is 3 GB.
The log error is:
14:55:56.244 [error] Supervisor plumtree_sup had child plumtree_metadata_manager started with plumtree_metadata_manager:start_link() at <0.225.0> exit with reason no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in context child_terminated
14:55:56.244 [error] Supervisor plumtree_sup had child plumtree_metadata_manager started with plumtree_metadata_manager:start_link() at <0.225.0> exit with reason no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in context child_terminated
14:55:56.244 [error] CRASH REPORT Process plumtree_broadcast with 0 neighbours exited with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...} in gen_server2:terminate/3 line 995
14:55:56.244 [error] gen_server plumtree_broadcast terminated with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...}
14:55:56.243 [error] CRASH REPORT Process plumtree_metadata_manager with 0 neighbours exited with reason: no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in gen_server:terminate/7 line 812
14:55:56.243 [error] gen_server plumtree_metadata_manager terminated with reason: no case clause matching [] in plumtree_metadata_manager:store/2 line 472
14:55:56.239 [error] got unknown sync event in wait_for_offline state {cleanup,session_taken_over}
@dergraf in addition to the 0.0.0.0 IP, do you notice anything else strange in the configuration? Could subscriber/publisher behavior be a cause of the problem? Some devices connect and disconnect very often: could that affect the stability of the application?
Well, it depends: do your subscribers/publishers always use a random client id when connecting? If they do, this could certainly become an issue.
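To illustrate the point about random client ids: deriving a deterministic id per device lets the broker take over the old session on reconnect instead of accumulating abandoned ones. A minimal sketch, assuming a per-device serial number (the `stable_client_id` helper and the `dev-` prefix are illustrative, not part of VerneMQ):

```python
import hashlib

def stable_client_id(device_serial: str, channel: int = 1) -> str:
    """Derive the same MQTT client id for a device on every reconnect.

    A fresh random id on every connect leaves a trail of orphaned
    sessions on the broker; a stable id lets the broker apply normal
    session takeover when the device reconnects.
    """
    digest = hashlib.sha1(device_serial.encode("utf-8")).hexdigest()[:12]
    return f"dev-{digest}-{channel}"
```

The same serial always yields the same id, while different devices (or channels) never collide in practice.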
Hi @tinto82, do the clients you have use clean_session=false or clean_session=true, or a mix? Do you use allow_multiple_session=on?
Due to the 14:55:56.244 [error] gen_server plumtree_broadcast terminated with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},
this looks like the same issue as erlio/vernemq#715 and erlio/vernemq#547, so I'm trying to figure out what those three issues have in common.
So far the only common denominator is you run docker, but it's hard to understand why that would play a role...
Hi @dergraf, every device has a specific client id.
Hi @larshesel, clients use clean_session=true and we don't specify allow_multiple_session in our configuration; the default value is off, right? Thank you
Yes, that is correct - it defaults to off - so my suspicion that allow_multiple_sessions had something to do with it was wrong :) Need to form another hypothesis.
Hi everybody, we noticed that sometimes some clients misbehave in a way we are still investigating: they repeatedly connect every second with the same client_id, without subscribing to any topic. Alongside these repeated connections we find this warning in the logs:
08:55:07.113 [warning] can't migrate the queue[<0.4193.61>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/eee/ffffffff/hhh/k">>} due to no responsible remote node found
which I have reported in issue https://github.com/erlio/vernemq/issues/862. Can this be related to the problem of the stuck pods?
Thank you
Hi, today a client started making connections at a higher rate than in the cases seen so far (roughly 12 connects/second). Coinciding with the start of these connections, the VerneMQ restarts began. The logs are below:
09:25:17.939 [warning] can't migrate the queue[<0.27547.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:25:24.371 [warning] can't migrate the queue[<0.27793.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:25:27.845 [warning] can't migrate the queue[<0.27958.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:25:33.914 [warning] can't migrate the queue[<0.28176.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:25:37.726 [warning] can't migrate the queue[<0.28283.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:26:01.844 [warning] can't migrate the queue[<0.29285.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:01.133 [warning] can't migrate the queue[<0.31830.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:02.441 [warning] can't migrate the queue[<0.31887.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:09.074 [warning] can't migrate the queue[<0.32172.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:13.133 [warning] can't migrate the queue[<0.32374.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:19.119 [warning] can't migrate the queue[<0.32613.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:20.325 [warning] can't migrate the queue[<0.32659.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:21.590 [warning] can't migrate the queue[<0.32714.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:21.591 [warning] can't migrate the queue[<0.32714.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:21.592 [warning] can't migrate the queue[<0.32714.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:24.121 [warning] can't migrate the queue[<0.54.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:40.013 [warning] can't migrate the queue[<0.734.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:53.927 [warning] can't migrate the queue[<0.1330.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:56.353 [warning] can't migrate the queue[<0.1449.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:28:09.609 [warning] can't migrate the queue[<0.2047.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:28:09.609 [warning] can't migrate the queue[<0.2047.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
09:28:19.384 [info] completed metadata exchange with 'VerneMQ@vmq-0.local'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 4 keys
09:28:29.381 [info] completed metadata exchange with 'VerneMQ@vmq-2.local'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 3 keys
09:28:29.508 [warning] client {[],<<"aaaa/rrrrrrrrrrrr/ccc/DD000015/eee/1">>} with username <<"aaabbbccc">> stopped due to keepalive expired
09:28:31.254 [error] got unknown sync event in wait_for_offline state {cleanup,session_taken_over}
09:28:31.277 [error] got unknown sync event in wait_for_offline state {cleanup,session_taken_over}
09:28:31.291 [error] gen_server plumtree_metadata_manager terminated with reason: no case clause matching [] in plumtree_metadata_manager:store/2 line 472
09:28:31.292 [error] CRASH REPORT Process plumtree_metadata_manager with 0 neighbours exited with reason: no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in gen_server:terminate/7 line 812
09:28:31.292 [error] gen_server plumtree_broadcast terminated with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...}
09:28:31.293 [error] CRASH REPORT Process plumtree_broadcast with 0 neighbours exited with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...} in gen_server2:terminate/3 line 995
09:28:31.293 [error] Supervisor plumtree_sup had child plumtree_metadata_manager started with plumtree_metadata_manager:start_link() at <0.225.0> exit with reason no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in context child_terminated
09:28:31.294 [error] Supervisor plumtree_sup had child plumtree_broadcast started with plumtree_broadcast:start_link() at <0.224.0> exit with reason {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...} in context child_terminated
09:28:39.391 [info] completed metadata exchange with 'VerneMQ@vmq-2.local'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 3 keys
09:28:41.400 [info] completed metadata exchange with 'VerneMQ@vmq-2.local'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 3 keys
09:28:49.393 [info] completed metadata exchange with 'VerneMQ@vmq-2.local'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 5 keys
09:28:51.313 [info] completed metadata exchange with 'VerneMQ@vmq-0.local'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 2 keys
09:28:59.383 [info] completed metadata exchange with 'VerneMQ@vmq-2.local'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 1 keys
09:29:01.313 [info] completed metadata exchange with 'VerneMQ@vmq-2.local'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 1 keys
We are beginning to correlate the reconnections with the VerneMQ restarts.
Hi, we have also reproduced the problem in a test environment similar to the production one. We developed a test client that makes connections one after another, without stopping and without closing the previous connection. The test caused 2 pods to restart within a few minutes. I hope this helps your analysis.
Hi @tinto82 thanks. I can easily reproduce the migration warnings, even for a single node broker. It's more challenging to trigger the plumtree errors. But we're working on it.
I've been able to reproduce the migration warnings as well, but not yet the plumtree crash. In your test scenario where you made those pods restart, was that due to memory issues, or did they also exhibit the plumtree crash in the logs?
Hi @larshesel, we also see the plumtree crash in our test environment. I can add some new information we found in these days. Sometimes a client misbehaves: it connects to VerneMQ and subscribes, but also spawns another thread and tries to make a second connection; we are still trying to understand this strange behavior. We have also seen this problem: when the client makes a new connection we would expect the previous one to be closed and the new one accepted. Instead, the previous connection is not closed and the new one is rejected, so the client enters a loop trying to create the new connection and making too many calls. We observed this with Wireshark. If it is useful for understanding the problem, we can share the Wireshark captures with you privately.
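For reference, MQTT 3.1.1 (section 3.1.4) specifies the opposite of the behavior described above: a CONNECT that reuses a live client id must disconnect the existing connection and accept the new one. A minimal sketch of the expected takeover semantics, with illustrative data shapes (this is not VerneMQ internals):

```python
def takeover(sessions: dict, dropped: list, client_id: str, new_conn: int) -> None:
    """Expected MQTT 3.1.1 session takeover: the OLD connection for a
    reused client id is disconnected and the NEW connection wins.

    The behavior observed above (old kept alive, new rejected) inverts
    this, trapping the client in a reconnect loop.
    """
    old = sessions.get(client_id)
    if old is not None:
        dropped.append(old)        # old connection is closed by the broker
    sessions[client_id] = new_conn # new connection takes over the session
```

With correct takeover, a client retrying with the same id converges to one live connection instead of piling up attempts.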
Hi @larshesel,
has the latest information we provided been helpful?
Hi @tinto82, I've been away a while - I hope to take a look again at this soon.
Hi @larshesel, can I do something else to help you understand the problem?
The problem is the plumtree crash you see above. We've added some additional logging for this case in the newly released VerneMQ 1.6.0 - would you be able to test with that release? It will log errors containing the line "Known issue, please report here https://github.com/erlio/vernemq/issues/715", plus some hopefully detailed enough information to help us get to the bottom of this.
Also, the change in plumtree means that VerneMQ will try to repair the data causing this (after logging it), so it will just log the message and VerneMQ should continue working.
Closing this due to inactivity - feel free to reopen or open a new issue if the problem persists with VerneMQ 1.9.0.
Hello, our environment consists of at least 3 nodes (max 5) running on Kubernetes on Google Cloud Platform. Until recently we were running a version of VerneMQ we compiled ourselves, as a cluster of 3 pods; this build was based on VerneMQ 1.3.0 with a modified webhooks pool size. Pod resource limits were 1 CPU and 3 GB RAM. Since July we have had these problems: sometimes a pod, despite running, correctly accepting subscriptions and receiving messages, didn't forward the received messages to subscribers. We used to unblock this situation by restarting the stuck pod. To try to solve the problem we upgraded VerneMQ, so we are now running the official Docker image "erlio/docker-vernemq:1.5.0". The new pods have the same resource limits, but we added an HAProxy in front of the VerneMQ container. This is the current vernemq.conf:
We added a check script to every VerneMQ pod: it subscribes on the specific pod, sends a message, and waits for the reply. This script is used as the Kubernetes liveness probe, so when it fails 5 consecutive times the pod is automatically restarted. With the new version this check still sometimes fails. In the logs we found this stack trace:
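The round-trip liveness check described above can be sketched as follows. The transport is injected so the logic is broker-agnostic; in the real probe, `publish` and `wait_for_message` would wrap an MQTT client (e.g. paho-mqtt), and the `health/echo` topic name is purely illustrative:

```python
import time

def liveness_check(publish, wait_for_message, timeout_s: float = 5.0) -> bool:
    """Publish a unique probe message and wait for it to come back.

    publish(topic, payload) sends the probe on a topic the checker is
    subscribed to; wait_for_message(timeout_s) returns the next received
    payload or None. Returns True only if the probe completes the round
    trip before the deadline; Kubernetes restarts the pod after 5
    consecutive failures.
    """
    probe = f"probe-{time.time_ns()}"   # unique payload per check
    publish("health/echo", probe)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if wait_for_message(timeout_s=0.1) == probe:
            return True
    return False
```

Injecting the transport also makes the probe logic easy to exercise with a loopback fake, without a running broker.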
Can you help us? Thank you very much, Mauro