Closed tinto82 closed 5 years ago
@tinto82 thanks for reporting! So now we seem to have a second data point for https://github.com/erlio/vernemq/issues/715
I understand this happens completely randomly? (That is, you don't have any observation of how and when it happens, or what context could make it more likely, etc.)
@dergraf more ideas how to address this? (I marked it as a bug in #715)
Hi ioolkos,
yes, it happens randomly. Is there something I can do to help you understand the problem? Some data to retrieve or some test to run?
Thank you, Mauro
@tinto82 thanks for offering your help! Will let you know as soon as we have some more hypotheses/ideas.
@ioolkos ok, i'll wait for news. Thank you
Let me add more information to give better context. In 6 months of operation before July we didn't have any problems. The number of subscribers is growing slowly and steadily: we now gain roughly 400-450 new subscribers every month, and in July the total was around 2000. From our experience the problem seems to occur after some days of uptime. The liveness check we added with the last deploy is used to restart a pod when it gets stuck. We also have a test environment very similar to the production one (where we see the problem); in that environment the nodes are preemptible, so they restart within 24 hours. In the test environment we tried 28000 subscribers and publishing between 500 and 2000 messages per second, and similar problems did not occur.
Hi, in the last 14 days our 3 VerneMQ pods restarted 15 times, with pod lifetimes between 9 hours and almost 12 days. Hope that can help you.
@tinto82 Could be related, but the listener.vmq.default should be configured with a specific IP:PORT that is reachable by the other pods instead of using 0.0.0.0.
Can you exclude that the container went out of memory and terminated for that reason?
@dergraf on startup the configuration is updated and the 0.0.0.0 IP is replaced with the pod's own. I can rule out a memory problem: e.g. yesterday a pod restarted while its used memory was around 400 MB, and the pod memory limit is 3 GB.
The log error is:
14:55:56.244 [error] Supervisor plumtree_sup had child plumtree_metadata_manager started with plumtree_metadata_manager:start_link() at <0.225.0> exit with reason no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in context child_terminated
14:55:56.244 [error] Supervisor plumtree_sup had child plumtree_metadata_manager started with plumtree_metadata_manager:start_link() at <0.225.0> exit with reason no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in context child_terminated
14:55:56.244 [error] CRASH REPORT Process plumtree_broadcast with 0 neighbours exited with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...} in gen_server2:terminate/3 line 995
14:55:56.244 [error] gen_server plumtree_broadcast terminated with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...}
14:55:56.243 [error] CRASH REPORT Process plumtree_metadata_manager with 0 neighbours exited with reason: no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in gen_server:terminate/7 line 812
14:55:56.243 [error] gen_server plumtree_metadata_manager terminated with reason: no case clause matching [] in plumtree_metadata_manager:store/2 line 472
14:55:56.239 [error] got unknown sync event in wait_for_offline state {cleanup,session_taken_over}
@dergraf in addition to the 0.0.0.0 IP, do you notice anything else strange in the configuration? Could subscriber/publisher behavior be a cause of the problem? Some devices connect and disconnect very often: could that affect the stability of the application?
Well, it depends: do your subscribers/publishers always use a random client id when connecting? If they do, this could certainly become an issue.
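To illustrate the point about random client ids: deriving a deterministic id per device lets the broker take over the old session on reconnect instead of accumulating abandoned ones. A minimal sketch, assuming a per-device serial number (the `stable_client_id` helper and the `dev-` prefix are illustrative, not part of VerneMQ):

```python
import hashlib

def stable_client_id(device_serial: str, channel: int = 1) -> str:
    """Derive the same MQTT client id for a device on every reconnect.

    A fresh random id on every connect leaves a trail of orphaned
    sessions on the broker; a stable id lets the broker apply normal
    session takeover when the device reconnects.
    """
    digest = hashlib.sha1(device_serial.encode("utf-8")).hexdigest()[:12]
    return f"dev-{digest}-{channel}"
```

The same serial always yields the same id, while different devices (or channels) never collide in practice.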
Hi @tinto82, do the clients you have use clean_session=false or clean_session=true, or a mix? Do you use allow_multiple_session=on?
Due to the 14:55:56.244 [error] gen_server plumtree_broadcast terminated with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},
this looks like the same issue as erlio/vernemq#715 and erlio/vernemq#547, so I'm trying to figure out what those three issues have in common.
So far the only common denominator is you run docker, but it's hard to understand why that would play a role...
Hi @dergraf, every device has a specific client id.
Hi @larshesel, clients use clean_session=true and we don't specify allow_multiple_session in our configuration; the default value is off, right? Thank you
Yes, that is correct - it defaults to off - so my suspicion that allow_multiple_sessions had something to do with it was wrong :) Need to form another hypothesis.
Hi everybody, we noticed that sometimes some clients misbehave in a way we are still investigating: they repeatedly connect every second with the same client_id, without subscribing to any topic. Alongside these repeated connections we find this warning in the logs:
08:55:07.113 [warning] can't migrate the queue[<0.4193.61>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/eee/ffffffff/hhh/k">>} due to no responsible remote node found
which I have reported in issue https://github.com/erlio/vernemq/issues/862. Can this be related to the problem of the stuck pods?
Thank you
Hi, today a client started making connections at a higher rate than in the cases seen so far (roughly 12 connects/second). Coinciding with the start of these connections, the VerneMQ restarts began. The logs are below:
09:25:17.939 [warning] can't migrate the queue[<0.27547.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:25:24.371 [warning] can't migrate the queue[<0.27793.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:25:27.845 [warning] can't migrate the queue[<0.27958.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:25:33.914 [warning] can't migrate the queue[<0.28176.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:25:37.726 [warning] can't migrate the queue[<0.28283.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:26:01.844 [warning] can't migrate the queue[<0.29285.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:01.133 [warning] can't migrate the queue[<0.31830.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:02.441 [warning] can't migrate the queue[<0.31887.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:09.074 [warning] can't migrate the queue[<0.32172.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:13.133 [warning] can't migrate the queue[<0.32374.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:19.119 [warning] can't migrate the queue[<0.32613.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:20.325 [warning] can't migrate the queue[<0.32659.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:21.590 [warning] can't migrate the queue[<0.32714.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:21.591 [warning] can't migrate the queue[<0.32714.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:21.592 [warning] can't migrate the queue[<0.32714.2>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:24.121 [warning] can't migrate the queue[<0.54.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:40.013 [warning] can't migrate the queue[<0.734.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:53.927 [warning] can't migrate the queue[<0.1330.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:27:56.353 [warning] can't migrate the queue[<0.1449.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:28:09.609 [warning] can't migrate the queue[<0.2047.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
...
09:28:09.609 [warning] can't migrate the queue[<0.2047.3>] for subscriber {[],<<"aaaa/bbbbbbbbbbbb/ccc/DD000001/eee/1">>} due to no responsible remote node found
09:28:19.384 [info] completed metadata exchange with 'VerneMQ@vmq-0.local'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 4 keys
09:28:29.381 [info] completed metadata exchange with 'VerneMQ@vmq-2.local'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 3 keys
09:28:29.508 [warning] client {[],<<"aaaa/rrrrrrrrrrrr/ccc/DD000015/eee/1">>} with username <<"aaabbbccc">> stopped due to keepalive expired
09:28:31.254 [error] got unknown sync event in wait_for_offline state {cleanup,session_taken_over}
09:28:31.277 [error] got unknown sync event in wait_for_offline state {cleanup,session_taken_over}
09:28:31.291 [error] gen_server plumtree_metadata_manager terminated with reason: no case clause matching [] in plumtree_metadata_manager:store/2 line 472
09:28:31.292 [error] CRASH REPORT Process plumtree_metadata_manager with 0 neighbours exited with reason: no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in gen_server:terminate/7 line 812
09:28:31.292 [error] gen_server plumtree_broadcast terminated with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...}
09:28:31.293 [error] CRASH REPORT Process plumtree_broadcast with 0 neighbours exited with reason: {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...} in gen_server2:terminate/3 line 995
09:28:31.293 [error] Supervisor plumtree_sup had child plumtree_metadata_manager started with plumtree_metadata_manager:start_link() at <0.225.0> exit with reason no case clause matching [] in plumtree_metadata_manager:store/2 line 472 in context child_terminated
09:28:31.294 [error] Supervisor plumtree_sup had child plumtree_broadcast started with plumtree_broadcast:start_link() at <0.224.0> exit with reason {{{case_clause,[]},[{plumtree_metadata_manager,store,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,472}]},{plumtree_metadata_manager,read_merge_write,2,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,457}]},{plumtree_metadata_manager,handle_call,3,[{file,"/opt/vernemq/distdir/1.5.0/_build/default/lib/plumtree/src/plumtree_metadata_manager.erl"},{line,316}]},{gen_server,try_handle_call,...},...]},...} in context child_terminated
09:28:39.391 [info] completed metadata exchange with 'VerneMQ@vmq-2.local'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 3 keys
09:28:41.400 [info] completed metadata exchange with 'VerneMQ@vmq-2.local'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 3 keys
09:28:49.393 [info] completed metadata exchange with 'VerneMQ@vmq-2.local'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 5 keys
09:28:51.313 [info] completed metadata exchange with 'VerneMQ@vmq-0.local'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 2 keys
09:28:59.383 [info] completed metadata exchange with 'VerneMQ@vmq-2.local'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 1 keys
09:29:01.313 [info] completed metadata exchange with 'VerneMQ@vmq-2.local'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 1 keys
We are beginning to correlate the reconnections with the VerneMQ restarts.
Hi, we have also reproduced the problem in a test environment similar to the production one. We developed a test client that makes connections one after another, without stopping and without closing the previous connection. The test caused 2 pods to restart within a few minutes. I hope this helps your analysis.
Hi @tinto82 thanks. I can easily reproduce the migration warnings, even for a single node broker. It's more challenging to trigger the plumtree errors. But we're working on it.
I've been able to reproduce the migration warnings as well, but not yet the plumtree crash. In your test scenario where you made those pods restart, was that due to memory issues, or did they also exhibit the plumtree crash in the logs?
Hi @larshesel, we also see the plumtree crash in our test environment. I can add some new information we found in these days. Sometimes a client misbehaves: it connects to VerneMQ and subscribes, but also spawns another thread and tries to make a second connection; we are still trying to understand this strange behavior. We have also seen this problem: when the client makes a new connection we would expect the previous one to be closed and the new one accepted. Instead, the previous connection is not closed and the new one is rejected, so the client enters a loop trying to create the new connection and making too many calls. We observed this with Wireshark. If it is useful for understanding the problem, we can share the Wireshark captures with you privately.
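For reference, MQTT 3.1.1 (section 3.1.4) specifies the opposite of the behavior described above: a CONNECT that reuses a live client id must disconnect the existing connection and accept the new one. A minimal sketch of the expected takeover semantics, with illustrative data shapes (this is not VerneMQ internals):

```python
def takeover(sessions: dict, dropped: list, client_id: str, new_conn: int) -> None:
    """Expected MQTT 3.1.1 session takeover: the OLD connection for a
    reused client id is disconnected and the NEW connection wins.

    The behavior observed above (old kept alive, new rejected) inverts
    this, trapping the client in a reconnect loop.
    """
    old = sessions.get(client_id)
    if old is not None:
        dropped.append(old)        # old connection is closed by the broker
    sessions[client_id] = new_conn # new connection takes over the session
```

With correct takeover, a client retrying with the same id converges to one live connection instead of piling up attempts.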
Hi @larshesel,
has the latest information we provided been helpful?
Hi @tinto82, I've been away a while - I hope to take a look again at this soon.
Hi @larshesel, can I do something else to help you understand the problem?
The problem is the plumtree crash you see above. We've added some additional logging for this case in the newly released VerneMQ 1.6.0 - would you be able to test with that release? It will log errors containing the line "Known issue, please report here https://github.com/erlio/vernemq/issues/715", plus some hopefully detailed enough information to help us get to the bottom of this.
Also, the change in plumtree means that VerneMQ will try to repair the data causing this (after logging it), so it will just log the message and VerneMQ should continue working.
Closing this due to inactivity - feel free to reopen or open a new issue if the problem persists with VerneMQ 1.9.0.
Hello, our environment consists of at least 3 nodes (max 5) running on Kubernetes on Google Cloud Platform. Until recently we were running a version of VerneMQ we compiled ourselves, as a cluster of 3 pods; this build was based on VerneMQ 1.3.0 with a modified webhooks pool size. Pod resource limits were 1 CPU and 3 GB RAM. Since July we have had these problems: sometimes a pod, despite running, correctly accepting subscriptions and receiving messages, didn't forward the received messages to subscribers. We used to unblock this situation by restarting the stuck pod. To try to solve the problem we upgraded VerneMQ, so we are now running the official Docker image "erlio/docker-vernemq:1.5.0". The new pods have the same resource limits, but we added an HAProxy in front of the VerneMQ container. This is the current vernemq.conf:
We added a check script to every VerneMQ pod: it subscribes on the specific pod, sends a message, and waits for the reply. This script is used as the Kubernetes liveness probe, so when it fails 5 consecutive times the pod is automatically restarted. With the new version this check still sometimes fails. In the logs we found this stack trace:
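The round-trip liveness check described above can be sketched as follows. The transport is injected so the logic is broker-agnostic; in the real probe, `publish` and `wait_for_message` would wrap an MQTT client (e.g. paho-mqtt), and the `health/echo` topic name is purely illustrative:

```python
import time

def liveness_check(publish, wait_for_message, timeout_s: float = 5.0) -> bool:
    """Publish a unique probe message and wait for it to come back.

    publish(topic, payload) sends the probe on a topic the checker is
    subscribed to; wait_for_message(timeout_s) returns the next received
    payload or None. Returns True only if the probe completes the round
    trip before the deadline; Kubernetes restarts the pod after 5
    consecutive failures.
    """
    probe = f"probe-{time.time_ns()}"   # unique payload per check
    publish("health/echo", probe)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if wait_for_message(timeout_s=0.1) == probe:
            return True
    return False
```

Injecting the transport also makes the probe logic easy to exercise with a loopback fake, without a running broker.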
Can you help us? Thank you very much, Mauro