vernemq / docker-vernemq

VerneMQ Docker image - Starts the VerneMQ MQTT broker and listens on 1883 and 8080 (for websockets).
https://vernemq.com
Apache License 2.0

VerneMQ cluster not working in IPv6-only environment on Kubernetes #350

Open avinakollu opened 1 year ago

avinakollu commented 1 year ago

Hi,

We have two environments where we are trying to deploy VerneMQ using the Helm chart. One is dual-stack and the other is IPv6-only.

The dual-stack environment works fine with the latest chart version. The problem is the IPv6-only one.

Firstly, vmq-admin does not work, and I see the same issue with vernemq ping.

~ $ vmq-admin
Node 'VerneMQ@vernemq-0.vernemq-headless.messaging.svc.cluster.local' not responding to pings.
~ $ vernemq ping
Node 'VerneMQ@vernemq-0.vernemq-headless.messaging.svc.cluster.local' not responding to pings.
~ $ netstat -tunlup
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:4369            0.0.0.0:*               LISTEN      191/epmd
tcp        0      0 127.0.0.1:1883          0.0.0.0:*               LISTEN      131/beam.smp
tcp        0      0 :::9100                 :::*                    LISTEN      131/beam.smp
tcp        0      0 2600:1f14:22b:5502:dd4a::1:8080 :::*                    LISTEN      131/beam.smp
tcp        0      0 :::4369                 :::*                    LISTEN      191/epmd
tcp        0      0 2600:1f14:22b:5502:dd4a::1:44053 :::*                    LISTEN      131/beam.smp
tcp        0      0 ::1:8888                :::*                    LISTEN      131/beam.smp
tcp        0      0 2600:1f14:22b:5502:dd4a::1:8888 :::*                    LISTEN      131/beam.smp
tcp        0      0 ::1:1883                :::*                    LISTEN      131/beam.smp
tcp        0      0 2600:1f14:22b:5502:dd4a::1:1883 :::*                    LISTEN      131/beam.smp

Here is the same on the dual stack cluster:

~ $ vmq-admin
Usage: vmq-admin <sub-command>

  Administrate the cluster.

  Sub-commands:
    node        Manage this node
    cluster     Manage this node's cluster membership
    session     Retrieve session information
    retain      Show and filter MQTT retained messages
    plugin      Manage plugin system
    listener    Manage listener interfaces
    metrics     Retrieve System Metrics
    api-key     Manage API keys for the HTTP management interface
    trace       Trace various aspects of VerneMQ
  Use --help after a sub-command for more details.

~ $ vernemq ping
pong
~ $ netstat -tunlup
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.1:8888          0.0.0.0:*               LISTEN      136/beam.smp
tcp        0      0 7.7.25.193:8888         0.0.0.0:*               LISTEN      136/beam.smp
tcp        0      0 127.0.0.1:1883          0.0.0.0:*               LISTEN      136/beam.smp
tcp        0      0 7.7.25.193:1883         0.0.0.0:*               LISTEN      136/beam.smp
tcp        0      0 0.0.0.0:9100            0.0.0.0:*               LISTEN      136/beam.smp
tcp        0      0 7.7.25.193:8080         0.0.0.0:*               LISTEN      136/beam.smp
tcp        0      0 0.0.0.0:4369            0.0.0.0:*               LISTEN      194/epmd
tcp        0      0 7.7.25.193:44053        0.0.0.0:*               LISTEN      136/beam.smp
tcp        0      0 :::4369                 :::*                    LISTEN      194/epmd

I feel I'm missing some listener configuration for the IPv6 setup, but I have exhausted my options and tried most of the things I could find. Please help me figure out what I might be missing.

I can provide more logs/configs if required.

ioolkos commented 1 year ago

@avinakollu

There's an open issue for this (https://github.com/vernemq/vernemq/issues/1664): the vmq-admin scripts do not connect when cluster communication is configured to use IPv6, i.e. when -proto_dist inet6_tcp is enabled in vm.args so that the Erlang cluster communication runs over IPv6.
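
For illustration, a minimal vm.args sketch showing where that flag goes (the node name is taken from this thread; the cookie value is only a placeholder, use whatever your deployment sets):

```
## Node name (example from this thread)
-name VerneMQ@vernemq-0.vernemq-headless.messaging.svc.cluster.local

## Distributed cookie (placeholder value)
-setcookie vmq

## Make the Erlang distribution (cluster communication) use IPv6
-proto_dist inet6_tcp
```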

I'm not sure here whether your nodes still cluster. Can you access the status page on port 8888 for one of the nodes, and check whether it shows a full cluster?
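
If it helps, one way to reach the status page from outside the pod (a sketch; adjust namespace and pod name to your deployment):

```
# Forward the node's HTTP listener to your machine
kubectl port-forward -n messaging vernemq-0 8888:8888

# Then open http://localhost:8888/status in a browser.
# It should list all cluster nodes and whether they are connected.
```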

We need to find a way to fix the script issue.


:point_right: Thank you for supporting VerneMQ: https://github.com/sponsors/vernemq :point_right: Using the binary VerneMQ packages commercially (.deb/.rpm/Docker) requires a paid subscription.

avinakollu commented 1 year ago

@ioolkos Thanks for your response.

So, I have two scenarios:

  1. When I enable -proto_dist inet6_tcp in vm.args, only vernemq-0 comes up. The second one fails with the following trace:

    [rocky@ip-10-220-150-249 ~]$ kubectl logs vernemq-1 -n messaging
    Permissions ok: Our pod vernemq-1 belongs to StatefulSet vernemq with 2 replicas
    Will join an existing Kubernetes cluster with discovery node at vernemq-0.vernemq-headless.messaging.svc.cluster.local
    Did I previously leave the cluster? If so, purging old state.
    Cluster doesn't know about me, this means I've left previously. Purging old state...
    Password: Reenter password:
    config is OK
    -config /vernemq/data/generated.configs/app.2023.02.27.23.56.21.config -args_file /vernemq/bin/../etc/vm.args -vm_args /vernemq/bin/../etc/vm.args
    Exec: /vernemq/bin/../erts-12.3.2.5/bin/erlexec -boot /vernemq/bin/../releases/1.12.6.2/vernemq -config /vernemq/data/generated.configs/app.2023.02.27.23.56.21.config -args_file /vernemq/bin/../etc/vm.args -vm_args /vernemq/bin/../etc/vm.args -pa /vernemq/bin/../lib/erlio-patches -- console -noshell -noinput
    Root: /vernemq/bin/..
    Protocol 'inet6_tcp-eval': not supported
    Protocol 'vmq_server_cmd:node_join('VerneMQ@vernemq-0.vernemq-headless.messaging.svc.cluster.local')': not supported

  2. Without the flag, though, all 3 replicas do come up, but I do not see the nodes on the cluster status page. Here is the status page:

    (Screenshot: cluster status page, 2023-02-27, 3:46 PM)

ioolkos commented 1 year ago

The reason is that the node tries to cluster automatically, and vmq_server_cmd:node_join/1 is a wrapper for a vmq-admin call, which brings us back to the mentioned incompatibility: https://github.com/vernemq/vernemq/blob/5c14718469cc861241caa2b920ef5bca25283d71/apps/vmq_server/src/vmq_server_cmd.erl#L28
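
For context, the automatic join performed by the Docker start script is roughly equivalent to the manual command below (same vmq-admin/nodetool code path, so it runs into the same IPv6 limitation); the node name is the one from this thread:

```
# Manual equivalent of the automatic cluster join
vmq-admin cluster join discovery-node=VerneMQ@vernemq-0.vernemq-headless.messaging.svc.cluster.local

# Inspect the resulting cluster view
vmq-admin cluster show
```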

I don't see what's wrong with scenario 2. Any logs from the nodes?



ioolkos commented 1 year ago

Note to self: find a way to inject -proto_dist inet6_tcp into the noderunner escript. Maybe we need a second, IPv6-enabled version of the script and then have vmq-admin choose via a flag.

EDIT: we can add %%! -proto_dist inet6_tcp as the second line to noderunner.
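
For illustration, an escript picks up extra emulator flags from a %%! line near the top of the file, so the change would look roughly like this (a sketch only, not the actual noderunner contents):

```
#!/usr/bin/env escript
%%! -proto_dist inet6_tcp
%% The %%! line passes emulator flags to the escript runtime, so the
%% script's own distribution connection to the VerneMQ node uses IPv6.

main(Args) ->
    %% ... the existing noderunner/nodetool logic would follow here ...
    io:format("args: ~p~n", [Args]).
```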



avinakollu commented 1 year ago

Yeah, I do have logs:

Node 1:

$ kubectl logs vernemq-0 -n messaging
Permissions ok: Our pod vernemq-0 belongs to StatefulSet vernemq with 1 replicas
Password: Reenter password:
config is OK
-config /vernemq/data/generated.configs/app.2023.03.01.00.02.10.config -args_file /vernemq/bin/../etc/vm.args -vm_args /vernemq/bin/../etc/vm.args
Exec: /vernemq/bin/../erts-12.3.2.5/bin/erlexec -boot /vernemq/bin/../releases/1.12.6.2/vernemq -config /vernemq/data/generated.configs/app.2023.03.01.00.02.10.config -args_file /vernemq/bin/../etc/vm.args -vm_args /vernemq/bin/../etc/vm.args -pa /vernemq/bin/../lib/erlio-patches -- console -noshell -noinput
Root: /vernemq/bin/..
00:02:12.624 [info] alarm_handler: {set,{system_memory_high_watermark,[]}}
00:02:12.728 [info] writing (updated) old actor <<217,63,70,135,63,206,115,43,49,140,165,14,237,32,235,220,239,75,136,229>> to disk
00:02:12.736 [info] writing state {[{[{actor,<<217,63,70,135,63,206,115,43,49,140,165,14,237,32,235,220,239,75,136,229>>}],1}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[['VerneMQ@vernemq-0.vernemq-headless.messaging.svc.cluster.local',{[{actor,<<217,63,70,135,63,206,115,43,49,140,165,14,237,32,235,220,239,75,136,229>>}],1}]],[],[],[]}}},{dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}} to disk <<75,2,131,80,0,0,1,36,120,1,203,96,206,97,96,96,96,204,96,130,82,41,12,172,137,201,37,249,69,185,64,81,145,155,246,110,237,246,231,138,181,13,123,150,242,189,85,120,125,231,189,119,199,211,172,68,198,172,12,206,20,6,150,148,204,228,146,68,198,68,1,32,228,72,12,72,52,200,16,200,66,3,25,140,168,98,96,43,64,4,83,10,131,93,88,106,81,94,170,111,160,67,25,136,206,45,212,53,208,131,177,50,82,19,83,114,82,139,139,245,114,129,68,98,122,102,94,186,94,113,89,178,94,114,78,105,113,73,106,145,94,78,126,114,98,14,105,238,5,185,11,225,102,6,82,220,12,210,10,0,163,254,97,243>>
00:02:12.757 [info] Opening LevelDB SWC database at "./data/swc_meta/meta1"
00:02:12.781 [info] Opening LevelDB SWC database at "./data/swc_meta/meta2"
00:02:12.791 [info] Opening LevelDB SWC database at "./data/swc_meta/meta3"
00:02:12.800 [info] Opening LevelDB SWC database at "./data/swc_meta/meta4"
00:02:12.810 [info] Opening LevelDB SWC database at "./data/swc_meta/meta5"
00:02:12.819 [info] Opening LevelDB SWC database at "./data/swc_meta/meta6"
00:02:12.828 [info] Opening LevelDB SWC database at "./data/swc_meta/meta7"
00:02:12.837 [info] Opening LevelDB SWC database at "./data/swc_meta/meta8"
00:02:12.847 [info] Opening LevelDB SWC database at "./data/swc_meta/meta9"
00:02:12.858 [info] Opening LevelDB SWC database at "./data/swc_meta/meta10"
00:02:12.910 [info] Try to start vmq_swc: ok
00:02:12.956 [info] Opening LevelDB database at "./data/msgstore/1"
00:02:12.971 [info] Opening LevelDB database at "./data/msgstore/2"
00:02:12.985 [info] Opening LevelDB database at "./data/msgstore/3"
00:02:12.994 [info] Opening LevelDB database at "./data/msgstore/4"
00:02:13.001 [info] Opening LevelDB database at "./data/msgstore/5"
00:02:13.010 [info] Opening LevelDB database at "./data/msgstore/6"
00:02:13.019 [info] Opening LevelDB database at "./data/msgstore/7"
00:02:13.028 [info] Opening LevelDB database at "./data/msgstore/8"
00:02:13.036 [info] Opening LevelDB database at "./data/msgstore/9"
00:02:13.044 [info] Opening LevelDB database at "./data/msgstore/10"
00:02:13.053 [info] Opening LevelDB database at "./data/msgstore/11"
00:02:13.062 [info] Opening LevelDB database at "./data/msgstore/12"
00:02:13.122 [info] Try to start vmq_generic_msg_store: ok
00:02:13.230 [info] loaded 0 subscriptions into vmq_reg_trie
00:02:13.249 [info] cluster event handler 'vmq_cluster' registered

Node 2:

$ kubectl logs vernemq-1 -n messaging
Permissions ok: Our pod vernemq-1 belongs to StatefulSet vernemq with 2 replicas
Will join an existing Kubernetes cluster with discovery node at vernemq-0.vernemq-headless.messaging.svc.cluster.local
Did I previously leave the cluster? If so, purging old state.
Cluster doesn't know about me, this means I've left previously. Purging old state...
Password: Reenter password:
config is OK
-config /vernemq/data/generated.configs/app.2023.03.01.00.04.01.config -args_file /vernemq/bin/../etc/vm.args -vm_args /vernemq/bin/../etc/vm.args
Exec: /vernemq/bin/../erts-12.3.2.5/bin/erlexec -boot /vernemq/bin/../releases/1.12.6.2/vernemq -config /vernemq/data/generated.configs/app.2023.03.01.00.04.01.config -args_file /vernemq/bin/../etc/vm.args -vm_args /vernemq/bin/../etc/vm.args -pa /vernemq/bin/../lib/erlio-patches -- console -noshell -noinput
Root: /vernemq/bin/..
00:04:03.212 [info] alarm_handler: {set,{system_memory_high_watermark,[]}}
00:04:03.314 [info] writing (updated) old actor <<165,158,8,12,24,41,0,246,32,145,173,99,202,109,217,6,192,216,199,63>> to disk
00:04:03.322 [info] writing state {[{[{actor,<<165,158,8,12,24,41,0,246,32,145,173,99,202,109,217,6,192,216,199,63>>}],1}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[['VerneMQ@vernemq-1.vernemq-headless.messaging.svc.cluster.local',{[{actor,<<165,158,8,12,24,41,0,246,32,145,173,99,202,109,217,6,192,216,199,63>>}],1}]],[],[],[]}}},{dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}} to disk <<75,2,131,80,0,0,1,36,120,1,203,96,206,97,96,96,96,204,96,130,82,41,12,172,137,201,37,249,69,185,64,81,145,165,243,56,120,36,52,25,190,41,76,92,155,124,42,247,38,219,129,27,199,237,179,18,25,179,50,56,83,24,88,82,50,147,75,18,25,19,5,128,144,35,49,32,209,32,67,32,11,13,100,48,162,138,129,173,0,17,76,41,12,118,97,169,69,121,169,190,129,14,101,32,58,183,80,215,80,15,198,202,72,77,76,201,73,45,46,214,203,5,18,137,233,153,121,233,122,197,101,201,122,201,57,165,197,37,169,69,122,57,249,201,137,57,164,185,23,228,46,132,155,25,72,113,51,72,43,0,185,179,95,4>>
00:04:03.346 [info] Opening LevelDB SWC database at "./data/swc_meta/meta1"
00:04:03.368 [info] Opening LevelDB SWC database at "./data/swc_meta/meta2"
00:04:03.377 [info] Opening LevelDB SWC database at "./data/swc_meta/meta3"
00:04:03.386 [info] Opening LevelDB SWC database at "./data/swc_meta/meta4"
00:04:03.395 [info] Opening LevelDB SWC database at "./data/swc_meta/meta5"
00:04:03.404 [info] Opening LevelDB SWC database at "./data/swc_meta/meta6"
00:04:03.417 [info] Opening LevelDB SWC database at "./data/swc_meta/meta7"
00:04:03.425 [info] Opening LevelDB SWC database at "./data/swc_meta/meta8"
00:04:03.434 [info] Opening LevelDB SWC database at "./data/swc_meta/meta9"
00:04:03.444 [info] Opening LevelDB SWC database at "./data/swc_meta/meta10"
00:04:03.493 [info] Try to start vmq_swc: ok
00:04:03.530 [info] Opening LevelDB database at "./data/msgstore/1"
00:04:03.539 [info] Opening LevelDB database at "./data/msgstore/2"
00:04:03.547 [info] Opening LevelDB database at "./data/msgstore/3"
00:04:03.556 [info] Opening LevelDB database at "./data/msgstore/4"
00:04:03.564 [info] Opening LevelDB database at "./data/msgstore/5"
00:04:03.572 [info] Opening LevelDB database at "./data/msgstore/6"
00:04:03.581 [info] Opening LevelDB database at "./data/msgstore/7"
00:04:03.589 [info] Opening LevelDB database at "./data/msgstore/8"
00:04:03.598 [info] Opening LevelDB database at "./data/msgstore/9"
00:04:03.608 [info] Opening LevelDB database at "./data/msgstore/10"
00:04:03.616 [info] Opening LevelDB database at "./data/msgstore/11"
00:04:03.624 [info] Opening LevelDB database at "./data/msgstore/12"
00:04:03.658 [info] Try to start vmq_generic_msg_store: ok
00:04:03.763 [info] loaded 0 subscriptions into vmq_reg_trie
00:04:03.771 [info] cluster event handler 'vmq_cluster' registered
00:04:04.610 [info] Sent join request to: 'VerneMQ@vernemq-0.vernemq-headless.messaging.svc.cluster.local'
00:04:04.615 [info] Unable to connect to 'VerneMQ@vernemq-0.vernemq-headless.messaging.svc.cluster.local'

ioolkos commented 1 year ago

00:04:04.615 [info] Unable to connect to 'VerneMQ@vernemq-0.vernemq-headless.messaging.svc.cluster.local'

There is no connectivity on the Erlang distribution level, judging from that log line: net_kernel:connect_node/1 fails. Whether this comes from one of the VerneMQ configs or from some Kubernetes configuration (maybe applied previously), I don't know.
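
For a manual check of distribution-level connectivity, something like the following sketch could work (the erl path, the probe node name, and the cookie value are assumptions; take the cookie from your vm.args):

```
# On the vernemq-1 pod: start a throwaway distributed Erlang shell that
# uses IPv6 distribution and the same cookie as the cluster
/vernemq/erts-12.3.2.5/bin/erl \
    -proto_dist inet6_tcp \
    -name probe@vernemq-1.vernemq-headless.messaging.svc.cluster.local \
    -setcookie <your-distributed-cookie>

%% In the resulting Erlang shell, attempt the same connection the join makes:
net_kernel:connect_node('VerneMQ@vernemq-0.vernemq-headless.messaging.svc.cluster.local').
%% true  -> epmd lookup and the distribution port are reachable over IPv6
%% false -> the connection fails at the Erlang distribution level
```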



avinakollu commented 1 year ago

@ioolkos Can you please tell me which endpoint this tries to connect to? I will verify whether the connection succeeds. As I said, this is an IPv6-only cluster where I had to change the listeners to get it to this point.

ioolkos commented 1 year ago

I see, then all listeners are IPv6 but the Erlang distribution is not enabled for IPv6.

For an idea on the ports involved (IPv4) see: https://docs.vernemq.com/vernemq-clustering/communication

But in any case, IPv6 really needs to be enabled in the noderunner script (see my remark above); otherwise the join command will not work.
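
For orientation, a sketch of the pieces involved in cluster communication, matching the ports visible in the netstat output above (the config keys exist in vernemq.conf/vm.args; the IPv6 bind value and the exact port range are assumptions for an IPv6-only pod):

```
# vernemq.conf: inter-node message transfer listener (port 44053 above)
listener.vmq.clustering = [::]:44053

# vernemq.conf: Erlang distribution port range (port 9100 above)
erlang.distribution.port_range.minimum = 9100
erlang.distribution.port_range.maximum = 9109

# vm.args: make the Erlang distribution itself (and the epmd lookups on
# port 4369) run over IPv6
-proto_dist inet6_tcp
```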



avinakollu commented 1 year ago

@ioolkos Does this mean that we will have to wait for https://github.com/vernemq/vernemq/issues/1664 to be fixed first?

Is there an expected release date for this so we can plan accordingly?

Thanks for your help once again

ioolkos commented 1 year ago

@avinakollu yes, that's the context. I count on having this fixed in the next release, but I have no ETA. Do you plan on supporting the VerneMQ project?



ioolkos commented 1 year ago

@avinakollu here's the PR that builds the nodetool/vmq-admin script dynamically to adapt to IPv4 or IPv6: https://github.com/vernemq/vernemq/pull/2134. Once a new release is out, we'll need to ensure this works in Docker too (it should, but you never know). For a normal build it works perfectly.

