vernemq / docker-vernemq

VerneMQ Docker image - Starts the VerneMQ MQTT broker and listens on 1883 and 8080 (for websockets).
https://vernemq.com
Apache License 2.0

Kubernetes healthcheck gives access denied #386

Closed pbwur closed 4 months ago

pbwur commented 6 months ago

Hi,

I'm using version 2.0.0 of VerneMQ with the Helm chart. Unfortunately the pod in Kubernetes remains unhealthy. The error message is:

```
Readiness probe failed: Get "http://10.244.76.200:8888/health": dial tcp 10.244.76.200:8888: connect: connection refused
```

From within the pod, using curl against http://localhost:8888/health, the response is as expected: `{"status":"OK"}`. It seems the IP address used by the probe is the problem.
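For context, the failing probe presumably looks roughly like this in the StatefulSet spec (path and port are taken from the error message above; the remaining fields are placeholder assumptions, not copied from the Helm chart):

```yaml
# Sketch of a readiness probe matching the reported failure.
# The kubelet issues this GET against the pod IP (e.g. 10.244.76.200),
# not against localhost inside the container.
readinessProbe:
  httpGet:
    path: /health
    port: 8888
  initialDelaySeconds: 5
  periodSeconds: 10
```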

Using version 2.0.0-rc1 works fine, so I'm looking for the difference between the two.

ioolkos commented 6 months ago

@pbwur Thanks. The change must be in PR #380, #382, #384 or #385 then. What does the VerneMQ log tell?

@ashtonian does this ring a bell to you, from the changes to add optional listeners?


👉 Thank you for supporting VerneMQ: https://github.com/sponsors/vernemq
👉 Using the binary VerneMQ packages commercially (.deb/.rpm/Docker) requires a paid subscription.

pbwur commented 6 months ago

I don't see anything in the logging that points to a problem with the health check. When the first pod (of 3) starts there are a lot of log statements like:

```
vmq_swc_store:handle_info/2:555: Replica meta4: Can't initialize AE exchange due to no peer available
```

After a while VerneMQ exits. But before that I'm able to execute the health check via http://localhost:8888/health successfully.

```
2024-05-02T08:53:35.711676+00:00 [debug] <0.292.0> vmq_swc_store:handle_info/2:555: Replica meta9: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:36.920696+00:00 [debug] <0.247.0> vmq_swc_store:handle_info/2:555: Replica meta4: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:37.434670+00:00 [debug] <0.238.0> vmq_swc_store:handle_info/2:555: Replica meta3: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:37.790656+00:00 [debug] <0.283.0> vmq_swc_store:handle_info/2:555: Replica meta8: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:38.419727+00:00 [debug] <0.301.0> vmq_swc_store:handle_info/2:555: Replica meta10: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:38.744695+00:00 [debug] <0.229.0> vmq_swc_store:handle_info/2:555: Replica meta2: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:40.392832+00:00 [debug] <0.265.0> vmq_swc_store:handle_info/2:555: Replica meta6: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:41.044680+00:00 [debug] <0.256.0> vmq_swc_store:handle_info/2:555: Replica meta5: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:41.835692+00:00 [debug] <0.220.0> vmq_swc_store:handle_info/2:555: Replica meta1: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:42.212673+00:00 [debug] <0.292.0> vmq_swc_store:handle_info/2:555: Replica meta9: Can't initialize AE exchange due to no peer available
I'm the only pod remaining. Not performing leave and/or state purge.
2024-05-02T08:53:42.465663+00:00 [debug] <0.274.0> vmq_swc_store:handle_info/2:555: Replica meta7: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:42.839671+00:00 [debug] <0.283.0> vmq_swc_store:handle_info/2:555: Replica meta8: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:42.944858+00:00 [notice] <0.44.0> application_controller:info_exited/3:2129: Application: vmq_server. Exited: stopped. Type: permanent.
2024-05-02T08:53:42.945013+00:00 [notice] <0.44.0> application_controller:info_exited/3:2129: Application: stdout_formatter. Exited: stopped. Type: permanent.
```

ioolkos commented 6 months ago

Those "Replica" logs are normal when you have debug log level on. I guess Kubernetes terminates the pods here, since it cannot reach the health endpoint.
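One plausible explanation for the symptom (an assumption on my part, not confirmed in this thread): if the health HTTP listener binds only to 127.0.0.1, curl from inside the pod succeeds while the kubelet's probe against the pod IP is refused. In vernemq.conf terms the difference would look roughly like this:

```
## Reachable only from inside the pod: in-pod curl works, kube probe fails
listener.http.default = 127.0.0.1:8888

## Reachable on the pod IP as well: kube probe succeeds
listener.http.default = 0.0.0.0:8888
```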



ashtonian commented 6 months ago

Probably need to add this back: https://github.com/vernemq/docker-vernemq/pull/382/files#diff-95359b2d5d846bb085015977b06cde6a1facdc4ac553c06adb7d12e47aa39373L224-L226 May need to add the cluster port back as well.

ioolkos commented 6 months ago

@ashtonian Thanks, I reverted this here: https://github.com/vernemq/docker-vernemq/pull/387 cc @pbwur let's see whether this resolves the issue. I can build new images tomorrow.



ioolkos commented 6 months ago

@pbwur I have now uploaded 2.0.0 images with a tentative fix to Docker Hub. Can you test one of those to check whether the Kubernetes health check works now?



pbwur commented 6 months ago

@ioolkos, it seems to work now. All 3 nodes of the cluster are starting now. Thanks for the great response!

Although probably not related, I do get an error with the second node after the first node starts successfully. After I delete the PersistentVolumeClaim and start the cluster again, everything is OK.

This is part of the logging:

```
2024-05-03T09:00:36.793105+00:00 [info] <0.686.0> vmq_diversity_app:start/2:85: enable auth script for postgres "./share/lua/auth/postgres.lua"
Error! Failed to eval: vmq_server_cmd:node_join('VerneMQ@vernemq-0.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local')

Runtime terminating during boot ({{badkey,{'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local',<<34,100,99,27,209,16,239,117,147,202,59,36,181,234,60,253,91,83,95,77>>}},[{erlang,map_get,[{'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local',<<34,100,99,27,209,16,239,117,147,202,59,36,181,234,60,253,91,83,95,77>>},#{}],[{error_info,#{module=>erl_erts_errors}}]},{vmq_swc_plugin,'-summary/1-lc$^1/1-1-',3,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,220}]},{vmq_swc_plugin,'-summary/1-lc$^1/1-1-',3,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,220}]},{vmq_swc_plugin,history,1,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,230}]},{vmq_swc_peer_service,attempt_join,1,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_peer_service.erl"},{line,57}]},{vmq_server_cli,'-vmq_cluster_join_cmd/0-fun-1-',3,[{file,"/opt/vernemq/apps/vmq_server/src/vmq_server_cli.erl"},{line,516}]},{clique_command,run,1,[{file,"/opt/vernemq/_build/default/

2024-05-03T09:00:37.798996+00:00 [error] <0.9.0>: Error in process <0.9.0> on node 'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local' with exit value:, {{badkey,{'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local',<<34,100,99,27,209,16,239,117,147,202,59,36,181,234,60,253,91,83,95,77>>}},[{erlang,map_get,[{'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local',<<34,100,99,27,209,16,239,117,147,202,59,36,181,234,60,253,91,83,95,77>>},#{}],[{error_info,#{module => erl_erts_errors}}]},{vmq_swc_plugin,'-summary/1-lc$^1/1-1-',3,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,220}]},{vmq_swc_plugin,'-summary/1-lc$^1/1-1-',3,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,220}]},{vmq_swc_plugin,history,1,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,230}]},{vmq_swc_peer_service,attempt_join,1,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_peer_service.erl"},{line,57}]},{vmq_server_cli,'-vmq_cluster_join_cmd/0-fun-1-',3,[{file,"/opt/vernemq/apps/vmq_server/src/vmq_server_cli.erl"},{line,516}]},{clique_command,run,1,[{file,"/opt/vernemq/_build/default/lib/clique/src/clique_command.erl"},{line,87}]},{vmq_server_cli,command,2,[{file,"/opt/vernemq/apps/vmq_server/src/vmq_server_cli.erl"},{line,45}]}]}

Crash dump is being written to: /erl_crash.dump...
[os_mon] memory supervisor port (memsup): Erlang has closed
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
Stream closed EOF for mdtis-poc-mqtt/vernemq-1 (vernemq)
```

hsudbrock commented 5 months ago

@pbwur I have the same issue as the one you describe in your last comment above: when restarting a pod of the VerneMQ stateful set, I get the exact same error; only after deleting the PVC (and underlying PV) and restarting the pod does it come up again. This issue started with 2.0.0; I did not have it with 1.13.

Did you, by any chance, resolve that issue on your side? If yes, I would be thankful to hear how :)

ioolkos commented 5 months ago

@pbwur @hsudbrock Currently looking into the PVC related start error; it looks like some sort of regression.

The following setting in vernemq.conf should prevent it (by switching to the previous join logic):

```
vmq_swc.prevent_nonempty_join = off
```

pbwur commented 5 months ago

Hi @hsudbrock and @ioolkos, apologies for the late response. That issue still happens here as well. It would be great if that setting fixes it. What would be the correct environment variable to set it? `DOCKER_VERNEMQ_VMQ_SWCPREVENTNONEMPTY__JOIN`?

ioolkos commented 5 months ago

@pbwur `DOCKER_VERNEMQ_VMQ_SWC__PREVENT_NONEMPTY_JOIN`

(translate `.` to `__`, keep `_` as `_`)
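The mapping rule can be sketched in shell (a minimal illustration of the convention, not the actual vernemq.sh logic):

```shell
# Derive the DOCKER_VERNEMQ_* name from a vernemq.conf key:
# '.' becomes '__', '_' stays '_', and everything is uppercased.
conf_key="vmq_swc.prevent_nonempty_join"
env_name="DOCKER_VERNEMQ_$(printf '%s' "$conf_key" | sed 's/\./__/g' | tr '[:lower:]' '[:upper:]')"
echo "$env_name"   # prints DOCKER_VERNEMQ_VMQ_SWC__PREVENT_NONEMPTY_JOIN
```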

hsudbrock commented 5 months ago

Thanks for the hint and the PR fixing the issue! So far it looks good for me: disabling the nonempty join check has resulted in no errors when restarting my VerneMQ cluster.