vernemq / vernemq

A distributed MQTT message broker based on Erlang/OTP. Built for high quality & Industrial use cases. The VerneMQ mission is active & the project maintained. Thank you for your support!
https://vernemq.com
Apache License 2.0

Upgrade Hackney from 1.8.6 to >1.12 to Fix Connection_Timeouts #1166

Closed by JohnCMcDonough 5 years ago

JohnCMcDonough commented 5 years ago

Environment

discovery_kubernetes = "1"
listener.tcp.localhost = 127.0.0.1:1883
listener.ws.localhost = 127.0.0.1:8080
persistent_client_expiration = 1d
max_online_messages = "-1"
max_offline_messages = "-1"
max_inflight_messages = "0"
allow_register_during_netsplit = "off"
allow_publish_during_netsplit = "on"
allow_subscribe_during_netsplit = "on"
allow_unsubscribe_during_netsplit = "on"
max_client_id_size = "100"
allow_anonymous = "off"
shared_subscription_policy = prefer_local
plugins.vmq_passwd = "off"
plugins.vmq_acl = "off"
plugins.vmq_webhooks = "on"
vmq_webhooks.authwebhook_register.hook = auth_on_register
vmq_webhooks.authwebhook_register.endpoint = http://<service_name>/v1/verne_webhooks/auth_on_register
vmq_webhooks.authwebhook_publish.hook = auth_on_publish
vmq_webhooks.authwebhook_publish.endpoint = http://<service_name>/v1/verne_webhooks/auth_on_publish
vmq_webhooks.authwebhook_subscribe.hook = auth_on_subscribe
vmq_webhooks.authwebhook_subscribe.endpoint = http://<service_name>/v1/verne_webhooks/auth_on_subscribe
vmq_webhooks.register.hook = on_register
vmq_webhooks.register.endpoint = http://<service_name>/v1/device_state_webhooks/on_register
vmq_webhooks.client_offline.hook = on_client_offline
vmq_webhooks.client_offline.endpoint = http://<service_name>/v1/device_state_webhooks/on_client_offline
vmq_webhooks.client_gone.hook = on_client_gone
vmq_webhooks.client_gone.endpoint = http://<service_name>/v1/device_state_webhooks/on_client_gone
vmq_webhooks.pool_max_connections = "50000"
vmq_webhooks.pool_timeout = "5000"
listener.tcp.default = 0.0.0.0:1883
listener.ws.default = 0.0.0.0:8080
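For readers unfamiliar with the `auth_on_register` endpoint configured above, here is a minimal sketch of the decision logic such an endpoint might implement. The credential table and the function name are hypothetical. The response shapes (`{"result": "ok"}` and `{"result": {"error": "not_allowed"}}`) and the `username`/`password` payload fields follow my understanding of the vmq_webhooks documentation and should be checked against the VerneMQ version in use.

```python
import json

# Hypothetical credential store, for illustration only.
ALLOWED = {"device-1": "secret"}


def auth_on_register(body: bytes) -> str:
    """Return the JSON response body for an auth_on_register webhook call.

    Assumes the broker POSTs a JSON object containing "username" and
    "password" fields (an assumption based on the vmq_webhooks docs).
    """
    payload = json.loads(body)
    username = payload.get("username")
    password = payload.get("password")
    # Accept only known users whose password matches exactly.
    accepted = username in ALLOWED and ALLOWED[username] == password
    result = "ok" if accepted else {"error": "not_allowed"}
    return json.dumps({"result": result})


good = auth_on_register(b'{"username": "device-1", "password": "secret"}')
bad = auth_on_register(b'{"username": "device-1", "password": "wrong"}')
print(good)
print(bad)
```

In a real deployment this function would sit behind an HTTP server reachable at the configured `endpoint` URL; the server is omitted here to keep the sketch self-contained.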

Expected behavior

Webhooks continue to function, even after a network failure.

Actual behaviour

After many ECONNRESET errors caused by networking issues in the cluster, the webhooks stop functioning until the VerneMQ instances have been rebooted. The logs show a constant stream of connection timeouts. Even if we remove all load and attempt to connect a single device, it fails.

We've been able to decrease how often this happens by setting:

vmq_webhooks.pool_max_connections = "50000"
vmq_webhooks.pool_timeout = "5000"

This postpones the issue but does not solve it. It appears that the version of Hackney used by VerneMQ is hackney/1.8.6, which has a known issue describing this exact problem:

https://github.com/benoitc/hackney/issues/462
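To illustrate why a larger pool only delays the symptom, here is a toy model of a connection pool that fails to return a slot when a request ends in a connection reset. This is an assumption about the failure mode, not Hackney's actual implementation; `ToyPool`, `simulate`, and all parameters are hypothetical names made up for this sketch.

```python
import queue


class ToyPool:
    """Fixed-size connection pool. A toy model, NOT Hackney's code."""

    def __init__(self, max_connections, leak_on_error):
        self._slots = queue.Queue()
        for slot in range(max_connections):
            self._slots.put(slot)
        self.leak_on_error = leak_on_error

    def request(self, fails, timeout=0.01):
        # Check a slot out of the pool, waiting at most `timeout` seconds.
        try:
            slot = self._slots.get(timeout=timeout)
        except queue.Empty:
            return "connection_timeout"
        if fails:
            # Assumed bug: on a reset the slot is never checked back in.
            if not self.leak_on_error:
                self._slots.put(slot)
            return "econnreset"
        self._slots.put(slot)
        return "ok"


def simulate(leak_on_error, max_connections=3, resets=5):
    """Run `resets` failing requests, then one healthy request."""
    pool = ToyPool(max_connections, leak_on_error)
    for _ in range(resets):
        pool.request(fails=True)
    return pool.request(fails=False)


print(simulate(leak_on_error=True))                       # pool exhausted
print(simulate(leak_on_error=False))                      # pool recovers
print(simulate(leak_on_error=True, max_connections=10))   # bigger pool survives 5 resets
```

In this model, once the number of resets reaches the pool size, every later request times out at checkout even with no load, which matches the observed behaviour. Raising the pool size only increases the number of resets needed to reach that state.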

JohnCMcDonough commented 5 years ago

I think this may be the root cause of other issues such as: https://github.com/vernemq/vernemq/issues/556 and https://github.com/vernemq/vernemq/issues/612

ioolkos commented 5 years ago

Thanks @JohnCMcDonough for your work on this! So apparently this fix for the Hackney socket leak wasn't enough (or it was actually fixing some other issue): https://github.com/vernemq/vernemq/commit/b9e966280a19bd7c312e09bacdb7599870770c0c

Do we have an indication that Hackney 1.12 is actually free of this issue? (cc @larshesel @dergraf )

larshesel commented 5 years ago

I've created a PR (#1168) which upgrades hackney to version 1.15.1 - can you test if the PR solves the issue?

In any case, a lot of bugs have been fixed between 1.8.6 and 1.15.1, so it was about time to get it upgraded.