sensu / sensu-transport

The Sensu transport abstraction library.
MIT License

keepalives silently fail #14

Open julian7 opened 9 years ago

julian7 commented 9 years ago

I have ~30 machines reporting to Sensu via RabbitMQ, and two of them (the only two in the same datacenter) are flapping. Most of the reports go through, but keepalives do not. The servers report "publishing keepalives" happily, but the messages don't arrive until the transport reconnects.

A reconnect fixes the issue temporarily, but sooner or later publish() to certain channels stops working (while others keep working), until the next full reconnect.

Any ideas?

portertech commented 9 years ago

@julian7 have you tried using AMQP heartbeats to monitor their connections? If not, to enable them, you can add "heartbeat": 60 to your "rabbitmq": {} connection configuration.
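
For reference, a connection configuration with the heartbeat enabled might look like the fragment below. Only "heartbeat": 60 comes from the suggestion above; the host, port, vhost, user, and password values are placeholders and should be replaced with whatever the existing configuration already uses.

```json
{
  "rabbitmq": {
    "host": "rabbitmq.example.com",
    "port": 5672,
    "vhost": "/sensu",
    "user": "sensu",
    "password": "secret",
    "heartbeat": 60
  }
}
```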

julian7 commented 9 years ago

I looked a bit deeper and found that at some point EventMachine's select() stops reporting RabbitMQ's FD as writable. The local buffer grows and grows until it (hopefully) triggers a reset. The new connection works for a while too, but then it stops accepting writes again.

It would be a wild guess, to say the least, to call this an EventMachine issue; nevertheless, I have never before seen a connection FD stop allowing writes after a while.

I'm also a bit confused about the roles of the bundled eventmachine 1.0.3 and sensu-em 2.5.2. sensu-em looks like a patched eventmachine 1.0.7, yet the eventmachine 1.0.3 extension is what gets loaded, not to mention all the other gems that also require the eventmachine gem. One of my nodes is now in a state where connecting to rabbitmq via SSL returns an arity error, but then it's much easier to monitor traffic in the clear :)
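
The symptom described here (select() no longer reporting the FD as writable while data piles up locally) can be reproduced in isolation. The sketch below is not Sensu or EventMachine code. It is a minimal Python illustration, assuming a POSIX system, in which a local socket pair stands in for the client-to-broker connection. All function names are made up for the example.

```python
import select
import socket


def is_writable(sock, timeout=0.0):
    """Return True if select() reports the socket as ready for writing."""
    _, writable, _ = select.select([], [sock], [], timeout)
    return bool(writable)


def fill_send_buffer(sock, chunk=b"x" * 4096):
    """Send without blocking until the kernel refuses more data."""
    sock.setblocking(False)
    queued = 0
    try:
        while True:
            queued += sock.send(chunk)
    except BlockingIOError:
        pass  # the kernel send buffer is full
    return queued


def drain(sock, expected):
    """Read `expected` bytes from the peer so the sender's buffer empties."""
    sock.setblocking(True)
    remaining = expected
    while remaining > 0:
        data = sock.recv(min(65536, remaining))
        if not data:
            break
        remaining -= len(data)


def demo():
    # Hypothetical stand-in for the client <-> RabbitMQ connection.
    writer, peer = socket.socketpair()
    try:
        before = is_writable(writer)       # fresh connection
        queued = fill_send_buffer(writer)  # the peer is not reading
        stalled = is_writable(writer)      # checked while the buffer is full
        drain(peer, queued)                # the peer catches up
        after = is_writable(writer)
        return before, stalled, after
    finally:
        writer.close()
        peer.close()


if __name__ == "__main__":
    print(demo())
```

In this sketch the write side stalls only because the peer stops reading. If something on the real path (the broker, or a middlebox between client and broker) stops consuming data, the sending side would look similar from select()'s point of view, though that is an assumption about this setup and not a diagnosis.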

portertech commented 9 years ago

@julian7 what version of erlang and rabbitmq?

julian7 commented 9 years ago

% rpm -q rabbitmq-server erlang
rabbitmq-server-3.4.4-1.noarch
erlang-17.5-1.el6.x86_64

I've deployed them with sensu-chef's sensu::rabbitmq, with stock settings.

By the way, setting up a heartbeat made the issue fail fast, but they still generate events.

portertech commented 7 years ago

@julian7 Sensu has gone through many changes since this issue was opened. Does this issue persist?

julian7 commented 7 years ago

We also reconfigured our traffic optimizer not to compress rabbitmq data. If the issue still persists, it must have minimal impact.

On Nov 22, 2016, at 12:12 AM, Sean Porter notifications@github.com wrote:

@julian7 Sensu has gone through many changes since this issue was opened. Does this issue persist?
