michaelschlies closed this issue 7 years ago
Sensu will eventually (2-5 hours) catch up on its own; otherwise I can purge the "keepalive" queue, and the result is that it stays caught up for anywhere from 1 hour to 2 weeks.
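(For reference, a purge can typically be done from a broker node with rabbitmqctl on recent RabbitMQ versions, or via the management plugin's rabbitmqadmin; the vhost and queue names below are assumptions, so match them to your deployment.)

```sh
# Purge the keepalive queue on the /sensu vhost (adjust vhost/queue names to your setup).
rabbitmqctl purge_queue -p /sensu keepalives

# Or, via the management plugin's rabbitmqadmin:
rabbitmqadmin --vhost=/sensu purge queue name=keepalives
```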
@khalaan what version of the Erlang VM are you using to run RabbitMQ? e.g. the output of rpm -qa | grep erlang would be helpful.
[cloud-user@sensu-ttc-production-rabbitmq-004 ~]$ rpm -qa | grep erlang
erlang-erts-R16B-03.16.el7.x86_64
Digging into RabbitMQ itself, it looks happy and healthy: only ~300 connections each, ~200 MB of RAM usage, and 3000 Erlang connections each.
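(For anyone checking broker health the same way, queue backlog and consumer counts can be listed directly with rabbitmqctl; the /sensu vhost below is an assumption.)

```sh
# Per-queue message backlog and consumer counts on the /sensu vhost.
rabbitmqctl list_queues -p /sensu name messages consumers

# Rough connection count on this node.
rabbitmqctl list_connections | wc -l
```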
@khalaan are your Sensu Servers or Sensu Clients accessing the RabbitMQ brokers via a proxy or load balancer?
Yes, all via HAProxy
@khalaan will you please try setting prefetch to its default value of 1 and see if that has any impact on the keepalive queue backing up?
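(For reference, prefetch is set in the Sensu RabbitMQ transport configuration; a minimal sketch, assuming the conventional /etc/sensu/conf.d/rabbitmq.json path and placeholder connection details:)

```json
{
  "rabbitmq": {
    "host": "rabbitmq.example.com",
    "port": 5672,
    "vhost": "/sensu",
    "user": "sensu",
    "password": "secret",
    "prefetch": 1
  }
}
```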
Tried resetting prefetch to 1, and that immediately sends all queues into a state of 100k+ messages.
We probably need multiple AMQP channels. I suspect there's an issue with higher prefetch values and the keepalive consumer; perhaps we should use a separate AMQP channel for keepalives. This complicates the RabbitMQ transport, as the abstraction makes this difficult.
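(To illustrate the idea, not Sensu's actual transport code: with an AMQP client such as the Bunny gem, prefetch applies per channel, so a dedicated keepalive channel can keep a small prefetch while results use a larger one. Queue names and connection details below are assumptions.)

```ruby
require "bunny"

connection = Bunny.new(host: "rabbitmq.example.com", vhost: "/sensu")
connection.start

# Dedicated channel for keepalives with a conservative prefetch.
keepalive_channel = connection.create_channel
keepalive_channel.prefetch(1)
keepalive_queue = keepalive_channel.queue("keepalives")
keepalive_queue.subscribe(manual_ack: true) do |delivery_info, _properties, payload|
  # ... process the keepalive payload ...
  keepalive_channel.ack(delivery_info.delivery_tag)
end

# Separate channel for check results, where a higher prefetch is safer.
result_channel = connection.create_channel
result_channel.prefetch(400)
result_queue = result_channel.queue("results")
result_queue.subscribe(manual_ack: true) do |delivery_info, _properties, payload|
  # ... process the check result payload ...
  result_channel.ack(delivery_info.delivery_tag)
end

sleep # keep the consumers running (subscribe is non-blocking in Bunny)
```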
https://github.com/sensu/sensu/pull/1712 Fixes this 👍
I am periodically seeing an instance of Sensu fall behind on the keepalive exchange/queue while staying up to date on the results exchange/queue. The RabbitMQ prefetch is currently set to 400 just to minimize the impact of this effect. I have 6 Sensu servers, 4 RabbitMQ servers, 2 (active/passive) Redis servers, and approximately 1200 clients that are self-serviced by their infrastructure and app owners with various checks and intervals. Deployment is managed by a wrapper around the sensu cookbook.