Keepalive queue backs up while results continue to be processed

sensu / sensu-transport

The Sensu transport abstraction library.

MIT License

Keepalive queue backs up while results continue to be processed #38

Closed: michaelschlies closed this issue 7 years ago

michaelschlies commented 8 years ago

I periodically see a Sensu instance fall behind on the keepalive exchange/queue while staying up to date on the results exchange/queue. The RabbitMQ prefetch is currently set to 400 just to minimize the impact of this. I have 6 Sensu servers, 4 RabbitMQ servers, 2 (active/passive) Redis servers, and approximately 1200 clients that are self-serviced by their infrastructure and app owners, with various checks and intervals. Deployment is managed by a wrapper around the sensu cookbook.

Sensu Version: 0.25.3
RabbitMQ Server: 3.6.3

michaelschlies commented 8 years ago

Sensu will eventually (after 2-5 hours) catch up on its own. Otherwise, I can purge the "keepalive" queue, after which it stays caught up for anywhere from 1 hour to 2 weeks.
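
For anyone trying to catch the backlog described above as it builds, here is a minimal sketch in Python. It assumes the text output of `rabbitmqctl list_queues name messages` has already been captured; the function names, the sample output, the queue names, and the threshold of 1000 are all illustrative assumptions and do not come from this thread.

```python
def parse_list_queues(output):
    """Parse `rabbitmqctl list_queues name messages` text into {queue: depth}."""
    depths = {}
    for line in output.splitlines():
        parts = line.split()
        # Banner and header lines are skipped; data rows look like "<name> <integer>".
        # Assumption: queue names contain no whitespace.
        if len(parts) == 2 and parts[1].isdigit():
            depths[parts[0]] = int(parts[1])
    return depths


def backed_up(depths, threshold):
    """Return (queue, depth) pairs above the threshold, deepest first."""
    over = [(name, count) for name, count in depths.items() if count > threshold]
    return sorted(over, key=lambda item: item[1], reverse=True)


# Made-up sample output, for illustration only.
sample = "Listing queues ...\nkeepalives\t123456\nresults\t12\n"

depths = parse_list_queues(sample)
for name, count in backed_up(depths, threshold=1000):
    print(f"{name} is backed up: {count} messages")
```

Run on a schedule against each broker, something like this could record when the keepalive queue starts to grow relative to the results queue.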

cwjohnston commented 8 years ago

@khalaan what version of the Erlang VM are you using to run RabbitMQ? The output of rpm -qa | grep erlang would be helpful.

michaelschlies commented 8 years ago

[cloud-user@sensu-ttc-production-rabbitmq-004 ~]$ rpm -qa | grep erlang
erlang-erts-R16B-03.16.el7.x86_64

michaelschlies commented 8 years ago

Digging into RMQ itself, it looks happy and healthy: only 300 connections each, 200 MB of RAM usage, and 3000 Erlang connections each.

cwjohnston commented 8 years ago

@khalaan are your Sensu Servers or Sensu Clients accessing the RabbitMQ brokers via a proxy or load balancer?

michaelschlies commented 8 years ago

Yes, all via HAProxy

cwjohnston commented 8 years ago

@khalaan will you please try setting prefetch to its default value of 1 and see if that has any impact on the keepalive queue backing up?

michaelschlies commented 8 years ago

I tried resetting prefetch to 1, and that immediately sent all queues into a backlog of 100k+ messages.

portertech commented 8 years ago

We probably need multiple AMQP channels. I suspect there's an issue with higher prefetch values and the keepalive consumer; perhaps we should use a separate AMQP channel for keepalives. This complicates the RabbitMQ transport, and the abstraction makes it difficult.
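
A rough sketch of the "separate channel per consumer" idea, in Python for illustration only. Sensu's transport is written in Ruby, so this is not its code, and it is not necessarily what the eventual fix does. The `open_consumer_channels` helper, the queue names, and the prefetch values are hypothetical. `FakeConnection` and `FakeChannel` are stand-ins so the sketch runs without a broker; they mimic the `channel()`, `basic_qos(prefetch_count=...)`, and `basic_consume(queue=..., on_message_callback=...)` method names from the pika 1.x client.

```python
class FakeChannel:
    """Stand-in channel that records calls (mimics pika 1.x method names)."""

    def __init__(self, number):
        self.number = number
        self.prefetch_count = None
        self.consumed_queues = []

    def basic_qos(self, prefetch_count=0):
        self.prefetch_count = prefetch_count

    def basic_consume(self, queue, on_message_callback):
        self.consumed_queues.append(queue)


class FakeConnection:
    """Stand-in connection that hands out numbered channels."""

    def __init__(self):
        self.channels = []

    def channel(self):
        ch = FakeChannel(len(self.channels) + 1)
        self.channels.append(ch)
        return ch


def open_consumer_channels(connection, prefetch_by_queue, on_message):
    """Open one channel per queue, each with its own prefetch limit."""
    channels = {}
    for queue, prefetch in prefetch_by_queue.items():
        channel = connection.channel()  # dedicated channel for this consumer
        channel.basic_qos(prefetch_count=prefetch)
        channel.basic_consume(queue=queue, on_message_callback=on_message)
        channels[queue] = channel
    return channels


def handle(channel, method, properties, body):
    # Placeholder callback; a real consumer would process and ack the message.
    pass


connection = FakeConnection()
# Hypothetical queue names and prefetch values.
channels = open_consumer_channels(connection, {"keepalives": 1, "results": 400}, handle)
for queue, ch in channels.items():
    print(queue, "-> channel", ch.number, "prefetch", ch.prefetch_count)
```

With one channel per consumer, each queue's prefetch limit can be tuned independently, at the cost of the transport having to track more than one channel.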

portertech commented 7 years ago

https://github.com/sensu/sensu/pull/1712 Fixes this 👍