Try reclaiming memory better in db cache process

kjnilsson commented 4 years ago

When queries get very large, this process can grow its heap to the point that full-sweep collections become rare. This PR hibernates the process after purging, which forces a full sweep, and also ensures a full sweep is done before every cache invalidation.

See https://github.com/rabbitmq/discussions/issues/71#issuecomment-597045215

gerhard commented 4 years ago

Before

rabbitmq-management-788-before-1080p-12fps-1fps

After

rabbitmq-management-788-after-1080p-12fps-1fps

(open in new tab to get the full 1080p size)

Is there anything left from your POV before merging? Looks good to me :shipit:

michaelklishin commented 4 years ago

A summary of those two GIFs is that with this PR, peak memory usage is 5-12% lower. Perhaps more importantly, peak spikes are lower.

michaelklishin commented 4 years ago

Backported to v3.8.x.

michaelklishin commented 4 years ago

Backported to v3.7.x.

gerhard commented 4 years ago

The most important take-away from those GIFs is that the memory used by rabbit_mgmt_db_cache_queues gets reclaimed within 2-3 minutes with this patch, but it never gets reclaimed without it. Obviously, I couldn't capture "never", so I called it at 10 minutes.

The second most important take-away is that any kind of sorting on the queues page uses a significant amount of memory (see the GIF for exact numbers). It also takes in excess of 30 seconds (in some cases as long as 46 seconds) for the page to update. While most users would assume that something is wrong and click around, the message is to be patient: it works, but it takes what some may perceive as a long time to update. The same behaviour will be seen with any monitoring that makes requests against the HTTP API, specifically GET /api/queues. This means the vast majority of monitoring solutions out there will have to deal with the same "slowness".
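
For anyone polling GET /api/queues from their own tooling, here is a minimal sketch of a client that allows for those long response times and reports how long the call took. It is not taken from any existing monitoring tool. The function names, the `opener` parameter and the 60-second timeout are assumptions for illustration (the timeout is simply picked to sit above the 46 seconds seen in the GIF).

```python
import base64
import json
import time
import urllib.request


def build_queues_request(base_url, username, password):
    """Build a GET /api/queues request with HTTP basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    url = base_url.rstrip("/") + "/api/queues"
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def fetch_queues(base_url, username, password, timeout=60.0,
                 opener=urllib.request.urlopen):
    """Return (queues, elapsed_seconds).

    `timeout` is an assumed value, chosen to be longer than the slowest
    response mentioned above. `opener` is a hypothetical hook so the call
    can be exercised without a running broker.
    """
    request = build_queues_request(base_url, username, password)
    started = time.monotonic()
    with opener(request, timeout=timeout) as response:
        queues = json.load(response)
    return queues, time.monotonic() - started
```

Against a local node with the management plugin on its default port, usage would look like `queues, elapsed = fetch_queues("http://localhost:15672", "guest", "guest")`. Logging `elapsed` makes it easy to tell a slow response from a stuck one.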

rabbitmq-prometheus, the new monitoring solution as of RabbitMQ v3.8.0, does not exhibit this behaviour since it uses a completely different code path. It's as if we designed it to respond within milliseconds (seconds in extreme cases) when there is a large number of objects (connections, channels, queues) 😉

rabbitmq / rabbitmq-management

Try reclaiming memory better in db cache process #788
