rabbitmq / rabbitmq-server

Open source RabbitMQ: core server and tier 1 (built-in) plugins
https://www.rabbitmq.com/

Concurrent virtual host state recovery #1648

Open michaelklishin opened 6 years ago

michaelklishin commented 6 years ago

Currently, in environments with many virtual hosts, recovery can take a long time even if each virtual host has only a moderate amount of data to recover.

This can be attributed to 3 things:

The first two items above are by design and their penalty is a trade-off we are willing to accept. The sequential recovery part is not really a design decision and can be revisited.

This issue is about investigating concurrent virtual host recovery and its "unknown unknowns". Just like with other areas of node data recovery we want to limit the concurrency rate, e.g. by using a separate fixed size work pool.

There are currently two steps to recovery: "init", which interacts with the virtual host process supervisor (and possibly cannot be done concurrently), and "recover", which does most of the work (and is closer to what we already do concurrently when recovering queue state). Making the latter step concurrent is our current area of focus.
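To make the shape of the idea concrete, here is a minimal sketch in Python (not the server's Erlang code) of a sequential "init" step followed by a "recover" step that runs on a fixed-size work pool. The names `init_vhost`, `recover_vhost`, `recover_all` and the `pool_size` parameter are hypothetical placeholders, not functions from rabbitmq-server, and the function bodies only stand in for the real work.

```python
from concurrent.futures import ThreadPoolExecutor


def init_vhost(name):
    # Hypothetical stand-in for the "init" step, which talks to the
    # virtual host process supervisor and is assumed to be sequential.
    return {"name": name, "state": "initialized"}


def recover_vhost(vhost):
    # Hypothetical stand-in for the "recover" step, which does most of
    # the work and is the part that would run concurrently.
    vhost["state"] = "recovered"
    return vhost["name"]


def recover_all(vhost_names, pool_size=4):
    # Step 1: run "init" one virtual host at a time.
    initialized = [init_vhost(name) for name in vhost_names]

    # Step 2: run "recover" on a fixed-size pool so that the number of
    # recoveries in flight never exceeds pool_size.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # map() yields results in input order, whatever order the
        # workers happen to finish in.
        return list(pool.map(recover_vhost, initialized))
```

The pool size caps concurrency regardless of how many virtual hosts exist, which mirrors the intent of using a separate fixed-size work pool.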

Per discussion with @kjnilsson.

kjnilsson commented 6 years ago

An alternative to changing the init sequence is to use a permanent and configurable pool of vhost supervisors instead. Then almost all code can stay the same apart from some additional logic to choose a suitable pool member for a newly created vhost and such.
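As a rough illustration of the "choose a suitable pool member" logic, the Python sketch below assigns each new virtual host to one of a fixed number of supervisor slots by hashing its name. Hashing is only one possible selection strategy and is an assumption here, as are the `VhostSupervisorPool` class and its methods; none of this is taken from rabbitmq-server.

```python
import zlib


class VhostSupervisorPool:
    """Hypothetical fixed-size pool; each member tracks its vhosts."""

    def __init__(self, size):
        if size < 1:
            raise ValueError("pool size must be at least 1")
        # One set of vhost names per pool member.
        self.members = [set() for _ in range(size)]

    def member_for(self, vhost_name):
        # CRC32 is stable across runs, so a given name always maps to
        # the same member for a given pool size.
        digest = zlib.crc32(vhost_name.encode("utf-8"))
        return digest % len(self.members)

    def add_vhost(self, vhost_name):
        # Place a newly created vhost under its pool member and
        # return that member's index.
        index = self.member_for(vhost_name)
        self.members[index].add(vhost_name)
        return index
```

A least-loaded strategy would balance members more evenly when virtual hosts differ a lot in size, at the cost of having to store the assignment somewhere so it survives restarts.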

michaelklishin commented 6 years ago

#1650 makes this problem significantly less severe (at least on a few workloads we tested with).