michaelklishin commented 6 years ago

Currently, in environments with many virtual hosts, recovery can take a long time, even if each virtual host has only a moderate amount of data to recover.

This can be attributed to 3 things:

1. As of https://github.com/rabbitmq/rabbitmq-server/issues/567, each virtual host has separate message stores to recover.
2. As of https://github.com/rabbitmq/rabbitmq-server/pull/1608, virtual host startup is synchronised with its counterparts on all cluster nodes.
3. Virtual host recovery on an individual node is performed sequentially.

The first two items above are by design and their penalty is a trade-off we are willing to accept. The sequential recovery part is not really a design decision and can be revisited.

This issue is about investigating concurrent virtual host recovery and its "unknown unknowns". Just like with other areas of node data recovery we want to limit the concurrency rate, e.g. by using a separate fixed size work pool.

There are currently two steps to recovery: "init", which interacts with the virtual host process supervisor (and possibly cannot be done concurrently), and "recover", which does most of the work (and is closer to what we already do concurrently when recovering queue state). Making the latter step concurrent is our current area of focus.
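
To make the shape of the idea concrete, here is a rough, language-agnostic sketch in Python rather than Erlang. It is not the actual RabbitMQ code: `init_vhost`, `recover_vhost`, `recover_all` and `pool_size` are hypothetical names used only for illustration. "init" stays sequential, and "recover" runs on a fixed-size worker pool so that concurrency is bounded.

```python
from concurrent.futures import ThreadPoolExecutor


def init_vhost(name):
    # Hypothetical stand-in for the "init" step, which talks to the
    # virtual host process supervisor.
    return {"name": name, "recovered": False}


def recover_vhost(state):
    # Hypothetical stand-in for the "recover" step (message stores, queues).
    state["recovered"] = True
    return state["name"]


def recover_all(vhost_names, pool_size=4):
    # Step 1: "init" is kept sequential, since it may not be safe to
    # run concurrently.
    states = [init_vhost(name) for name in vhost_names]
    # Step 2: "recover" runs on a fixed-size pool, so at most pool_size
    # virtual hosts are being recovered at any one time.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # map() yields results in input order, regardless of finish order.
        return list(pool.map(recover_vhost, states))
```

With this split, raising `pool_size` trades faster startup for more concurrent disk and CPU pressure, which is the knob we would want to keep configurable.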

Per discussion with @kjnilsson.

kjnilsson commented 6 years ago

An alternative to changing the init sequence is to use a permanent and configurable pool of vhost supervisors instead. Then almost all code can stay the same apart from some additional logic to choose a suitable pool member for a newly created vhost and such.
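
As a rough illustration of the "choose a suitable pool member" part, one option (purely an assumption here, nothing has been decided) would be a stable hash of the virtual host name modulo the pool size. The sketch below is in Python rather than Erlang, and `pick_pool_member` is a hypothetical name.

```python
import zlib


def pick_pool_member(vhost_name, pool_size):
    # Map a vhost name to an index in [0, pool_size).
    # crc32 is used because it is stable across runs, so a given vhost
    # lands on the same pool member after a node restart.
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    return zlib.crc32(vhost_name.encode("utf-8")) % pool_size
```

A hash-based choice needs no shared bookkeeping, but it does not balance by load; a least-loaded strategy would need the pool to track how many vhosts each member owns.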

michaelklishin commented 6 years ago

#1650 makes this problem significantly less severe (at least on a few workloads we tested with).

rabbitmq / rabbitmq-server

Concurrent virtual host state recovery #1648
