wpool_queue_manager crash cripples worker pool

marcsugiyama commented 9 years ago

wpool_pool supervises the wpool_queue_manager and workers. It is a one_for_one supervisor.

When a worker (wpool_process) starts, it registers itself as available with wpool_queue_manager. wpool_queue_manager uses the loop State variable to keep track of available workers. If wpool_queue_manager crashes, it loses this state information. Because the wpool_pool supervisor is one_for_one, wpool_queue_manger restarts after a crash, but the workers continue running and do not know to reregister. Hence, after a wpool_queue_manager crash, work is never assigned because the wpool_queue_manager does not appear to have any available workers.

Suggestion - rearrange the supervision tree so that there is a one_to_one wpool_process_sup that manages the workers. Make this a child to the wpool_pool supervisor. Change the wpool_pool supervisor to a one_for_all or rest_for_one to restart the workers if the wpool_queue_manager crashes.

elbrujohalcon commented 9 years ago

@marcsugiyama this is now a PR as well :)

marcsugiyama commented 9 years ago

The code looks fine.

I wonder if the restart intensity for the workers should be higher. If the workers are all doing the same thing, is there a high likelihood they will all crash due to a systematic failure? That is, might we expect if one worker fails they all will in quick succession (e.g., workers are database connections and the database is overloaded causing timeouts). If that's the case, maybe the maximum intensity should be calculated based on the number of workers, e.g., three times the number of workers. That gives each worker a chance to crash three times before the supervisor gives up. Would that be less disruptive to the whole system? Or is it better for the supervisor to crash and restart all the workers if we're on the leading edge of a detecting a systemic problem.

elbrujohalcon commented 9 years ago

@marcsugiyama in any case, the restart strategy is configurable. It can be passed in as an option when creating the pool. I just left the current one as the default there.

marcsugiyama commented 9 years ago

OK - that's reasonable.

inaka / worker_pool

wpool_queue_manager crash cripples worker pool #36