joblib / loky

Robust and reusable Executor for joblib
http://loky.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Implement max_tasks_per_child #373

Open ddelange opened 1 year ago

ddelange commented 1 year ago

Hi 👋

Analogous to the max_tasks_per_child keyword argument of concurrent.futures.ProcessPoolExecutor (added in Python 3.11) and the maxtasksperchild keyword argument of multiprocessing.pool.Pool (added in Python 3.2), it would be great to be able to control how many tasks a loky subprocess completes before it is shut down and replaced with a new subprocess.
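For reference, here is a minimal sketch of the existing standard-library behaviour, using multiprocessing.pool.Pool's maxtasksperchild. The helper names task and worker_pids and the start_method parameter are illustrative assumptions, not part of loky or of any proposed API. A single worker is recycled after two completed tasks, so the PIDs reported by successive tasks should change over time.

```python
import multiprocessing as mp
import os


def task(_):
    # Report the PID of the worker process that executed this task.
    return os.getpid()


def worker_pids(n_tasks=6, tasks_per_child=2, start_method=None):
    # start_method=None uses the platform default ("spawn", "fork", ...).
    ctx = mp.get_context(start_method)
    # One worker process, replaced after `tasks_per_child` completed tasks.
    with ctx.Pool(processes=1, maxtasksperchild=tasks_per_child) as pool:
        # chunksize=1 so that every item counts as a separate task.
        return pool.map(task, range(n_tasks), chunksize=1)


if __name__ == "__main__":
    pids = worker_pids()
    print(f"{len(pids)} tasks ran in {len(set(pids))} distinct worker processes")
```

concurrent.futures.ProcessPoolExecutor(max_tasks_per_child=...) offers the same kind of control on Python 3.11+, but it is documented as incompatible with the fork start method.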

Our dask workers are currently consistently failing with loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

This is most likely caused by upstream memory leaks in lxml: the same loky pool subprocesses run for 5+ hours and gradually hit our 60 GiB memory limit. Periodically replacing the workers (we use the spawn start method) will most likely fix these errors.

Many thanks!

ogrisel commented 1 year ago

That sounds like a good idea, feel free to submit a PR and link it back to this issue.

ddelange commented 1 year ago

I think this is a pretty low-level change in need of someone who's deep into the source code, especially with regard to start methods: