uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Sending FINISHED message to workers when main process dies #474

Closed ingolfured closed 4 years ago

ingolfured commented 4 years ago

The problem

If you are using Petastorm in process mode and the main process dies unexpectedly, the workers keep running until the user kills them manually. In some environments this can be quite tricky, especially if you don't have SSH access to the box.

The solution

I propose a solution where a separate process monitors the main process. If the main process dies unexpectedly, this new process, ProcessMonitor, sends a FINISHED message over a new channel to the workers that are still alive.
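A minimal sketch of the monitoring idea, to make the proposal concrete. It is not the code from this PR: `pid_alive`, `monitor`, the `FINISHED` constant and the `put()`-style channels are hypothetical names chosen for illustration, and the liveness check assumes a POSIX system.

```python
import os
import time

FINISHED = "FINISHED"  # hypothetical sentinel; the real message format may differ


def pid_alive(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def monitor(main_pid, control_channels, poll_interval=1.0, is_alive=pid_alive):
    """Block until the main process is gone, then tell every worker to stop.

    control_channels is assumed to be an iterable of objects with a put()
    method (for example multiprocessing.Queue), one per worker.
    """
    while is_alive(main_pid):
        time.sleep(poll_interval)
    for channel in control_channels:
        channel.put(FINISHED)
```

In practice `monitor` would run in its own process, started alongside the workers and given `os.getpid()` of the main process. Each worker would then have to check its control channel between work items.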

selitvin commented 4 years ago

Do you think we need another process monitoring the master? An additional process in the system might introduce more uncertainty and potential flakiness (the same kind of multiprocess issue you are trying to address with this PR).

I am curious, did you consider making the workers terminate themselves? There could be several options, I guess:

ingolfured commented 4 years ago

Interesting. Getting a signal when the parent dies works, but only on Linux. Would that be OK? Otherwise, option 2 seems more robust than the solution I proposed.
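For reference, the Linux-only mechanism mentioned here is usually `prctl(PR_SET_PDEATHSIG)`. The sketch below calls it through `ctypes`. The helper name `die_with_parent` is hypothetical, and this is an illustration of the mechanism rather than the implementation that was eventually merged.

```python
import ctypes
import signal
import sys

PR_SET_PDEATHSIG = 1  # constant from <linux/prctl.h>


def die_with_parent(sig=signal.SIGTERM):
    """Ask the Linux kernel to send `sig` to this process when its parent dies.

    Returns True if the request was made and False on non-Linux platforms,
    where this mechanism does not exist. Passing 0 clears the setting.
    """
    if not sys.platform.startswith("linux"):
        return False
    libc = ctypes.CDLL(None, use_errno=True)  # handle that resolves libc symbols
    if libc.prctl(PR_SET_PDEATHSIG, int(sig), 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")
    return True
```

A worker would call `die_with_parent()` as the first thing it does after starting. Because the parent may already have exited before the call is made, it is worth checking `os.getppid()` right afterwards and exiting if it no longer matches the expected parent pid.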

ingolfured commented 4 years ago

Closed since a better implementation is here: #482