Closed: philipstarkey closed this issue 5 years ago.
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).
OK, it looks like this is because on unix, the `multiprocessing` module `fork()`s new processes by default, and the M-LOOP Gaussian process uses a `multiprocessing.Process` for its `Learner()` class. The other learners use a thread.
However, zmq is not fork-safe, so it breaks at some random point after the fork. I guess I'm surprised it works for even a short time before crashing.
Anyway, the solution is not to fork. In Python ≥ 3.4 one can call `multiprocessing.set_start_method('spawn')`, so I'll make it do that.
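A minimal sketch of the fix described above, assuming it runs at module import time before any worker processes are created; the version check and the `RuntimeError` guard are illustrative assumptions, not necessarily how the actual commit does it:

```python
import sys
import multiprocessing

# zmq is not fork-safe, so on Python >= 3.4 force 'spawn' instead of the
# unix default of fork(). set_start_method() may only be called once, and
# must run before any worker processes are started.
if sys.version_info >= (3, 4):
    try:
        multiprocessing.set_start_method('spawn')
    except RuntimeError:
        # The start method was already set elsewhere; leave it alone.
        pass
```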
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).
Force multiprocessing to spawn new processes instead of forking them on Python >= 3.4.
This fixes issue #49 which was caused by forking, but zeromq not being fork-safe.
→ <<cset 6beaf9d0c6acf8a4579e0d55cc91d52a013f5284>>
Original report (archived issue) by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).
I'm occasionally seeing the analysislib-mloop analysis subprocess hang when communicating with runmanager. It's not spinning the CPU and it doesn't raise an exception, it just hangs, so it looks like a deadlock. If I add print lines around the zeromq calls that communicate with runmanager, it's clear that the code is hanging either in the zmq poll call waiting to hear back from runmanager, or in the subsequent print statement. I suspect the print statement, because if I disable output redirection for lyse routines (and look at my print lines in the terminal instead), I don't see the hang.
It doesn't have anything to do with the deadlock in labscript_utils PR 84, because I observe it without that PR merged in.
I have only observed it when using the Gaussian process M-LOOP controller. Since I doubt it has anything to do with the actual optimisation algorithm, this is probably because the Gaussian process is more computationally intensive, which affects timing.
I am guessing the output redirection isn't as threadsafe as I thought it was. Threads are sharing a zeromq socket, though they serialise access to it with a lock. Nonetheless perhaps this is not enough.
Will pull out a debugger to inspect the Python process whilst hanging to see where it's at.
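For illustration, a hypothetical sketch of the lock-serialised shared-socket pattern mentioned above; the class name, `PUSH` socket type, and endpoint are assumptions, and this is not the actual labscript_utils output redirection code:

```python
import threading
import zmq

class LockedSocket:
    """Sketch of a zeromq socket shared between threads, with all access
    serialised by a lock. zmq sockets are not themselves threadsafe, so
    every send must hold the lock."""

    def __init__(self, endpoint):
        self._lock = threading.Lock()
        context = zmq.Context.instance()
        self._sock = context.socket(zmq.PUSH)
        self._sock.connect(endpoint)

    def send(self, data):
        # Serialise access to the shared socket; `data` must be bytes.
        with self._lock:
            self._sock.send(data)
```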