tdaff / remote_ikernel

All your Jupyter kernels, on all your machines, in one place.
https://pypi.org/project/remote_ikernel/
BSD 2-Clause "Simplified" License

Dead kernel submits new job to queue on restart #12

Open tdaff opened 8 years ago

tdaff commented 8 years ago

Original report by Scott Field (Bitbucket: sfield83).


Hi Tom,

Sometimes I'll need to wait a few minutes for my job to start (using the very awesome feature of remotely starting jobs through a batch submission system). If the kernel dies, a brand new job is submitted when it restarts. This leaves two jobs sitting in the queue.

So far, I've only had a problem on PBS systems.

Best, Scott

tdaff commented 8 years ago

Original comment by Tom Daff (Bitbucket: tdaff, GitHub: tdaff).


Hi Scott,

Thanks for the report, and I'm happy that you are still finding the code useful :)

I'm still thinking about how to deal with this. I think the main issue is that the kernel runs in a subprocess, and on restart that subprocess is killed completely and a new one is started. The original PBS job probably lingers until it times out. Does the job go away by itself eventually (maybe after 10 minutes)?
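One way to answer that question is to poll the scheduler for the old job ID after a restart. The sketch below is hypothetical (the name `pbs_job_exists` is not part of remote_ikernel) and assumes that `qstat <job_id>` exits with a non-zero status when PBS no longer knows the job, which is typical of PBS/Torque installations. A recently completed job may still be listed for a while, depending on the server's settings.

```python
import subprocess


def pbs_job_exists(job_id, run=subprocess.run):
    """Return True if PBS still lists `job_id` (queued, running, or recently done).

    Hypothetical helper. Assumes `qstat <id>` exits non-zero for unknown job IDs.
    `run` is injectable so the check can be tested without a real scheduler.
    """
    result = run(["qstat", job_id], capture_output=True, text=True)
    return result.returncode == 0
```

Calling this every few minutes with the ID of the orphaned job would show whether it disappears on its own or waits in the queue until it eventually runs.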

Are you wondering whether it is possible to re-use a job for subsequent kernels? That might be possible, but would need significant re-engineering to persist an active connection between different python processes. Though I know it is annoying when jobs take a while in the queue.

tdaff commented 8 years ago

Original comment by Scott Field (Bitbucket: sfield83).


Hi Tom,

I've never let the rogue job linger too long, so I'm not really sure. But would the queuing system even be alerted that the kernel has died? The job might just sit in the queue until it starts running, and then run to completion (i.e., nothing happens until the requested wall time is exhausted).

Anyway, it's a very minor issue. It would be very nice to re-use the job for the restarted kernel. Alternatively, for a full cleanup, the main process could call the system's qdel command immediately after the kernel subprocess is killed.
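The qdel cleanup could look roughly like the sketch below: remember the job ID that `qsub` prints, then call `qdel` on it when the process exits or receives SIGTERM. The class `PBSJob` and the function `install_cleanup` are hypothetical illustrations, not remote_ikernel's actual code. The sketch assumes `qsub` writes the new job identifier to stdout (standard PBS behaviour). Note that SIGKILL cannot be caught, so if the kernel process is hard-killed on restart, the cleanup would have to happen in the parent process instead.

```python
import atexit
import signal
import subprocess
import sys


class PBSJob:
    """Hypothetical helper: submit a PBS job and qdel it on cleanup."""

    def __init__(self, run=subprocess.run):
        # `run` is injectable so the logic can be exercised without a cluster.
        self._run = run
        self.job_id = None

    def submit(self, script_path):
        # PBS qsub prints the new job identifier (e.g. "12345.server") on stdout.
        result = self._run(["qsub", script_path], capture_output=True,
                           text=True, check=True)
        self.job_id = result.stdout.strip()
        return self.job_id

    def cancel(self):
        # qdel removes the job whether it is still queued or already running.
        if self.job_id is None:
            return
        self._run(["qdel", self.job_id], capture_output=True, text=True)
        self.job_id = None  # make repeated cancel() calls a no-op


def install_cleanup(job):
    """Cancel `job` on normal interpreter exit and on SIGTERM.

    SIGKILL cannot be intercepted, so a hard kill still leaves the job behind.
    Must be called from the main thread (a requirement of signal.signal).
    """
    atexit.register(job.cancel)

    def _on_sigterm(signum, frame):
        job.cancel()
        sys.exit(128 + signum)

    signal.signal(signal.SIGTERM, _on_sigterm)
```

Making `cancel()` idempotent means the SIGTERM handler and the `atexit` hook can both fire without issuing a second `qdel`.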

Scott