roryk / ipython-cluster-helper

Tool to easily start up an IPython cluster on different schedulers.

Bug in SLURM job deletion #31

Closed by mwojcikowski 9 years ago

mwojcikowski commented 9 years ago

When scancel is used with the --signal option, only the first job of an array is deleted. If the --signal option is removed, everything works correctly.

roryk commented 9 years ago

Hi @mwojcikowski,

Thanks, we are using --signal so that the job doesn't show up as cancelled. Do you happen to know if there is a way to delete the whole array so that the job is marked as finished instead of cancelled?

mwojcikowski commented 9 years ago

The man pages state that using --signal bypasses slurmctld and goes straight to slurmd, which most probably has no knowledge of job arrays:

The name or number of the signal to send. If this option is not used the specified job or step will be terminated. Note. If this option is used the signal is sent directly to the slurmd where the job is running bypassing the slurmctld thus the job state will not change even if the signal is delivered to it. Use the scontrol command if you want the job state change be known to slurmctld.

Most probably the job ends up as "not cancelled" as a side effect of slurmctld not being notified of the change.

I guess the correct behaviour would be achieved if you enumerate all jobs during the scancel call:

scancel --signal=KILL 1234 1235 1236

PS. I didn't check if --signal works with the underscore notation of array jobs.

PS2. Slurm 14.11.8
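
The enumeration suggested above can be sketched in Python. This is only an illustration, not code from ipython-cluster-helper: the function name `build_scancel_command` is hypothetical, and it builds the argument list without running anything. Whether listing every ID actually reaches all array tasks was not verified in this thread.

```python
def build_scancel_command(job_ids, signal=None):
    """Return an argv list for scancel covering every given job ID.

    Hypothetical helper for illustration; not part of ipython-cluster-helper.
    """
    if not job_ids:
        raise ValueError("at least one job ID is required")
    cmd = ["scancel"]
    if signal is not None:
        # With --signal, the man page says the signal goes straight to slurmd,
        # so each job is listed explicitly instead of relying on the array ID.
        cmd.append("--signal={}".format(signal))
    cmd.extend(str(job_id) for job_id in job_ids)
    return cmd


print(" ".join(build_scancel_command([1234, 1235, 1236], signal="KILL")))
# prints: scancel --signal=KILL 1234 1235 1236
```

Returning a list rather than a shell string keeps the job IDs as separate arguments, so the result could be handed to `subprocess` without shell quoting.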

roryk commented 9 years ago

Thanks Maciej,

I ended up just dropping the KILL signal from the scancel call, as you suggested. I tried a couple of different ways of sending the KILL signal to the job arrays, but I always ended up with a couple of the engines not killed.
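
The resolution described here, cancelling without --signal, might look roughly like the following. It is a sketch under assumptions: `cancel_job` and its `runner` parameter are hypothetical names rather than the project's actual code, and the `runner` hook exists only so the call can be exercised on a machine without SLURM.

```python
import subprocess


def cancel_job(job_id, runner=subprocess.check_call):
    """Cancel a SLURM job (or a whole job array) with a plain scancel call.

    Hypothetical sketch: no --signal is passed, which the thread reports
    cancels every task of an array rather than only the first one.
    """
    cmd = ["scancel", str(job_id)]
    # By default this runs scancel and raises CalledProcessError on failure.
    runner(cmd)
    return cmd
```

The trade-off raised earlier in the thread still applies: without --signal, the job is reported as cancelled rather than finished.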