cbun closed 9 years ago
Nice catch. I will have to merge manually.
The problem with moving the lock_acquire back is that we have a big block of code below it. That block is not protected by try/except/finally, so any exception raised inside it would skip the lock release and cause a deadlock.
I have another case where a job escapes termination: kill_callback() correctly identifies the job as "on this node", but the job still runs to completion.
I prefer serializing data download on each node, as I notice poor performance on /mnt. I'm thinking about how we could avoid putting a big block of code in try/except while still keeping the acquire/release statements balanced.
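One possible way to keep acquire/release balanced without a large try/except is Python's `with` statement on the lock. This is only a sketch: `transfer_lock`, `download_and_run`, and the two callables are made-up names, not the project's actual code.

```python
import threading

# Hypothetical per-node lock; the real code may create and name it differently.
transfer_lock = threading.Lock()


def download_and_run(transfer, run):
    """Hold the lock only for the transfer step, then run the job."""
    # The with-statement acquires the lock on entry and releases it on exit,
    # including when transfer() raises, so no explicit finally is needed.
    with transfer_lock:
        transfer()
    # The job itself runs outside the lock.
    run()
```

With this shape, only the transfer step sits inside the protected region, and an exception there cannot leave the lock held.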
There is still a case where the kill command can miss:
ar-run -f BIG_FILE
ar-kill -j JOB
So, I moved the lock_acquire back to before file transfer. This will only allow one thread to transfer at a time, like we discussed. Do you see any downside to this?
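To make the "one thread transfers at a time" behaviour concrete, here is a small sketch that counts how many transfers overlap. All names (`transfer_lock`, `transfer_file`, `run_transfers`) are hypothetical, and `time.sleep` stands in for the real copy.

```python
import threading
import time

transfer_lock = threading.Lock()  # hypothetical per-node transfer lock
counter_lock = threading.Lock()   # protects the two counters below
active = 0
max_active = 0


def transfer_file(job_id):
    """Simulate one file transfer while holding the per-node lock."""
    global active, max_active
    with transfer_lock:
        with counter_lock:
            active += 1
            max_active = max(max_active, active)
        time.sleep(0.01)  # stand-in for the actual copy
        with counter_lock:
            active -= 1


def run_transfers(n):
    """Start n transfer threads and return the peak number running at once."""
    threads = [threading.Thread(target=transfer_file, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return max_active
```

The visible cost in this sketch is latency: every other transfer on the node waits behind a large file, so a short job can be delayed by a long copy.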
Also, there was a bug where jobs weren't popped from job_list on terminate/exception. I moved the pop into a `finally` block.
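Roughly, the cleanup pattern looks like the sketch below. It assumes job_list is a dict keyed by job id, which may not match the real structure, and `run_job` and `work` are illustrative names.

```python
# Assumption: job_list maps job id -> state; the real container may differ.
job_list = {}


def run_job(job_id, work):
    """Register the job, run it, and always unregister it afterwards."""
    job_list[job_id] = "running"
    try:
        return work()
    finally:
        # Runs on normal return and when work() raises, so the entry
        # cannot be left behind after a failure.
        job_list.pop(job_id, None)
```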