kbaseattic / assembly

An extensible framework for genome assembly.
MIT License

Fix kill queue #276

Closed · cbun closed 9 years ago

cbun commented 9 years ago

There is still a case where the kill command can miss:

  1. ar-run -f BIG_FILE
  2. ar-kill -j JOB
  3. the job is "Queued" in mongo but already routing to a compute node; we still publish the kill command to catch this case.
  4. the kill request callback checks whether kill_id is in job_list, but the job isn't added to job_list until after data transfer, so the kill request skips the kill queue.

So, I moved the lock_acquire back to before the file transfer. This allows only one thread to transfer at a time, as we discussed. Do you see any downside to this?
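
A minimal sketch of the reordering described above (all names here are hypothetical, not the project's actual identifiers): the job is registered and the transfer lock taken before the slow download starts, so a kill request arriving mid-transfer finds its kill_id in job_list:

```python
import threading

transfer_lock = threading.Lock()  # serializes downloads on this node
job_list = []                     # jobs visible to the kill callback
job_list_lock = threading.Lock()  # guards concurrent access to job_list

def start_job(job_id, transfer):
    # Register the job *before* the (slow) data transfer, so that the
    # kill callback's `kill_id in job_list` check can find it while
    # the file is still downloading.
    with job_list_lock:
        job_list.append(job_id)
    with transfer_lock:           # only one transfer at a time per node
        transfer(job_id)
```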

Also, there was a bug where jobs weren't popped from job_list on terminate/exception. I moved the pop into a finally clause.
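
The finally-based cleanup might look like the following sketch (hypothetical names, not the PR's actual code): the pop runs on normal completion, termination, or any exception, so job_list can't accumulate stale entries.

```python
import threading

job_list = []                     # hypothetical shared list of active jobs
job_list_lock = threading.Lock()  # guards concurrent access

def run_job(job, work):
    """Register the job, run it, and always deregister it."""
    with job_list_lock:
        job_list.append(job)
    try:
        work(job)                 # may raise or be terminated
    finally:
        with job_list_lock:
            job_list.remove(job)  # always runs, even on exception
```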

levinas commented 9 years ago

Nice catch. I will have to merge manually.

The problem with moving the lock_acquire back is that there is a big block of code below it. That block is not protected by try/except/finally, so any exception could leave the lock unreleased and deadlock the node.

levinas commented 9 years ago

I have another case where a job escapes termination: kill_callback() correctly identifies the job as "on this node", but the job proceeds to completion anyway.

I prefer serializing data downloads on each node, as I notice poor performance on /mnt. I'm thinking about how we could avoid putting a big block of code in a try/except while still keeping acquire/release statements balanced.
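
One way to keep acquire/release balanced without wrapping the big block in an explicit try/except is Python's with statement, which releases the lock on every exit path. A sketch under assumed names (download_lock, transfer, compute are illustrative, not the project's code):

```python
import threading

download_lock = threading.Lock()  # hypothetical per-node transfer lock

def transfer_then_compute(job, transfer, compute):
    # `with lock:` expands to acquire() plus a try/finally release(),
    # so the lock is released even if transfer() raises, with no
    # hand-written try/except around the block.
    with download_lock:
        transfer(job)   # serialized: one download at a time per node
    compute(job)        # runs outside the lock, so compute can overlap
```

This keeps only the transfer serialized while letting compute stages run concurrently, and an exception anywhere inside the with block still releases the lock.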