kbaseattic / assembly

An extensible framework for genome assembly.
MIT License

Fix kill queue #276

Closed · cbun closed 9 years ago

cbun commented 9 years ago

There is still a case where the kill command can miss:

  1. ar-run -f BIG_FILE
  2. ar-kill -j JOB
  3. the job is "Queued" in mongo but already routing to a compute node; we still publish the kill command to catch this case.
  4. the kill request callback checks whether kill_id is in job_list, but the job isn't added to job_list until after data transfer, so the kill request skips the kill queue.

So, I moved the lock_acquire back to before the file transfer. This allows only one thread to transfer at a time, as we discussed. Do you see any downside to this?
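
A minimal sketch of the reordering described above (all names here are hypothetical, not the project's actual identifiers): the job is registered and the transfer lock taken before the slow download starts, so a kill request arriving mid-transfer finds its kill_id in job_list:

```python
import threading

transfer_lock = threading.Lock()  # serializes downloads on this node
job_list = []                     # jobs visible to the kill callback
job_list_lock = threading.Lock()  # guards concurrent access to job_list

def start_job(job_id, transfer):
    # Register the job *before* the (slow) data transfer, so that the
    # kill callback's `kill_id in job_list` check can find it while
    # the file is still downloading.
    with job_list_lock:
        job_list.append(job_id)
    with transfer_lock:           # only one transfer at a time per node
        transfer(job_id)
```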

Also, there was a bug where jobs weren't popped from job_list on terminate/exception. I moved the pop into a finally clause.
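
The finally-based cleanup might look like the following sketch (hypothetical names, not the PR's actual code): the pop runs on normal completion, termination, or any exception, so job_list can't accumulate stale entries.

```python
import threading

job_list = []                     # hypothetical shared list of active jobs
job_list_lock = threading.Lock()  # guards concurrent access

def run_job(job, work):
    """Register the job, run it, and always deregister it."""
    with job_list_lock:
        job_list.append(job)
    try:
        work(job)                 # may raise or be terminated
    finally:
        with job_list_lock:
            job_list.remove(job)  # always runs, even on exception
```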

levinas commented 9 years ago

Nice catch. I will have to merge manually.

The problem with moving the lock_acquire back is that there is a big block of code below it. That block is not protected by try/except/finally, so any exception could leave the lock unreleased and deadlock the node.

levinas commented 9 years ago

I have another case where a job escapes termination: kill_callback() correctly identifies the job as "on this node", but the job proceeds to completion anyway.

I prefer serializing data downloads on each node, as I notice poor performance on /mnt. I'm thinking about how we could avoid putting a big block of code in a try/except while still keeping acquire/release statements balanced.
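
One way to keep acquire/release balanced without wrapping the big block in an explicit try/except is Python's with statement, which releases the lock on every exit path. A sketch under assumed names (download_lock, transfer, compute are illustrative, not the project's code):

```python
import threading

download_lock = threading.Lock()  # hypothetical per-node transfer lock

def transfer_then_compute(job, transfer, compute):
    # `with lock:` expands to acquire() plus a try/finally release(),
    # so the lock is released even if transfer() raises, with no
    # hand-written try/except around the block.
    with download_lock:
        transfer(job)   # serialized: one download at a time per node
    compute(job)        # runs outside the lock, so compute can overlap
```

This keeps only the transfer serialized while letting compute stages run concurrently, and an exception anywhere inside the with block still releases the lock.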