statianzo / que-web

A web interface for the Que queue
BSD 3-Clause "New" or "Revised" License

Rescheduled job is not worked if there is more than one job present when rescheduling #34

Closed: owst closed this issue 7 years ago

owst commented 7 years ago

If there is more than one job in the que_jobs table (and at least one has failed), rescheduling a failed job makes it show up in "running jobs", but it never actually runs (unless the que-web process exits).

This is very similar to #31. There, we were locking all jobs; now we lock a single job N times (once per row in the jobs table). As per the documentation,

A lock can be acquired multiple times by its owning process; for each completed lock request there must be a corresponding unlock request before the lock is actually released.

Unfortunately, we only issue a single unlock request, meaning that, unless the failed job was the only job, we will not issue enough unlock requests and the job will remain locked by que-web (which has the effect of making it show in the Running section).
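To make the counting concrete, here is a toy in-memory model of the stacking behaviour described in the quoted documentation. It does not talk to Postgres, and the `StackedLock` class and its method names are made up purely for illustration.

```ruby
# Toy model of a lock that stacks: each acquisition must be matched by
# its own unlock before the lock is really released.
# StackedLock is a hypothetical name; nothing here touches Postgres.
class StackedLock
  def initialize
    @holds = Hash.new(0) # lock key => number of outstanding acquisitions
  end

  # The owning process can always re-acquire, so this just bumps the count.
  def lock(key)
    @holds[key] += 1
    true
  end

  # Releases one acquisition; returns false if the key was not held at all.
  def unlock(key)
    return false if @holds[key].zero?
    @holds[key] -= 1
    true
  end

  def held?(key)
    @holds[key] > 0
  end
end

locks = StackedLock.new
job_id = 42        # the failed job being rescheduled
rows_in_table = 3  # the lock is taken once per row in the jobs table

rows_in_table.times { locks.lock(job_id) }
locks.unlock(job_id) # the single unlock request we issue

puts locks.held?(job_id) # the job is still locked, with 2 acquisitions outstanding
```

With only one row in the table the single unlock is enough, which is why the bug only appears when other jobs are present.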

This issue, along with #31 and #30, has highlighted the difficulty of constructing a (correct) atomic SQL statement for Postgres to lock/modify/unlock a single job. Many apologies for the bugs that we've introduced; they've exposed some behaviour that is quite subtle! I think the lack of tests for the tricky behaviour hasn't helped, but we could also have done a better job of deeply understanding the queries we were writing.

I have a fix that will abandon the atomic query approach, and instead use separate DB queries to lock, then modify, then unlock a given job, which is simpler and easier to reason about, for a negligible reduction in performance.
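As a rough sketch of what the separate-query approach could look like (this is not the actual patch): the helper name `reschedule_job` is hypothetical, `conn` is assumed to respond to `exec_params(sql, params)` in the way `PG::Connection` does, and the `UPDATE` is assumed to touch only `run_at`.

```ruby
# Hypothetical helper: lock, then modify, then unlock one job, using three
# separate queries instead of a single atomic statement.
# Assumptions: `conn` behaves like PG::Connection#exec_params, and all three
# queries go through the same connection, since these advisory locks are
# held per session.
def reschedule_job(conn, job_id, run_at)
  # 1. Take the advisory lock exactly once, with no FROM clause, so the
  #    function cannot be evaluated once per row in que_jobs.
  result = conn.exec_params('SELECT pg_try_advisory_lock($1::bigint) AS locked', [job_id])
  locked = result[0]['locked']
  # Without a type map the pg gem returns booleans as "t"/"f".
  return false unless [true, 't'].include?(locked)

  begin
    # 2. Modify the job while we hold its lock.
    conn.exec_params(
      'UPDATE que_jobs SET run_at = $1::timestamptz WHERE job_id = $2::bigint',
      [run_at, job_id]
    )
    true
  ensure
    # 3. Issue exactly one unlock for the one lock taken above.
    conn.exec_params('SELECT pg_advisory_unlock($1::bigint)', [job_id])
  end
end
```

Because the lock is taken once and released once, the number of unlock requests always matches the number of lock requests, regardless of how many rows are in the table.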