If there is more than one job in the que_jobs table (and at least one has failed), rescheduling a failed job makes it show up under "running jobs", but it will never actually run (unless the que-web process exits).
This is very similar to #31, but whereas there we were locking all jobs, here we lock a single job N times (once per row in the jobs table). As per the Postgres documentation,
A lock can be acquired multiple times by its owning process; for each completed lock request there must be a corresponding unlock request before the lock is actually released.
Unfortunately, we only issue a single unlock request, meaning that, unless the failed job was the only job, we will not issue enough unlock requests and the job will remain locked by que-web (which has the effect of making it show in the Running section).
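To make the counting concrete, here is a toy model of the stacking behaviour described above. It is only an illustration: `AdvisoryLockModel` and `reschedule_with_bug` are hypothetical names, and nothing here talks to Postgres or reproduces que-web's actual query.

```python
from collections import Counter


class AdvisoryLockModel:
    """Toy model of lock stacking within one session; not a Postgres client."""

    def __init__(self):
        # lock key -> number of acquisitions not yet released
        self.held = Counter()

    def lock(self, key):
        self.held[key] += 1

    def unlock(self, key):
        if self.held[key] == 0:
            return False  # nothing to release
        self.held[key] -= 1
        return True

    def is_locked(self, key):
        return self.held[key] > 0


def reschedule_with_bug(session, job_id, rows_in_table):
    # The lock on the job is taken once per row in the table...
    for _ in range(rows_in_table):
        session.lock(job_id)
    # ...but only a single unlock request is issued.
    session.unlock(job_id)


session = AdvisoryLockModel()
reschedule_with_bug(session, job_id=42, rows_in_table=1)
only_job_still_locked = session.is_locked(42)

session = AdvisoryLockModel()
reschedule_with_bug(session, job_id=42, rows_in_table=3)
one_of_three_still_locked = session.is_locked(42)

print(only_job_still_locked, one_of_three_still_locked)
```

With a single row the one unlock balances the one lock, so the job is released. With three rows, two acquisitions are left outstanding, which matches the job staying in the Running section.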
This issue, along with #31 and #30, has highlighted how difficult it is to construct a (correct) atomic SQL statement that Postgres can execute to lock/modify/unlock a single job. Many apologies for the bugs that we've introduced; they've highlighted some behaviour that is quite subtle! I think the lack of tests for the tricky behaviour hasn't helped, but we could also have done a better job of deeply understanding the queries we were writing.
I have a fix that will abandon the atomic query approach, and instead use separate DB queries to lock, then modify, then unlock a given job, which is simpler and easier to reason about, for a negligible reduction in performance.
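A rough sketch of what the separate-queries shape could look like, not the actual fix. It assumes the lock is a Postgres advisory lock keyed on the job's id; `execute` is a hypothetical helper that runs one statement and returns the first column of the first row, and the `run_at` and `job_id` column names are assumptions for illustration. Because advisory locks taken this way belong to a session, all three statements would need to run on the same connection.

```python
def reschedule_job(execute, job_id):
    """Lock, modify, then unlock one job using three separate statements."""
    # Step 1: take the lock exactly once, without blocking.
    locked = execute("SELECT pg_try_advisory_lock(%s)", (job_id,))
    if not locked:
        return False  # another session (e.g. a worker) holds the job

    try:
        # Step 2: modify the job. Column names are assumed, see above.
        execute(
            "UPDATE que_jobs SET run_at = now() WHERE job_id = %s",
            (job_id,),
        )
    finally:
        # Step 3: exactly one unlock for the one lock taken in step 1.
        execute("SELECT pg_advisory_unlock(%s)", (job_id,))
    return True


# Stand-in for a real connection so the sketch runs without a database.
calls = []


def fake_execute(sql, params):
    calls.append(sql)
    return True


result = reschedule_job(fake_execute, 42)
```

Since each step is its own statement, the number of lock and unlock requests no longer depends on how many rows a query happens to touch.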