While waiting to fully diagnose this problem, a possible workaround is to implement a "watchdog" that deletes all computed values of a feature that is not complete whenever all worker instances appear to have been idle for a sufficient time.
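A minimal sketch of such a watchdog, assuming psycopg2 and hypothetical `pending_jobs` / `computed_feature_values` tables with a `last_activity` timestamp (none of these names are from the actual schema):

```python
import time

import psycopg2

# All table and column names below (pending_jobs, computed_feature_values,
# last_activity) are hypothetical placeholders, not the actual schema.
IDLE_GRACE_SECONDS = 300

def watchdog_pass(connection) -> None:
    """Delete partial results for features whose jobs appear stalled."""
    with connection.cursor() as cursor:
        # Treat a feature as stalled if it still has queued jobs but nothing
        # in its queue has been touched for IDLE_GRACE_SECONDS.
        cursor.execute(
            """
            SELECT feature
            FROM pending_jobs
            GROUP BY feature
            HAVING max(last_activity) < now() - %s * interval '1 second'
            """,
            (IDLE_GRACE_SECONDS,),
        )
        stalled = [row[0] for row in cursor.fetchall()]
        for feature in stalled:
            cursor.execute(
                'DELETE FROM computed_feature_values WHERE feature = %s',
                (feature,),
            )
            cursor.execute(
                'DELETE FROM pending_jobs WHERE feature = %s',
                (feature,),
            )
    connection.commit()

def run_watchdog(dsn: str) -> None:
    connection = psycopg2.connect(dsn)
    while True:
        watchdog_pass(connection)
        time.sleep(60)
```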
For visibility, we may also need to finally implement an additional queue table listing the pending jobs. I avoided this to try to keep the complexity low, but the current system, in which workers are autonomous and do not communicate or block anything, makes debugging challenging.
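A sketch of what such a queue table might look like; the table and column names are illustrative only, not the project's actual schema:

```python
import psycopg2

# Illustrative schema only; table and column names are hypothetical.
CREATE_QUEUE_TABLE = """
CREATE TABLE IF NOT EXISTS pending_jobs (
    job_id       SERIAL PRIMARY KEY,
    feature      INTEGER NOT NULL,
    sample       VARCHAR NOT NULL,
    claimed_by   VARCHAR,       -- worker instance that picked up the job
    claimed_at   TIMESTAMPTZ,   -- when it was picked up
    completed_at TIMESTAMPTZ    -- set when the computed value is inserted
);
"""

def ensure_queue_table(dsn: str) -> None:
    with psycopg2.connect(dsn) as connection:
        with connection.cursor() as cursor:
            cursor.execute(CREATE_QUEUE_TABLE)
```

With explicit `claimed_at` and `completed_at` columns, a dropped job would show up as a row that was claimed but never marked complete, which is exactly the state that is currently invisible.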
In some cases, the API handler for metric computation requests will repeatedly report that, since a job has finished, feature # may be complete. This is supposed to happen only when receiving a postgres NOTIFY notification that a job completed, but for some reason it can be infinitely repeated.

The real problem here is that, for some reason, jobs can be dropped and never completed. I observed a case where this was just 1 out of ~550 jobs (LUAD), and another where it was a few dozen out of ~550. Here "dropped" means that the job was removed from the queue but no computed value was ever inserted. I don't know what conditions produce this, because in the observed cases a repeat attempt (after deleting the feature's computed values) worked to completion.
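For concreteness, a rough sketch of the listen-and-verify pattern described above, using psycopg2 and hypothetical names (a `job_completed` channel and the `pending_jobs` table from the sketches above). It re-queries the queue before declaring the feature complete rather than trusting the NOTIFY alone; this is not the project's actual handler, just an illustration of the intended behavior:

```python
import select

import psycopg2
import psycopg2.extensions

# Channel and table names (job_completed, pending_jobs) are hypothetical.
def wait_for_feature_completion(dsn: str, feature: int) -> None:
    """Listen for job-completion notifications, but verify before reporting."""
    connection = psycopg2.connect(dsn)
    connection.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cursor = connection.cursor()
    cursor.execute('LISTEN job_completed;')
    while True:
        if select.select([connection], [], [], 10) == ([], [], []):
            continue  # timeout with no notification; keep waiting
        connection.poll()
        while connection.notifies:
            connection.notifies.pop(0)
            # Re-check actual completeness rather than trusting the notification.
            cursor.execute(
                'SELECT count(*) FROM pending_jobs'
                ' WHERE feature = %s AND completed_at IS NULL',
                (feature,),
            )
            remaining = cursor.fetchone()[0]
            if remaining == 0:
                print(f'Feature {feature} is complete.')
                return
            # If a job was dropped (dequeued but never completed), remaining
            # stays above zero and this loop never reports completion.
```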