While waiting to fully diagnose this problem, a possible workaround is to implement a "watchdog" that deletes all computed values of a feature that is not complete whenever all worker instances appear to have been idle for a sufficient time.
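A minimal sketch of such a watchdog, assuming psycopg2 and hypothetical `pending_jobs` / `computed_feature_values` tables with a `last_activity` timestamp (none of these names are from the actual schema):

```python
import time

import psycopg2

# All table and column names below (pending_jobs, computed_feature_values,
# last_activity) are hypothetical placeholders, not the actual schema.
IDLE_GRACE_SECONDS = 300

def watchdog_pass(connection) -> None:
    """Delete partial results for features whose jobs appear stalled."""
    with connection.cursor() as cursor:
        # Treat a feature as stalled if it still has queued jobs but nothing
        # in its queue has been touched for IDLE_GRACE_SECONDS.
        cursor.execute(
            """
            SELECT feature
            FROM pending_jobs
            GROUP BY feature
            HAVING max(last_activity) < now() - %s * interval '1 second'
            """,
            (IDLE_GRACE_SECONDS,),
        )
        stalled = [row[0] for row in cursor.fetchall()]
        for feature in stalled:
            cursor.execute(
                'DELETE FROM computed_feature_values WHERE feature = %s',
                (feature,),
            )
            cursor.execute(
                'DELETE FROM pending_jobs WHERE feature = %s',
                (feature,),
            )
    connection.commit()

def run_watchdog(dsn: str) -> None:
    connection = psycopg2.connect(dsn)
    while True:
        watchdog_pass(connection)
        time.sleep(60)
```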
For visibility, we may also need to finally implement an additional queue table listing the pending jobs. I avoided this to try to keep the complexity low, but the current system, in which workers are autonomous and do not communicate or block anything, makes debugging challenging.
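A sketch of what such a queue table might look like; the table and column names are illustrative only, not the project's actual schema:

```python
import psycopg2

# Illustrative schema only; table and column names are hypothetical.
CREATE_QUEUE_TABLE = """
CREATE TABLE IF NOT EXISTS pending_jobs (
    job_id       SERIAL PRIMARY KEY,
    feature      INTEGER NOT NULL,
    sample       VARCHAR NOT NULL,
    claimed_by   VARCHAR,       -- worker instance that picked up the job
    claimed_at   TIMESTAMPTZ,   -- when it was picked up
    completed_at TIMESTAMPTZ    -- set when the computed value is inserted
);
"""

def ensure_queue_table(dsn: str) -> None:
    with psycopg2.connect(dsn) as connection:
        with connection.cursor() as cursor:
            cursor.execute(CREATE_QUEUE_TABLE)
```

With explicit `claimed_at` and `completed_at` columns, a dropped job would show up as a row that was claimed but never marked complete, which is exactly the state that is currently invisible.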
In some cases, the API handler for metric computation requests will repeatedly report that, since a job has finished, feature # may be complete. This is supposed to happen only when receiving a postgres NOTIFY notification that a job completed, but for some reason it can be infinitely repeated.

The real problem here is that, for some reason, jobs can be dropped and never completed. I observed a case where this was just 1 out of ~550 jobs (LUAD), and another where it was a few dozen out of ~550. Here "dropped" means that the job was removed from the queue but no computed value was ever inserted. I don't know what conditions produce this, because in the observed cases a repeat attempt (after deleting the feature's computed values) worked to completion.
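For concreteness, a rough sketch of the listen-and-verify pattern described above, using psycopg2 and hypothetical names (a `job_completed` channel and the `pending_jobs` table from the sketches above). It re-queries the queue before declaring the feature complete rather than trusting the NOTIFY alone; this is not the project's actual handler, just an illustration of the intended behavior:

```python
import select

import psycopg2
import psycopg2.extensions

# Channel and table names (job_completed, pending_jobs) are hypothetical.
def wait_for_feature_completion(dsn: str, feature: int) -> None:
    """Listen for job-completion notifications, but verify before reporting."""
    connection = psycopg2.connect(dsn)
    connection.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cursor = connection.cursor()
    cursor.execute('LISTEN job_completed;')
    while True:
        if select.select([connection], [], [], 10) == ([], [], []):
            continue  # timeout with no notification; keep waiting
        connection.poll()
        while connection.notifies:
            connection.notifies.pop(0)
            # Re-check actual completeness rather than trusting the notification.
            cursor.execute(
                'SELECT count(*) FROM pending_jobs'
                ' WHERE feature = %s AND completed_at IS NULL',
                (feature,),
            )
            remaining = cursor.fetchone()[0]
            if remaining == 0:
                print(f'Feature {feature} is complete.')
                return
            # If a job was dropped (dequeued but never completed), remaining
            # stays above zero and this loop never reports completion.
```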