Better logs, and more resilient job queue logic

jimmymathews commented 1 month ago

Addresses #368.

Also fixes the job queue logic so that newly created worker containers review the queue as soon as they start, rather than waiting for a notification on the Postgres message channel they LISTEN on.
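
For illustration only, here is a minimal sketch of that startup behavior. It is not the code in this PR: `pending_jobs` (a deque) stands in for the queue table, `notifications` (a `queue.Queue`) stands in for the Postgres channel, and `run_worker`, `max_idle_waits`, and `timeout` are hypothetical names.

```python
import queue
from collections import deque


def run_worker(pending_jobs, notifications, process, max_idle_waits=1, timeout=0.01):
    """Review the queue once at startup, then react to notifications.

    pending_jobs:  deque standing in for the job queue table (assumption).
    notifications: queue.Queue standing in for the LISTEN channel (assumption).
    process:       callable invoked once per job.
    """

    def drain():
        # Work through everything currently waiting in the queue.
        while pending_jobs:
            process(pending_jobs.popleft())

    # Startup review: jobs queued before this worker existed are picked up
    # here, even though no notification for them will ever reach this worker.
    drain()

    idle = 0
    while idle < max_idle_waits:
        try:
            notifications.get(timeout=timeout)
        except queue.Empty:
            idle += 1  # nothing announced during this wait
            continue
        drain()  # a notification arrived, so review the queue again


if __name__ == "__main__":
    jobs = deque(["feature-1", "feature-2"])
    run_worker(jobs, queue.Queue(), print)
```

Without the initial `drain()` call, a worker created after the notifications were sent would sit idle while jobs remain queued.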

jimmymathews commented 1 month ago

With scaling (dynamic creation and removal of worker processes), occasional job failures become more likely (due to differences in memory availability, for example), so we finally need to track them more carefully. To finish this issue, I am implementing a flag on quantitative_feature_value_queue that workers set when they begin computation, instead of removing the job from the queue at that point. The flag may also include a timestamp, so that a "watchdog" step can identify jobs that have probably failed, log a warning, and then clean up the corresponding features.
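
A rough sketch of the flag-plus-timestamp idea, under stated assumptions: it uses an in-memory SQLite table named `job_queue` with hypothetical columns `job_id`, `feature`, and `started_at`, not the real schema of quantitative_feature_value_queue, and the function names are made up for the example.

```python
import logging
import sqlite3

logger = logging.getLogger("watchdog_sketch")


def create_queue(conn):
    # Hypothetical schema: started_at is NULL until a worker begins the job.
    conn.execute(
        "CREATE TABLE job_queue ("
        "job_id INTEGER PRIMARY KEY, feature INTEGER, started_at REAL)"
    )


def claim_next_job(conn, now):
    """Mark the oldest unstarted job as started, leaving it in the queue."""
    row = conn.execute(
        "SELECT job_id FROM job_queue WHERE started_at IS NULL "
        "ORDER BY job_id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # With concurrent workers on Postgres, this select-then-update would need
    # to be atomic (for example SELECT ... FOR UPDATE SKIP LOCKED).
    conn.execute("UPDATE job_queue SET started_at = ? WHERE job_id = ?", (now, row[0]))
    return row[0]


def complete_job(conn, job_id):
    # Only a finished job leaves the queue.
    conn.execute("DELETE FROM job_queue WHERE job_id = ?", (job_id,))


def find_stale_jobs(conn, now, max_seconds):
    """Return (job_id, feature) for jobs started more than max_seconds ago."""
    rows = conn.execute(
        "SELECT job_id, feature FROM job_queue "
        "WHERE started_at IS NOT NULL AND started_at < ? ORDER BY job_id",
        (now - max_seconds,),
    ).fetchall()
    for job_id, feature in rows:
        logger.warning("Job %s (feature %s) has probably failed.", job_id, feature)
    return rows
```

Because a started job stays in the table until `complete_job` runs, a worker that dies mid-computation leaves a row behind for the watchdog to find. What cleanup then means for the corresponding features is left to the caller here.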

jimmymathews commented 1 month ago

The above was completed.

nadeemlab / SPT

Better logs, and more resilient job queue logic #370