Closed jimmymathews closed 1 month ago
With scaling (dynamic creation and removal of worker processes), the probability of occasional failed jobs increases (due to differences in memory availability, for example) and we finally need to track these more carefully.
To finish this issue, I am implementing a flag on quantitative_feature_value_queue
that workers set when they begin computation (rather than pulling off the queue). This may also include a timestamp, so a "watchdog" step can note probably-failed jobs, log a warning, then clean up the corresponding features.
The above was completed.
Addresses #368.
Also fixes the job queue logic so that newly created worker containers will first review the queue, without prompting from the message channel it is
LISTEN
ing on in postgres.