keenon / AddBiomechanics

A tool to automatically process and share biomechanics data
https://addbiomechanics.org/

Re-queue in a loop when PubSub is down #178

Closed: keenon closed this issue 11 months ago

keenon commented 1 year ago

We had an incident in which we accidentally re-queued the same subject about two dozen times for processing: protected/us-west-2:141414fe-79cb-4c93-96e9-e9487a9ce7d8/data/Gait_test/P01/

Here are the SLURM manager log IDs in the neighborhood of the problem:

31443956
31445178 - this one has the first re-entry
31446797
31448366 - This is where the problem really takes off
31450314 - And then it appears fixed...

It appears the issue is caused by PubSub temporarily disconnecting: the mechanism that pushes the PROCESSING flag file does not update the local Python state on its own, only a subsequent update received from PubSub does. So if PubSub goes down but SLURM stays up, it is possible to process the same subject in a very aggressive loop (every ten seconds) until PubSub comes back.
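For illustration, here is a minimal sketch of that failure mode. All names here (SubjectIndex, mark_processing, poll_and_queue, etc.) are hypothetical and not taken from the AddBiomechanics codebase; the point is only that local state is mutated exclusively by the PubSub handler, so a flag write that never echoes back through PubSub leaves the polling loop free to re-queue the same subject.

```python
class SubjectIndex:
    """Hypothetical mirror of the S3 folder state, held in local Python memory."""

    def __init__(self, s3_client, bucket):
        self.s3 = s3_client
        self.bucket = bucket
        self.processing = set()  # subjects we believe have a PROCESSING flag

    def mark_processing(self, subject_path):
        # Push the PROCESSING flag file to S3...
        self.s3.put_object(Bucket=self.bucket,
                           Key=subject_path + 'PROCESSING', Body=b'')
        # ...but do NOT update self.processing here. The local set is only
        # updated when the PubSub notification for this write arrives.

    def on_pubsub_update(self, subject_path):
        # The only code path that updates local state. If PubSub is
        # disconnected, this never fires, and self.processing never
        # learns about the flag we just wrote.
        self.processing.add(subject_path)

    def poll_and_queue(self, subject_path, queue_on_slurm):
        # Runs roughly every ten seconds. With PubSub down, the flag is
        # invisible locally, so the same subject is re-queued every pass.
        if subject_path not in self.processing:
            self.mark_processing(subject_path)
            queue_on_slurm(subject_path)
```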

This requires a bit of a re-architecture of how the S3 state is managed in the Python code in the app/ folder, which will take some time to debug.
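One possible shape for that re-architecture, sketched under the same hypothetical names as above and not reflecting the actual app/ code: update the local state optimistically when the flag is written, and re-check S3 directly before re-queuing, so a PubSub outage can at worst delay processing rather than loop it.

```python
import botocore.exceptions


class SaferSubjectIndex(SubjectIndex):
    def mark_processing(self, subject_path):
        self.s3.put_object(Bucket=self.bucket,
                           Key=subject_path + 'PROCESSING', Body=b'')
        # Optimistically record our own write locally instead of waiting for
        # the PubSub echo, so a disconnect cannot hide the flag from us.
        self.processing.add(subject_path)

    def flag_exists_on_s3(self, subject_path):
        # Authoritative check against S3 itself, used as a backstop when the
        # locally cached state may be stale.
        try:
            self.s3.head_object(Bucket=self.bucket,
                                Key=subject_path + 'PROCESSING')
            return True
        except botocore.exceptions.ClientError:
            return False

    def poll_and_queue(self, subject_path, queue_on_slurm):
        if subject_path in self.processing:
            return
        if self.flag_exists_on_s3(subject_path):
            # Repair the stale local cache instead of re-queuing.
            self.processing.add(subject_path)
            return
        self.mark_processing(subject_path)
        queue_on_slurm(subject_path)
```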