We had an incident where we accidentally re-queued the same subject about two dozen times for processing: protected/us-west-2:141414fe-79cb-4c93-96e9-e9487a9ce7d8/data/Gait_test/P01/
Here are the SLURM manager log IDs in the neighborhood of the problem:
31443956
31445178 - This one has the first re-entry
31446797
31448366 - This is where the problem really takes off
31450314 - And then it appears fixed...
It appears the issue is caused by PubSub temporarily disconnecting: the mechanism that pushes the PROCESSING flag file only updates the local Python state when it receives an update back from PubSub, so a disconnect leaves that state stale. If PubSub goes down but Slurm stays up, the same subject can be re-processed in a very aggressive loop (every ten seconds) until PubSub comes back.
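For clarity, here is a rough sketch of the failure mode as I understand it (function and variable names are made up for illustration, not the actual identifiers in app/):

```python
# Hypothetical sketch of the current flow (illustrative names, not real app/ code).
import time

local_state = {}  # subject prefix -> last status seen *via PubSub*
PENDING_SUBJECTS = ["subject-prefix-placeholder/"]  # placeholder list


def on_pubsub_message(subject_prefix: str, status: str) -> None:
    # The ONLY place local_state gets updated. If the PubSub connection drops,
    # this never fires and local_state silently goes stale.
    local_state[subject_prefix] = status


def push_processing_flag(subject_prefix: str) -> None:
    # Writes the PROCESSING flag file to S3 (stubbed here) but does not touch
    # local_state; it relies on the PubSub round trip to come back.
    pass


def submit_slurm_job(subject_prefix: str) -> None:
    print(f"submitting {subject_prefix}")  # stub


def manager_loop() -> None:
    while True:
        for subject in PENDING_SUBJECTS:
            # With PubSub down but Slurm up, this check never sees PROCESSING,
            # so the same subject is re-queued on every pass.
            if local_state.get(subject) != "PROCESSING":
                push_processing_flag(subject)
                submit_slurm_job(subject)
        time.sleep(10)  # the "every ten seconds" loop from the incident
```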
Fixing this requires some re-architecture of how the S3 state is managed in the Python code in the app/ folder, which will take time to debug.
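One possible direction (just a sketch, assuming boto3 and made-up bucket/key names, not the actual app/ code) would be to treat the S3 flag object itself as the source of truth and update the local cache at the moment we push the flag, instead of waiting for a PubSub echo:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "example-bucket"  # placeholder, not the real bucket

local_state = {}  # subject prefix -> status cache


def is_processing(subject_prefix: str) -> bool:
    """Check the PROCESSING flag in S3 directly, using the local cache only as a shortcut."""
    if local_state.get(subject_prefix) == "PROCESSING":
        return True
    try:
        s3.head_object(Bucket=BUCKET, Key=f"{subject_prefix}PROCESSING")
        local_state[subject_prefix] = "PROCESSING"
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] in ("404", "NoSuchKey"):
            return False
        raise  # network/permission errors should not be treated as "not processing"


def mark_processing(subject_prefix: str) -> None:
    """Push the flag file and update the local cache immediately, so a PubSub
    outage cannot leave the manager thinking the subject is still idle."""
    s3.put_object(Bucket=BUCKET, Key=f"{subject_prefix}PROCESSING", Body=b"")
    local_state[subject_prefix] = "PROCESSING"
```

With something like this, correctness would no longer depend on the PubSub round trip; the notifications would just be a latency optimization for refreshing the cache.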