Open severo opened 5 months ago
Fixed by #2769
Correction: it's not fixed by #2769
Hmmm, I could not find a reason why the children steps would not be created when a zombie job is finished by the "zombie killer".
Maybe we have a current surge in PreviousStepStillProcessingError entries because we have a lot of started jobs. Let's wait a bit.
still ~70 cases for PreviousStepStillProcessingError
> use datasets_server_cache
> db.cachedResponsesBlue.countDocuments({error_code: "PreviousStepStillProcessingError", "details.copied_from_artifact":{$exists:false}})
< 64
> db.cachedResponsesBlue.countDocuments({error_code: "PreviousStepStillProcessingError", "details.copied_from_artifact":{$exists:true}})
< 23
PreviousStepStillProcessingError is raised when CachedArtifactNotFoundError is raised while a job runner calls get_previous_step_or_raise, which can occur in a lot of places: https://github.com/search?q=repo%3Ahuggingface%2Fdataset-viewer%20get_previous_step_or_raise&type=code.
This would explain why we have ~500 PreviousStepStillProcessingError entries, while they should be temporary. https://github.com/huggingface/dataset-viewer/pull/2758#issuecomment-2090300183
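For readers following along, here is a minimal sketch of the mechanism described above: a missing parent cache entry surfaces as PreviousStepStillProcessingError in the child's own cache entry. Only the two error names and the name get_previous_step_or_raise come from this thread. The signatures, the in-memory CACHE, compute_child_step and run_job are hypothetical stand-ins, not the actual dataset-viewer code.

```python
class CachedArtifactNotFoundError(Exception):
    """Stand-in (assumed shape): no cache entry exists for the requested step."""


class PreviousStepStillProcessingError(Exception):
    """Stand-in (assumed shape): the parent step has no cached result yet."""


# Hypothetical in-memory cache keyed by (kind, dataset); the real cache is MongoDB.
CACHE: dict[tuple[str, str], dict] = {}


def get_response(kind: str, dataset: str) -> dict:
    # Hypothetical lookup helper: a missing entry becomes CachedArtifactNotFoundError.
    try:
        return CACHE[(kind, dataset)]
    except KeyError as err:
        raise CachedArtifactNotFoundError(f"no entry for {kind} / {dataset}") from err


def get_previous_step_or_raise(kind: str, dataset: str) -> dict:
    # Assumed behavior, per the thread: the "not found" error is re-raised
    # as PreviousStepStillProcessingError.
    try:
        return get_response(kind, dataset)
    except CachedArtifactNotFoundError as err:
        raise PreviousStepStillProcessingError(
            f"parent step {kind} has no cached result for {dataset}"
        ) from err


def compute_child_step(dataset: str) -> dict:
    # Hypothetical child step that depends on a parent step's cached result.
    parent = get_previous_step_or_raise("parent-step", dataset)
    return {"derived_from": parent["content"]}


def run_job(dataset: str) -> dict:
    # Hypothetical job wrapper: the error name is stored as the entry's error_code.
    try:
        return {"content": compute_child_step(dataset), "error_code": None}
    except PreviousStepStillProcessingError as err:
        return {"content": None, "error_code": type(err).__name__}
```

In this sketch, the child entry only stops carrying the error code if the child job is run again after the parent entry exists. If that rerun never happens, the entry stays in the cache, which would be consistent with the persistent counts reported above.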