huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Children jobs are not created after `JobManagerCrashedError` #2765

Open severo opened 5 months ago

severo commented 5 months ago

This would explain why we have ~500 PreviousStepStillProcessingError entries, while they should be temporary.

https://github.com/huggingface/dataset-viewer/pull/2758#issuecomment-2090300183

severo commented 5 months ago

Fixed by #2769

severo commented 5 months ago

Correction: it's not fixed by #2769

severo commented 5 months ago

Hmmm, I could not find a reason why the children steps would not be created when a zombie job is finished by the "zombie killer". Maybe the current surge in PreviousStepStillProcessingError entries is simply due to the large number of started jobs. Let's wait a bit.

severo commented 5 months ago

There are still ~70 cases of PreviousStepStillProcessingError.

Screenshot 2024-05-15 at 10:53:53
```
> use datasets_server_cache
> db.cachedResponsesBlue.countDocuments({error_code: "PreviousStepStillProcessingError", "details.copied_from_artifact": {$exists: false}})
< 64
> db.cachedResponsesBlue.countDocuments({error_code: "PreviousStepStillProcessingError", "details.copied_from_artifact": {$exists: true}})
< 23
```
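
For readers without access to the database, here is a minimal Python sketch of what the two queries split on: whether the nested key `details.copied_from_artifact` is present. The helper names (`has_path`, `split_by_copied`) and the sample documents are made up for illustration; only the field names and the error code come from the queries above.

```python
from collections import Counter


def has_path(doc, dotted_path):
    """Return True if every key of the dotted path is present in the nested dicts.

    This mimics the presence check of MongoDB's `$exists: true` for plain
    nested documents (arrays are not handled in this sketch).
    """
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return False
        current = current[key]
    return True


def split_by_copied(docs, error_code="PreviousStepStillProcessingError"):
    """Count entries with the given error code, split on details.copied_from_artifact."""
    counts = Counter()
    for doc in docs:
        if doc.get("error_code") != error_code:
            continue
        if has_path(doc, "details.copied_from_artifact"):
            counts["copied"] += 1
        else:
            counts["original"] += 1
    return counts


# Hypothetical sample entries, not real cache contents.
SAMPLE_DOCS = [
    {"error_code": "PreviousStepStillProcessingError", "details": {"error": "..."}},
    {"error_code": "PreviousStepStillProcessingError",
     "details": {"copied_from_artifact": {"kind": "some-step"}}},
    {"error_code": "PreviousStepStillProcessingError", "details": None},
    {"error_code": "JobManagerCrashedError", "details": {}},
]

counts = split_by_copied(SAMPLE_DOCS)
print(dict(counts))
```

Entries in the "original" bucket are presumably the ones worth investigating first, since the "copied" ones only mirror an error recorded for another artifact.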

PreviousStepStillProcessingError is raised when a job runner calls get_previous_step_or_raise and the cache lookup raises CachedArtifactNotFoundError. That function is called in a lot of places: https://github.com/search?q=repo%3Ahuggingface%2Fdataset-viewer%20get_previous_step_or_raise&type=code.
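
To make the mechanism concrete, here is a simplified sketch of that error translation. It is not the repository's code: the in-memory `CACHE`, the `get_response` helper, and the two-argument signature are assumptions for illustration; only the two exception names and `get_previous_step_or_raise` come from the discussion above.

```python
class CachedArtifactNotFoundError(Exception):
    """The cache has no entry for the requested artifact."""


class PreviousStepStillProcessingError(Exception):
    """The parent step has not written its cache entry (yet)."""


# Hypothetical in-memory stand-in for the cache collection.
CACHE = {}


def get_response(kind, dataset):
    """Return the cached entry for (kind, dataset), or raise if it is missing."""
    try:
        return CACHE[(kind, dataset)]
    except KeyError as err:
        raise CachedArtifactNotFoundError(f"no cache entry for {kind} / {dataset}") from err


def get_previous_step_or_raise(kind, dataset):
    """Fetch the parent step's entry; a missing entry is reported as 'still processing'."""
    try:
        return get_response(kind, dataset)
    except CachedArtifactNotFoundError as err:
        raise PreviousStepStillProcessingError(
            f"previous step {kind} has no cache entry for {dataset}"
        ) from err
```

In this sketch the error is only transient if the child step runs again once the parent entry exists. If no new child job is created at that point, the stored error entry stays in the cache, which would be consistent with the lingering counts above.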