huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

If a training job fails, the failure is not reported to MLFlow and MLFlow shows the job as still running #30333

Open helloworld1 opened 4 months ago

helloworld1 commented 4 months ago

System Info

Who can help?

No response

Reproduction

  1. Set up the MLFlow integration correctly.
  2. Run a job.
  3. The job fails due to an OOM error.
  4. Go to the MLFlow UI; the job's experiment run still shows the status "Running".

Expected behavior

The MLFlow callback should report the job as failed and call end_run() instead of leaving the run in the "Running" status.
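Until the callback handles this, the expected behavior can be approximated on the user side. The following is a minimal sketch, not the fix inside the callback that this issue asks for. The helper name `train_and_close_run` is hypothetical; it takes the training function and the run-closing function as arguments, on the assumption that in a real script these would be `trainer.train` and `mlflow.end_run` (which accepts a `status` such as "FINISHED", "FAILED", or "KILLED").

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def train_and_close_run(train: Callable[[], T], end_run: Callable[..., None]) -> T:
    """Run `train()` and always close the tracking run with a matching status.

    `train_and_close_run` is a hypothetical helper, not part of transformers
    or mlflow. `end_run` is assumed to accept a `status` keyword argument,
    as `mlflow.end_run` does.
    """
    try:
        result = train()
    except KeyboardInterrupt:
        # Manual interruption: mark the run as killed, then re-raise.
        end_run(status="KILLED")
        raise
    except Exception:
        # Any in-process failure (for example an OOM raised as an exception):
        # mark the run as failed, then re-raise so the job still fails.
        end_run(status="FAILED")
        raise
    end_run(status="FINISHED")
    return result


# Assumed usage in a real training script (requires mlflow and a Trainer):
#   import mlflow
#   train_and_close_run(trainer.train, mlflow.end_run)
```

This only helps when the failure surfaces as a Python exception. If the operating system kills the process outright (for example a host-memory OOM kill), no `except` block runs and the run will still appear as "Running".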

amyeroberts commented 4 months ago

cc @muellerzr @pacman100

amyeroberts commented 3 months ago

Gentle ping @pacman100 @muellerzr

amyeroberts commented 2 months ago

cc @muellerzr @SunMarc

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

helloworld1 commented 3 weeks ago

The issue is still not resolved.

amyeroberts commented 3 weeks ago

Adding the Good Second Issue label for anyone in the community who would like to tackle this.