intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

The job should stop restarting workers and exit if the traceback indicates a code bug. #1068

Open workingloong opened 7 months ago

workingloong commented 7 months ago

If training fails because of a bug in the user's code, a restarted worker will fail again in the same way. The job should exit as soon as possible to release its resources on the cluster.

BalaBalaYi commented 7 months ago

A blacklist mechanism could be introduced for this case: when the failure is a user code error, raise an explicit error, stop retrying, and free up the resources.
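
A minimal sketch of what such a blacklist might look like, in plain Python. It is not DLRover's implementation: the names `NON_RETRYABLE_ERRORS`, `final_exception_type`, and `should_restart`, the choice of exception types, and the `max_restarts` default are all hypothetical. It assumes the worker's failure is available as standard Python traceback text whose last line names the exception.

```python
import re

# Hypothetical blacklist (an assumption, not DLRover's list): exception types
# that usually point to a bug in user code, where a restart cannot help.
NON_RETRYABLE_ERRORS = (
    "SyntaxError", "NameError", "TypeError", "AttributeError",
    "ImportError", "ModuleNotFoundError", "IndexError", "KeyError",
    "ZeroDivisionError",
)

# The last line of a Python traceback has the form "ExceptionType: message"
# (or just "ExceptionType" when there is no message).
_EXC_LINE = re.compile(r"^([A-Za-z_][\w.]*)(?::|$)")


def final_exception_type(traceback_text):
    """Return the exception class name on the last non-empty line, or None."""
    lines = [ln.strip() for ln in traceback_text.strip().splitlines() if ln.strip()]
    if not lines:
        return None
    match = _EXC_LINE.match(lines[-1])
    if match is None:
        return None
    # Drop any module prefix, e.g. "pkg.mod.SomeError" -> "SomeError".
    return match.group(1).rsplit(".", 1)[-1]


def should_restart(traceback_text, restart_count, max_restarts=3):
    """Decide whether a failed worker is worth restarting."""
    if final_exception_type(traceback_text) in NON_RETRYABLE_ERRORS:
        # Likely a user code bug: stop retrying so resources can be released.
        return False
    # Other failures (e.g. node or network faults) get a bounded number of retries.
    return restart_count < max_restarts
```

Matching on the exception type alone is a coarse heuristic. Some types, such as `RuntimeError`, cover both code bugs and transient faults, so they are left off the list here and fall back to the bounded retry count.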

BalaBalaYi commented 6 months ago

user_code_bug_demo_log.txt

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity.

github-actions[bot] commented 4 weeks ago

This issue is being automatically closed due to inactivity.