Open workingloong opened 7 months ago
A blacklist mechanism could be introduced for this case: when a failure is caused by an error in user code, raise an explicit error, stop retrying, and free up the resources.
If training fails because of a code bug, the restarted worker will fail again in the same way. The job should exit as soon as possible to release its resources on the cluster.
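
A minimal sketch of the fail-fast idea discussed in this issue, not the project's actual implementation. It assumes that user code bugs can be recognized by exception type; the names `NON_RETRYABLE_ERRORS`, `UserCodeError`, `run_with_restarts`, and `release_resources` are hypothetical and chosen only for illustration. Errors on the blacklist stop the retry loop immediately, other errors are retried up to a limit, and resources are released in every case.

```python
# Hypothetical blacklist: exception types assumed to indicate a bug in user code.
# A real system would likely need a richer classification (exit codes, logs, etc.).
NON_RETRYABLE_ERRORS = (SyntaxError, NameError, TypeError, AttributeError, ImportError)


class UserCodeError(RuntimeError):
    """Raised when a worker fails because of a bug in user code."""


def run_with_restarts(train_fn, max_restarts=3, release_resources=lambda: None):
    """Run train_fn, restarting on transient failures but not on user code errors."""
    attempts = 0
    try:
        while True:
            attempts += 1
            try:
                return train_fn()
            except NON_RETRYABLE_ERRORS as exc:
                # Restarting cannot fix a code bug, so fail with an explicit error.
                raise UserCodeError(f"user code error, not retrying: {exc!r}") from exc
            except Exception:
                # Possibly transient (e.g. a lost connection): retry up to the limit.
                if attempts > max_restarts:
                    raise
    finally:
        # Always give the resources back, whether the job succeeded or failed.
        release_resources()
```

With this policy a training function that raises `NameError` runs exactly once before the job exits, while one that raises `ConnectionError` is restarted up to `max_restarts` times.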