intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

How to use the elasticity and fault tolerance in a Volcano job. #1172

Open workingloong opened 2 months ago

workingloong commented 2 months ago

Currently, elastic scheduling in a DLRover ElasticJob suits asynchronous SGD, as used in recommendation model training, but not synchronous SGD. In a synchronous SGD job, training cannot start if the number of nodes is less than the required number. Elastic scheduling launches as many Pods as possible even when the cluster does not have enough available nodes, so the running Pods must wait for the pending Pods, which wastes machines. The gang scheduling and topology-aware scheduling in Volcano are well suited to synchronous SGD for LLM training.
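To make the difference concrete, here is a rough sketch of the two admission policies as a toy model. It is not DLRover or Volcano code: the names `Placement`, `elastic_schedule`, `gang_schedule`, and `idle_nodes` are hypothetical, and it assumes one Pod per node and a job that needs exactly `required` workers before synchronous SGD can start.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    running: int     # Pods that were bound to a node
    pending: int     # Pods still waiting for a node
    can_train: bool  # sync SGD needs every required worker running


def elastic_schedule(required: int, free_nodes: int) -> Placement:
    """Hypothetical best-effort policy: start as many Pods as fit."""
    running = min(required, free_nodes)
    return Placement(running, required - running, running == required)


def gang_schedule(required: int, free_nodes: int) -> Placement:
    """Hypothetical all-or-nothing policy: start every Pod or none."""
    if free_nodes < required:
        return Placement(0, required, False)
    return Placement(required, 0, True)


def idle_nodes(p: Placement) -> int:
    """Nodes held by running Pods that cannot train yet."""
    return 0 if p.can_train else p.running


if __name__ == "__main__":
    # A job that needs 8 workers on a cluster with only 5 free nodes.
    for policy in (elastic_schedule, gang_schedule):
        placement = policy(required=8, free_nodes=5)
        print(policy.__name__, placement, "idle:", idle_nodes(placement))
```

In this toy model the best-effort policy holds 5 nodes that do no work while 3 Pods stay pending, whereas the all-or-nothing policy holds none until all 8 can be placed.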

jinqinn commented 1 month ago

Any update?