intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Scaled-down allreduce PyTorch job won't complete and reports an error #1215

Closed cocodee closed 2 days ago

cocodee commented 3 months ago

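1. Assume the cluster has 4 units of CPU available.
2. Create two pods that occupy 2 units of CPU.
3. Create a torch-mnist job with min_node=2, node-unit=2, max_node=$NODE_NUM, NODE_NUM=4. Each node needs 1 unit of CPU.
3.1 Keep the resource situation unchanged until training ends (modify the code).

The job has two workers in the running state and the other two workers in the pending state. The two running workers form a rendezvous and finish training, and their status changes to complete. The two pending workers then obtain resources, switch to running, and continue training, but they report an error and the training never completes.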
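The worker state sequence described above can be reproduced with a small simulation of the CPU accounting. This is only a sketch of the scheduling arithmetic, not DLRover or Kubernetes code: the `Worker` class and the `schedule`, `finish_running`, and `simulate` helpers are hypothetical names made up for illustration, and the sketch assumes that a completed worker releases its CPU unit.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    # Hypothetical stand-in for a job worker pod.
    name: str
    state: str = "pending"  # pending -> running -> complete


def schedule(workers, free_cpu, cpu_per_worker=1):
    """Start pending workers while free CPU remains; return the CPU left over."""
    for w in workers:
        if w.state == "pending" and free_cpu >= cpu_per_worker:
            w.state = "running"
            free_cpu -= cpu_per_worker
    return free_cpu


def finish_running(workers, free_cpu, cpu_per_worker=1):
    """Mark every running worker complete and release its CPU (assumption)."""
    for w in workers:
        if w.state == "running":
            w.state = "complete"
            free_cpu += cpu_per_worker
    return free_cpu


def simulate(total_cpu=4, other_pods_cpu=2, node_num=4):
    """Return the worker states after initial scheduling and after the first group finishes."""
    workers = [Worker(f"worker-{i}") for i in range(node_num)]
    free_cpu = total_cpu - other_pods_cpu  # 2 units are held by the two extra pods

    timeline = []
    free_cpu = schedule(workers, free_cpu)        # only two workers fit
    timeline.append([w.state for w in workers])

    free_cpu = finish_running(workers, free_cpu)  # first rendezvous finishes training
    free_cpu = schedule(workers, free_cpu)        # the pending workers now start
    timeline.append([w.state for w in workers])
    return timeline


if __name__ == "__main__":
    for step in simulate():
        print(step)
```

The second entry of the timeline is the state in which the problem shows up: two workers are already complete while the two late workers have just started running.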

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity.

github-actions[bot] commented 2 days ago

This issue is being automatically closed due to inactivity.