Scaled-down allreduce PyTorch job won't complete and reports an error

cocodee commented 3 months ago

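Assume the cluster has 4 units of CPU available. Create two pods that occupy 2 units of CPU. Create a torch-mnist job with min_node=2, node-unit=2, max_node=$NODE_NUM, NODE_NUM=4. Each node requires 1 unit of CPU.

3.1 Keep the resource situation unchanged until training ends (requires modifying the code). The job has two workers in the running state and the other two workers in the pending state. The two running workers form a rendezvous and complete training, and their status changes to complete. The other two pending workers then obtain resources, switch to the running state, and continue training, but they report an error and the training never completes.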
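To make the arithmetic of the scenario explicit, here is a minimal sketch of the resource accounting only. The helper names `schedulable_workers` and `rendezvous_size` are hypothetical. Rounding the worker count down to a multiple of node-unit is an assumption about how node-unit is interpreted, not DLRover's actual implementation.

```python
def schedulable_workers(total_cpu, occupied_cpu, cpu_per_worker, max_node):
    """Number of workers that can be scheduled with the free CPU (hypothetical helper)."""
    free_cpu = total_cpu - occupied_cpu
    return min(max_node, free_cpu // cpu_per_worker)


def rendezvous_size(running, min_node, node_unit):
    """Workers admitted to a rendezvous, assuming the count is rounded down
    to a multiple of node_unit and must reach min_node (assumption)."""
    usable = (running // node_unit) * node_unit
    return usable if usable >= min_node else 0


# Values taken from the scenario described above.
TOTAL_CPU = 4       # CPU units available in the cluster
OCCUPIED_CPU = 2    # CPU units taken by the two extra pods
CPU_PER_WORKER = 1  # each node needs 1 CPU unit
MIN_NODE, NODE_UNIT, MAX_NODE = 2, 2, 4

running = schedulable_workers(TOTAL_CPU, OCCUPIED_CPU, CPU_PER_WORKER, MAX_NODE)
pending = MAX_NODE - running
world = rendezvous_size(running, MIN_NODE, NODE_UNIT)

print(f"running={running} pending={pending} rendezvous_size={world}")
```

Under these assumptions two workers run and form a rendezvous of size 2 while two stay pending, which matches the starting state of the report. The sketch does not reproduce the error itself, which occurs only after the first two workers complete and the pending ones start.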

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity.

github-actions[bot] commented 2 days ago

This issue is being automatically closed due to inactivity.

intelligent-machine-learning / dlrover

Scaled-down allreduce PyTorch job won't complete and reports an error #1215