hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All
https://hpcaitech.github.io/Open-Sora/
Apache License 2.0

Connection error during training #446

Closed: tonney007 closed this issue 2 months ago

tonney007 commented 3 months ago

[rank1]: File "/opt/ml/code/Open-Sora/scripts/train.py", line 235, in main
[rank1]: for step, batch in pbar:
[rank1]: ConnectionResetError: [Errno 104] Connection reset by peer

This looks similar to https://github.com/hpcaitech/Open-Sora/issues/431. I tried the suggestion there, but it did not help.

zhengzangw commented 3 months ago

Are you training on multiple nodes? This may be an NCCL configuration problem on the machines. We need more information to analyze it.
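One way to supply the extra information requested above is to dump the distributed-training environment on each node and attach it to the issue. The following is a minimal sketch, not part of Open-Sora: the helper names `collect_report` and `format_report` are hypothetical, and the choice of variables is an assumption about what is useful here. `NCCL_DEBUG`, `NCCL_SOCKET_IFNAME`, and `NCCL_IB_DISABLE` are NCCL settings; `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`, and `LOCAL_RANK` are typically set by launchers such as torchrun.

```python
import os
import platform
import sys

# Assumed list of variables worth reporting for a multi-node run.
KEYS = [
    "NCCL_DEBUG",
    "NCCL_SOCKET_IFNAME",
    "NCCL_IB_DISABLE",
    "MASTER_ADDR",
    "MASTER_PORT",
    "WORLD_SIZE",
    "RANK",
    "LOCAL_RANK",
]


def collect_report(env=None):
    """Return a dict with interpreter/platform info and selected env vars."""
    env = os.environ if env is None else env
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for key in KEYS:
        # Mark missing variables explicitly so they stand out in the report.
        report[key] = env.get(key, "<unset>")
    return report


def format_report(report):
    """Render the report as KEY=VALUE lines suitable for pasting into an issue."""
    return "\n".join(f"{key}={value}" for key, value in report.items())


if __name__ == "__main__":
    print(format_report(collect_report()))
```

Running the script on every node before launching training shows whether the nodes agree on settings such as `MASTER_ADDR` and the network interface. Setting `NCCL_DEBUG=INFO` for the training run makes NCCL print its own initialization details, which can also be attached.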

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 7 days with no activity.