Closed: tonney007 closed this issue 2 months ago.
[rank1]: File "/opt/ml/code/Open-Sora/scripts/train.py", line 235, in main
[rank1]: for step, batch in pbar:
[rank1]: ConnectionResetError: [Errno 104] Connection reset by peer

This looks similar to https://github.com/hpcaitech/Open-Sora/issues/431. I tried what was suggested there, but it did not help.
Are you running multi-node training? This may be an NCCL configuration problem on your machines. We need more information to analyze it.
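One possible way to collect more information is to turn on verbose NCCL and PyTorch distributed logging before the process group is initialized. The snippet below is only a minimal sketch: the helper name `enable_distributed_debug_logging` is hypothetical and not part of Open-Sora, and it assumes the function is called at the very top of the training entry point, before any distributed setup runs. `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`, and `TORCH_DISTRIBUTED_DEBUG` are standard environment variables.

```python
import os


def enable_distributed_debug_logging(env=os.environ):
    """Hypothetical helper: set verbose logging variables for NCCL and
    torch.distributed, keeping any values the user has already set."""
    settings = {
        "NCCL_DEBUG": "INFO",               # print NCCL initialization and transport details
        "NCCL_DEBUG_SUBSYS": "INIT,NET",    # restrict output to the init and network subsystems
        "TORCH_DISTRIBUTED_DEBUG": "DETAIL",  # extra consistency checks and logs in PyTorch
    }
    for name, value in settings.items():
        # setdefault leaves an existing value untouched
        env.setdefault(name, value)
    # Return the effective values so they can be logged alongside the run
    return {name: env[name] for name in settings}


if __name__ == "__main__":
    print(enable_distributed_debug_logging())
```

The resulting per-rank logs, together with the number of nodes and GPUs in use, would be useful to attach to the issue. Note that these settings only add diagnostics; they are not expected to fix the `ConnectionResetError` by themselves.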
This issue is stale because it has been open for 7 days with no activity.