DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Multi-node and multi-GPU fine-tuning error: ncclInternalError #4056
Closed
sunxiaoyu12 closed 1 year ago
This fine-tuning run uses 3 nodes, each with 4 GPUs.
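For reference, a topology like this is usually described to the DeepSpeed launcher with a hostfile containing one `<hostname> slots=<gpu count>` line per node. Below is a minimal sketch of what that would look like for this setup. The hostnames `node1` to `node3` are placeholders (not the real machine names), and `build_hostfile` is a hypothetical helper written only for illustration.

```python
def build_hostfile(hostnames, gpus_per_node):
    """Return hostfile text with one '<hostname> slots=<n>' line per node."""
    lines = [f"{host} slots={gpus_per_node}" for host in hostnames]
    return "\n".join(lines) + "\n"


# Placeholder hostnames (assumption); replace with the actual node names.
nodes = ["node1", "node2", "node3"]
gpus_per_node = 4

hostfile_text = build_hostfile(nodes, gpus_per_node)

# One process per GPU: 3 nodes x 4 GPUs = 12 ranks in total.
world_size = len(nodes) * gpus_per_node

print(hostfile_text, end="")
print("world size:", world_size)
```

Every node listed in the hostfile has to be reachable from the launching node, and the expected world size here is 12.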
After the model is loaded, the following error occurs:
The detailed output is as follows:
Looking forward to a reply. Thanks!