tatsu-lab / stanford_alpaca

Code and documentation to train Stanford's Alpaca models, and generate the data.
https://crfm.stanford.edu/2023/03/13/alpaca.html
Apache License 2.0
29.38k stars, 4.03k forks

NET/IB : Got completion from peer 11.214.147.122<39138> with error 12, opcode 0, len 0, vendor err 129 #207

Open lmx760581375 opened 1 year ago

lmx760581375 commented 1 year ago

[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): NCCL error: remote process exited or there was a network error, NCCL version 2.14.3
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error: NET/IB : Got completion from peer 11.214.147.122<39138> with error 12, opcode 0, len 0, vendor err 129

Ahtesham00 commented 1 year ago

torch.distributed.init_process_group(backend='nccl', init_method='env://', timeout=datetime.timedelta(seconds=1800))
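For readers who want to try the call above, here is a minimal sketch of how it could be wrapped in a script. The helper names `build_init_kwargs` and `init_distributed` are hypothetical and not part of this repository; the sketch assumes the job is started by a launcher such as `torchrun` that sets `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE`, which `init_method='env://'` reads. A longer timeout only helps if the failure is a slow collective; it may not resolve an underlying InfiniBand or network fault.

```python
import datetime


def build_init_kwargs(timeout_seconds=1800):
    """Return the keyword arguments used in the suggestion above.

    Hypothetical helper: it only assembles the arguments so they can be
    inspected without a GPU or an initialized process group.
    """
    return {
        "backend": "nccl",
        "init_method": "env://",
        "timeout": datetime.timedelta(seconds=timeout_seconds),
    }


def init_distributed(timeout_seconds=1800):
    """Initialize the default process group with an explicit timeout.

    Assumes the launcher has already exported MASTER_ADDR, MASTER_PORT,
    RANK, and WORLD_SIZE for this process.
    """
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch.distributed as dist

    if not dist.is_initialized():
        dist.init_process_group(**build_init_kwargs(timeout_seconds))


if __name__ == "__main__":
    # Print the arguments only; call init_distributed() inside a real
    # multi-process launch.
    print(build_init_kwargs())
```

If the training script lets a framework create the process group, this call would have to run before the framework does, or the framework's own timeout setting would need to be used instead.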