shikras / shikra


An NCCL RuntimeError occurred when saving the model #41

Open Lanxin1011 opened 10 months ago

Lanxin1011 commented 10 months ago

Dear authors, I ran into an error when saving the model. Concretely, the program got stuck at the model-saving stage and timed out after ~30 minutes. It seems to be an FSDP issue? Do you happen to know how to resolve it? Thanks!

If anybody happens to know the solution, please help me; I've been stuck here for several days. Many thanks!!
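For context, saving an FSDP-wrapped model is itself a collective operation: gathering the full state dict issues an all-gather that every rank must participate in, so guarding the state_dict() call with an `if rank == 0:` check is a common way to end up with exactly this kind of hang. Below is a minimal sketch of the usual save pattern; the helper name `save_fsdp_model` and the CPU-offload settings are illustrative assumptions, not code from this repo or from FastChat.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType, FullStateDictConfig

def save_fsdp_model(model, path):
    # Gather a full (unsharded) state dict. Offloading to CPU and keeping
    # the result only on rank 0 avoids blowing up GPU memory during the
    # all-gather. (Illustrative sketch, not the repo's own save code.)
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        # This is a collective: EVERY rank must execute this line,
        # otherwise the all-gather on the other ranks never completes
        # and the NCCL watchdog eventually times out.
        state_dict = model.state_dict()
    if dist.get_rank() == 0:
        torch.save(state_dict, path)
    dist.barrier()  # keep ranks in step before the next collective
```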

The traceback info is similar to the following.

{'train_runtime': 119.8187, 'train_samples_per_second': 2.504, 'train_steps_per_second': 0.075, 'train_loss': 1.2388523353470697, 'epoch': 2.57}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [01:54<00:00, 12.76s/it]
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2778, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801649 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2778, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801649 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 5037) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/tmp/FastChat/fastchat/train/train_mem.py FAILED

Failures:

-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2023-04-07_07:00:37
  host       : edf307caae46
  rank       : 0 (local_rank: 0)
  exitcode   : -6 (pid: 5037)
  error_file :
  traceback  : Signal 6 (SIGABRT) received by PID 5037
=====================================================
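The watchdog line above shows that the collective which never completed was an _ALLGATHER_BASE with the default 30-minute timeout (1800000 ms). If the gather during checkpointing is merely slow rather than deadlocked, one knob is the process-group timeout. A minimal sketch follows, assuming the training script owns the init_process_group call; FastChat's train_mem.py builds on the Hugging Face Trainer, which in recent transformers versions exposes the same setting via the ddp_timeout training argument, so this may not apply verbatim here.

```python
import datetime
import torch.distributed as dist

# Raise the collective timeout above the default 30 minutes so a slow
# (but not deadlocked) all-gather during checkpointing is not killed by
# the NCCL watchdog. Assumption: this script initializes the process
# group itself rather than delegating to the Trainer/launcher.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```

Note that a larger timeout only helps if the save is genuinely slow; if the hang comes from some ranks never entering the collective (as in the rank-0-only save pattern mentioned earlier), the job will still time out, just later.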