Dear authors, I ran into an error when saving the model. Concretely, the program got stuck at the model-saving stage and timed out after ~30 minutes. It seems to be an FSDP issue. Do you happen to know how to resolve it? Thanks!
If anybody happens to know the solution, please help me; I've been stuck on this for several days. Many thanks!
The traceback is similar to the following.
{'train_runtime': 119.8187, 'train_samples_per_second': 2.504, 'train_steps_per_second': 0.075, 'train_loss': 1.2388523353470697, 'epoch': 2.57}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [01:54<00:00, 12.76s/it]
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2778, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801649 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2778, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801649 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 5037) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/tmp/FastChat/fastchat/train/train_mem.py FAILED
Failures:
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-07_07:00:37
host : edf307caae46
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 5037)
error_file:
traceback : Signal 6 (SIGABRT) received by PID 5037
=====================================================
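For reference, from what I understand of the PyTorch FSDP docs (not yet verified on my setup), the usual mitigations are to raise the NCCL collective timeout above its 30-minute default (the `Timeout(ms)=1800000` in the log) and to gather the full state dict with CPU offload so only rank 0 materializes and saves it. A minimal sketch, assuming the stock `torch.distributed.fsdp` API; the 2-hour timeout and the `save_full_state_dict` helper name are my own choices:

```python
import datetime

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)

# NCCL's default collective timeout is 30 min (Timeout(ms)=1800000 in the log
# above). Passing a longer timeout to init_process_group gives the save-time
# all-gather more headroom. 2 hours is an arbitrary guess, not a known-good value.
NCCL_TIMEOUT = datetime.timedelta(hours=2)


def init_distributed() -> None:
    """Initialize the process group with a longer collective timeout."""
    dist.init_process_group(backend="nccl", timeout=NCCL_TIMEOUT)


def save_full_state_dict(model: FSDP, path: str) -> None:
    """Gather a full (unsharded) state dict and save it from rank 0 only.

    offload_to_cpu avoids materializing the whole model on one GPU, and
    rank0_only means non-zero ranks get an empty dict back. Every rank must
    still enter the state_dict() call, since it is a collective operation.
    """
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state = model.state_dict()
    if dist.get_rank() == 0:
        torch.save(state, path)
```

I don't know whether FastChat's trainer exposes these knobs directly, so this may need to be wired into its saving path rather than called standalone.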