Closed Xuekai-Zhu closed 1 year ago
Through some preliminary checks, it is to do with line 354:
model = GeminiDDP(model, device=get_current_device(), placement_policy=PLACEMENT_POLICY, pin_memory=True)
.
It is highly likely that your cpu runs out of memory due to exitcode (-9).
Try allocating more main memory to your trial and run again.
Thank you very much, i will try these days. Your advice are very useful!
We have updated a lot. This issue was closed due to inactivity. Thanks.
🐛 Describe the bug
[E ProcessGroupNCCL.cpp:737] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3424, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1803741 milliseconds before timing out. [E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down. ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 12076) of binary: /root/anaconda3/envs/py38/bin/python Traceback (most recent call last): File "/root/anaconda3/envs/py38/bin/torchrun", line 8, in
sys.exit(main())
File "/root/anaconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/root/anaconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/root/anaconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
I use the following riviesd version of OPT example
Environment
Environment requirement.txt is default (demo).
2 3090 24G GPU