megvii-research / MOTR

[ECCV2022] MOTR: End-to-End Multiple-Object Tracking with TRansformer

Stuck on training #63

Open aaghawaheed opened 1 year ago

aaghawaheed commented 1 year ago

(screenshot of the stuck training run)

When I start training on multiple GPUs, the run gets stuck; you can see it in the screenshot above.

aaghawaheed commented 1 year ago

sh configs/r50_motr_train.sh

/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

| distributed init (rank 0): env://
| distributed init (rank 1): env://

[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809347 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809351 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 118400) of binary: /home/user/anaconda3/envs/motr2/bin/python3
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/motr2/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/anaconda3/envs/motr2/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
[1]:
  time : 2022-12-26_18:00:48
  host : user-System-Product-Name
  rank : 1 (local_rank: 1)
  exitcode : -6 (pid: 118401)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 118401

Root Cause (first observed failure):
[0]:
  time : 2022-12-26_18:00:48
  host : user-System-Product-Name
  rank : 0 (local_rank: 0)
  exitcode : -6 (pid: 118400)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 118400
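A side note on the FutureWarning at the top of this log: torch.distributed.launch is deprecated in favor of torchrun, which exports LOCAL_RANK in the environment instead of passing a --local_rank argument. Below is a minimal sketch of what that migration looks like; the function name and print statement are illustrative only and not taken from the MOTR codebase:

```python
# Illustrative sketch, not the MOTR entry point: how a worker reads its rank
# when launched with torchrun (or torch.distributed.launch with --use_env).
import os

import torch
import torch.distributed as dist


def init_distributed():
    # torchrun exports LOCAL_RANK / RANK / WORLD_SIZE / MASTER_ADDR /
    # MASTER_PORT for every worker, so no --local_rank parsing is needed.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # env:// matches the "distributed init (rank N): env://" lines in the log.
    dist.init_process_group(backend="nccl", init_method="env://")
    return local_rank


if __name__ == "__main__":
    rank = init_distributed()
    print(f"worker ready: local_rank={rank}, global_rank={dist.get_rank()}")
```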

aaghawaheed commented 1 year ago

The process gets stuck at torch.distributed.barrier().
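To check whether the hang comes from the NCCL setup itself rather than from MOTR, a bare collective test outside the training code can help. The script below is only a hypothetical diagnostic sketch (the file name and launch command are assumptions, not part of this repo); if it also stalls at the barrier, the timeout points at the NCCL/driver environment rather than at main.py:

```python
# barrier_smoke_test.py -- hypothetical diagnostic, not part of MOTR.
# Launch with e.g.: torchrun --nproc_per_node=2 barrier_smoke_test.py
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

# The watchdog above timed out on the very first collective (SeqNum=1,
# ALLREDUCE); if this barrier/all_reduce also hangs, the problem is in the
# NCCL setup rather than in the training script.
dist.barrier()
x = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(x)
print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

dist.destroy_process_group()
```

Running it with NCCL_DEBUG=INFO set in the environment also makes NCCL log which transports it tries to use, which can make the point of the stall easier to see.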

Here is my environment information:

PyTorch version: 1.13.0
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.10.8 (main, Nov 24 2022, 14:13:03) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090
GPU 2: NVIDIA GeForce RTX 4090
GPU 3: NVIDIA GeForce RTX 4090

Nvidia driver version: 525.60.11
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.4
[pip3] torch==1.13.0
[pip3] torchaudio==0.13.0
[pip3] torchvision==0.14.0
[conda] blas            1.0                      mkl
[conda] ffmpeg          4.3             hf484d3e_0                      pytorch
[conda] mkl             2021.4.0        h06a4308_640
[conda] mkl-service     2.4.0           py310h7f8727e_0
[conda] mkl_fft         1.3.1           py310hd6ae3a3_0
[conda] mkl_random      1.2.2           py310h00e6091_0
[conda] numpy           1.23.4          py310hd5efca6_0
[conda] numpy-base      1.23.4          py310h8e6c178_0
[conda] pytorch         1.13.0          py3.10_cuda11.7_cudnn8.5.0_0    pytorch
[conda] pytorch-cuda    11.7            h67b0de4_1                      pytorch
[conda] pytorch-mutex   1.0             cuda                            pytorch
[conda] torchaudio      0.13.0          py310_cu117                     pytorch
[conda] torchvision     0.14.0          py310_cu117                     pytorch