jeonsworld / ViT-pytorch

PyTorch reimplementation of the Vision Transformer (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)
MIT License
1.94k stars, 370 forks

A bug when using Apex DDP #26

Closed: houwenxin closed this issue 3 years ago

houwenxin commented 3 years ago

Hi jeonsworld, thank you for providing this awesome Vision Transformer repo! I tried to use it but ran into a problem with distributed training. The problem seems to be related to Apex, but do you know what causes it? I would really appreciate any help.

This is the command that I used:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 train.py --name cifar10-100_500 --dataset cifar10 --model_type ViT-B_16 --pretrained_dir checkpoint/ViT-B_16.npz

And this is the error output:

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
Killing subprocess 37950
Killing subprocess 37951
Killing subprocess 37952
Killing subprocess 37953
Traceback (most recent call last):
  File "/opt/miniconda/envs/vit/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/miniconda/envs/vit/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/miniconda/envs/vit/lib/python3.9/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/opt/miniconda/envs/vit/lib/python3.9/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/opt/miniconda/envs/vit/lib/python3.9/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/miniconda/envs/vit/bin/python', '-u', 'train.py', '--local_rank=3', '--name', 'cifar10-100_500', '--dataset', 'cifar10', '--model_type', 'ViT-B_16', '--pretrained_dir', 'checkpoint/ViT-B_16.npz']' returned non-zero exit status 1.
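
The "unhandled system error" message alone does not say which system call failed. One possible way to get more detail is to rerun the same launch command with NCCL's own logging enabled through the `NCCL_DEBUG` environment variable. The sketch below only prepares the environment and argument list for such a rerun. The helper names `nccl_debug_env` and `launch_args` are hypothetical and are not part of this repo, Apex, or PyTorch, and the rerun itself is left as a comment because it needs the same GPUs and checkpoint.

```python
import os
import shlex

# The launch command from this report, unchanged.
LAUNCH_CMD = (
    "python -m torch.distributed.launch --nproc_per_node=4 train.py "
    "--name cifar10-100_500 --dataset cifar10 --model_type ViT-B_16 "
    "--pretrained_dir checkpoint/ViT-B_16.npz"
)


def nccl_debug_env(base_env=None, gpus="0,1,2,3"):
    """Return a copy of the environment with NCCL logging turned on.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = gpus  # same GPUs as in the original command
    env["NCCL_DEBUG"] = "INFO"          # ask NCCL to log more detail
    return env


def launch_args(cmd=LAUNCH_CMD):
    """Split the shell-style command into an argv list for subprocess."""
    return shlex.split(cmd)


# To rerun with logging enabled (requires the GPUs and the checkpoint file):
#   import subprocess
#   subprocess.run(launch_args(), env=nccl_debug_env(), check=False)
```

The extra NCCL log lines printed before the `RuntimeError` may help narrow down whether the failure comes from networking, shared memory, or something else. This is a diagnostic aid under those assumptions, not a fix.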