cc @sgugger
This seems like a problem with the NCCL install in your environment: if your script can run on two GPUs, there is nothing in the code that needs to change for it to run on four, so this is not a bug in the training script or in transformers. I have never seen that particular NCCL error, so I'm afraid I can't really help debug it.
Thanks for the quick reply. Yeah, it’s strange that it works on 2 GPUs but not on 4. Will check again and let you know.
@sgugger just to clarify: the system has 4 GPUs, and it's only the nproc_per_node argument I'm changing (from 1 to 2, 4, etc.). Just want to make sure I haven't misunderstood the cause of the error. Right?
Yes, I understood that. The PyTorch launcher spawns a different number of processes depending on the number you pass, and those processes use that many GPUs (the others stay idle).
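For reference, each process spawned by the launcher essentially does something like the sketch below. This is a minimal illustration, not the actual run_glue.py code, and the file name minimal_ddp_check.py is made up; running it with --nproc_per_node 2 versus 4 is also a quick way to check whether NCCL itself, rather than the training script, breaks once four processes are involved.

    # minimal_ddp_check.py -- illustrative sketch, not part of run_glue.py
    import argparse
    import torch
    import torch.distributed as dist

    # torch.distributed.launch starts nproc_per_node copies of this script,
    # passes each copy a distinct --local_rank, and exports MASTER_ADDR,
    # MASTER_PORT, RANK and WORLD_SIZE so init_process_group can rendezvous.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)   # each process drives exactly one GPU
    dist.init_process_group(backend="nccl")  # creates the NCCL process group
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} on GPU {args.local_rank}")
    dist.barrier()                           # the collective that fails in the traceback below
    dist.destroy_process_group()

For example: python -m torch.distributed.launch --nproc_per_node 4 minimal_ddp_check.py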
Thanks. Just wanted to confirm that. Will try reinstalling the environment and update if I find the solution.
Hi @sgugger, good news: the problem does seem to have been an environment issue. Thanks for the instant help!
I still run into the same problem; could you please tell me how to solve it?
Environment info
transformers version: 4.3.2
Who can help
Information
Model I am using (Bert, XLNet ...): DistilRoberta
The problem arises when using:
The tasks I am working on is:
Regression task with a single output, using BertForSequenceClassification
To reproduce
Steps to reproduce the behavior:
1. python -m torch.distributed.launch --nproc_per_node 4 /home/run_glue.py --train_file /home/data/train.csv --validation_file /home/data/dev.csv --test_file /home/data/test.csv --model_name_or_path distilroberta-base --output_dir /home/model --num_train_epochs 5 --per_device_train_batch_size 1 --per_device_eval_batch_size 16 --do_train --do_eval --fp16 --gradient_accumulation_steps 2 --do_predict --logging_steps 100 --evaluation_strategy steps --save_steps 100 --overwrite_output_dir
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group 732793de051f:1895:1925 [1] NCCL INFO transport/shm.cc:101 -> 2 732793de051f:1895:1925 [1] NCCL INFO transport.cc:30 -> 2 732793de051f:1895:1925 [1] NCCL INFO transport.cc:49 -> 2 732793de051f:1895:1925 [1] NCCL INFO init.cc:766 -> 2 732793de051f:1895:1925 [1] NCCL INFO init.cc:840 -> 2 732793de051f:1895:1925 [1] NCCL INFO group.cc:73 -> 2 [Async thread] barrier() File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier Traceback (most recent call last): File "/home/run_text_classification.py", line 480, in
work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729138878/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
main()
File "/home/run_text_classification.py", line 163, in main
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/opt/conda/lib/python3.7/site-packages/transformers/hf_argparser.py", line 180, in parse_args_into_dataclasses
obj = dtype(*inputs)
File "", line 60, in init
File "/opt/conda/lib/python3.7/site-packages/transformers/training_args.py", line 478, in __post_init__
if is_torch_available() and self.device.type != "cuda" and self.fp16:
File "/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py", line 1346, in wrapper
return func( args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/transformers/training_args.py", line 583, in device
return self._setup_devices
732793de051f:1897:1927 [3] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device File "/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py", line 1336, in get 732793de051f:1897:1927 [3] NCCL INFO include/shm.h:41 -> 2
732793de051f:1897:1927 [3] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-b3d54cebe4167a34-0-2-3 (size 9637888)
Expected behavior
Expected model training to proceed smoothly using all 4 GPUs. When I run the script above with nproc_per_node=1 (or even 2), it runs smoothly, but setting it to 4 gives the strange errors shown above.
After updating PyTorch to 1.9.0, I face a different error:
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:832, unhandled system error, NCCL version 2.7.8 ncclSystemError: System call (socket, malloc, munmap, etc) failed.
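In case it helps with narrowing down environment problems like this one, the PyTorch/CUDA/NCCL details can be printed with a few standard torch calls. The snippet below is only a sketch, and the file name env_report.py is made up.

    # env_report.py -- illustrative helper for collecting environment details
    import torch
    import torch.distributed as dist

    print("torch version:", torch.__version__)
    print("CUDA runtime:", torch.version.cuda)
    print("visible GPUs:", torch.cuda.device_count())
    print("NCCL backend available:", dist.is_nccl_available())
    print("NCCL version:", torch.cuda.nccl.version())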