huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Facing NCCL error on multi-GPU training (on a single machine) using run_glue.py script #10477

Closed aditya-malte closed 3 years ago

aditya-malte commented 3 years ago

Environment info

Who can help

Information

Model I am using (Bert, XLNet ...): DistilRoberta

The problem arises when using:

The task I am working on is:

Regression task with a single output, using BertForSequenceClassification

To reproduce

Steps to reproduce the behavior:

1. python -m torch.distributed.launch --nproc_per_node 4 /home/run_glue.py --train_file /home/data/train.csv --validation_file /home/data/dev.csv --test_file /home/data/test.csv --model_name_or_path distilroberta-base --output_dir /home/model --num_train_epochs 5 --per_device_train_batch_size 1 --per_device_eval_batch_size 16 --do_train --do_eval --fp16 --gradient_accumulation_steps 2 --do_predict --logging_steps 100 --evaluation_strategy steps --save_steps 100 --overwrite_output_dir

File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group 732793de051f:1895:1925 [1] NCCL INFO transport/shm.cc:101 -> 2 732793de051f:1895:1925 [1] NCCL INFO transport.cc:30 -> 2 732793de051f:1895:1925 [1] NCCL INFO transport.cc:49 -> 2 732793de051f:1895:1925 [1] NCCL INFO init.cc:766 -> 2 732793de051f:1895:1925 [1] NCCL INFO init.cc:840 -> 2 732793de051f:1895:1925 [1] NCCL INFO group.cc:73 -> 2 [Async thread] barrier() File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier Traceback (most recent call last): File "/home/run_text_classification.py", line 480, in work = _default_pg.barrier() RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729138878/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8 main() File "/home/run_text_classification.py", line 163, in main model_args, data_args, training_args = parser.parse_args_into_dataclasses() File "/opt/conda/lib/python3.7/site-packages/transformers/hf_argparser.py", line 180, in parse_args_into_dataclasses obj = dtype(*inputs) File "", line 60, in init File "/opt/conda/lib/python3.7/site-packages/transformers/training_args.py", line 478, in __post_init__ if is_torch_available() and self.device.type != "cuda" and self.fp16: File "/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py", line 1346, in wrapper return func(args, **kwargs) File "/opt/conda/lib/python3.7/site-packages/transformers/training_args.py", line 583, in device return self._setup_devices

732793de051f:1897:1927 [3] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device File "/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py", line 1336, in get 732793de051f:1897:1927 [3] NCCL INFO include/shm.h:41 -> 2

732793de051f:1897:1927 [3] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-b3d54cebe4167a34-0-2-3 (size 9637888)
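One likely explanation for the posix_fallocate "No space left on device" warning is that NCCL's shared-memory transport creates its nccl-shm-* segments under /dev/shm, which is often only 64 MiB inside a default Docker container, and four ranks need more segments than two. A quick diagnostic sketch (my addition, assuming a Linux host or container with /dev/shm mounted):

# Check how large the shared-memory filesystem is; NCCL's SHM transport
# allocates its nccl-shm-* segments there, so a small /dev/shm can fail
# with "No space left on device" once more ranks are involved.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**20:.0f} MiB, free={free / 2**20:.0f} MiB")

If the reported size is tiny, common workarounds are starting the container with a larger --shm-size (or --ipc=host), or setting NCCL_SHM_DISABLE=1 so NCCL falls back to another transport.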

Expected behavior

Expected model training to proceed smoothly using 4 GPUs. When I run the said script with nproc_per_node=1 (or even 2), it runs smoothly, but setting it to 4 gives the strange errors above.

After updating to PyTorch 1.9.0 I face a different error:

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:832, unhandled system error, NCCL version 2.7.8 ncclSystemError: System call (socket, malloc, munmap, etc) failed.
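A minimal NCCL sanity check, independent of run_glue.py, can show whether plain NCCL communication across all four processes works at all. This is only a sketch (the file name nccl_check.py is illustrative), launched the same way as the training script:

# nccl_check.py -- minimal NCCL sanity check, launched with e.g.
#   python -m torch.distributed.launch --nproc_per_node 4 nccl_check.py
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")  # same backend the Trainer uses for multi-GPU

# Every rank contributes a tensor of ones; after the all_reduce each rank
# should print the world size (4 here) if NCCL is healthy.
t = torch.ones(1, device=f"cuda:{args.local_rank}")
dist.all_reduce(t)
print(f"rank {dist.get_rank()}: all_reduce result = {t.item()}")

If this already fails with the same shared-memory warnings, the problem is in the NCCL/driver/container setup rather than in the training script.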

patil-suraj commented 3 years ago

cc @sgugger

sgugger commented 3 years ago

This seems like a problem with the NCCL install in your environment: if your script can run on two GPUs, there is nothing in the code to change to make it run on four, so this is not a bug in the training script or in Transformers. I have never seen that particular NCCL error, so I'm afraid I can't really help debug it.
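A quick way to see what the environment actually provides before reinstalling (a sketch; these are standard PyTorch utilities):

# Report the PyTorch/CUDA build, the visible GPUs, and the NCCL version
# PyTorch was built with; a device count below 4 or an unexpected NCCL
# version would point at the environment rather than the script.
import torch

print("torch", torch.__version__, "cuda", torch.version.cuda)
print("visible GPUs:", torch.cuda.device_count())
print("NCCL version:", torch.cuda.nccl.version())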

aditya-malte commented 3 years ago

Thanks for the quick reply. Yeah, it’s strange that it works on 2 GPUs but not on 4. Will check again and let you know.

aditya-malte commented 3 years ago

@sgugger just to clarify: the system has 4 GPUs. It's only the nproc_per_node argument I'm changing (from 1 to 2, 4, etc.). I just want to make sure I haven't misunderstood the cause of the error. Right?

sgugger commented 3 years ago

Yes, I understood that. The PyTorch launcher spawns as many processes as the number you pass, and those processes use that many GPUs (the others stay idle).
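Roughly, each process spawned by the launcher ends up doing something like the following (a simplified sketch of the launcher plus Trainer setup, not the actual code); with --nproc_per_node 2 only GPUs 0 and 1 are claimed and the other two stay idle:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# torch.distributed.launch starts N copies of the script, sets RANK, WORLD_SIZE
# and MASTER_ADDR/MASTER_PORT, and hands each copy a local rank (via the
# --local_rank argument or the LOCAL_RANK environment variable, depending on
# the PyTorch version). Each copy then claims exactly one GPU.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

model = torch.nn.Linear(10, 1).cuda(local_rank)   # stand-in for the real model
model = DistributedDataParallel(model, device_ids=[local_rank])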

aditya-malte commented 3 years ago

Thanks. Just wanted to confirm that. Will try reinstalling the environment and update if I find the solution.

aditya-malte commented 3 years ago

Hi @sgugger, good news: the problem does seem to have been an environment issue. Thanks for the instant help!

ljz756245026 commented 3 years ago

I still run into the same problem. Could you please tell me how to solve it?