timoschick / pet

This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"
https://arxiv.org/abs/2001.07676
Apache License 2.0

Multiple GPUs training issue -- RuntimeError: arguments are located on different GPUs #33

Closed MarianCodrinCretu closed 3 years ago

MarianCodrinCretu commented 3 years ago

Hello!

I would kindly ask you to take a look at this question, if possible: I am trying to run the project with the following setup (e.g., for the RTE task), with all dependencies installed from requirements.txt (master branch):

!python cli.py \
--method pet \
--model_type albert \
--data_dir data/FewGLUE/RTE \
--task_name rte \
--model_name_or_path albert-xxlarge-v2 \
--output_dir ./models/petModels/rte_albert \
--do_train \
--do_eval \
--pet_per_gpu_eval_batch_size 8 \
--pet_per_gpu_train_batch_size 2 \
--pet_gradient_accumulation_steps 8 \
--pet_max_steps 250 \
--pet_max_seq_length 256 \
--pet_repetitions 3 \
--sc_per_gpu_train_batch_size 2 \
--sc_per_gpu_unlabeled_batch_size 2 \
--sc_gradient_accumulation_steps 8 \
--sc_max_steps 5000 \
--sc_max_seq_length 256 \
--sc_repetitions 1

and I am getting the following error only when I try to use multiple GPUs (on a machine with a single GPU, it works):

Traceback (most recent call last):
  File "cli.py", line 283, in <module>
    main()
  File "cli.py", line 264, in main
    no_distillation=args.no_distillation, seed=args.seed)
  File "/$PATH1/pet/pet/modeling.py", line 249, in train_pet
    save_unlabeled_logits=not no_distillation, seed=seed)
  File "/$PATH1/pet/pet/modeling.py", line 355, in train_pet_ensemble
    unlabeled_data=unlabeled_data))
  File "/$PATH1/pet/pet/modeling.py", line 459, in train_single_model
    temperature=config.temperature
  File "/$PATH1/pet/pet/wrapper.py", line 300, in train
    loss = TRAIN_STEP_FUNCTIONS[self.config.wrapper_type](self)(batch, **train_step_inputs)
  File "/$PATH1/pet/pet/wrapper.py", line 478, in mlm_train_step
    outputs = self.model(**inputs)
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_albert.py", line 814, in forward
    output_hidden_states=output_hidden_states,
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_albert.py", line 556, in forward
    input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/modeling_bert.py", line 178, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/$PATH2/anaconda3/envs/python3/lib/python3.6/site-packages/torch/nn/functional.py", line 1724, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:403

Do you have any ideas about what I should change to make it work, or am I doing something wrong? I would appreciate any hints or remarks!

timoschick commented 3 years ago

Unfortunately, I have no experience with running PET in a multi-GPU setting (for all of our experiments, we've used a single GPU). As the error occurs at the embedding layer (return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)) and the error message says arguments are located on different GPUs, my best guess would be that the inputs are on a different device than the model's embedding weights.
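
If it helps with debugging, here is a small hypothetical helper (not part of this repository) that prints where the model parameters and the batch tensors live right before the forward pass; `model` and `batch` stand in for the wrapped model and the input dict built in mlm_train_step:

```python
import torch

def check_devices(model, batch):
    """Hypothetical debugging helper: print the device of the model's parameters
    and of every tensor in the batch dict before calling model(**batch)."""
    # Unwrap DataParallel if present so we can reach the underlying module.
    module = model.module if isinstance(model, torch.nn.DataParallel) else model
    print("model parameters on:", next(module.parameters()).device)
    for name, tensor in batch.items():
        if torch.is_tensor(tensor):
            print(f"batch['{name}'] on:", tensor.device)
```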

The inputs are moved to the GPU in this line of wrapper.py, and a DataParallel wrapper is put around the model in this line. I found this discussion of an issue that seems very similar (it also proposes a solution that might be worth trying).
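
For reference, this is a minimal sketch of the single-wrap pattern that DataParallel expects (illustrative only, not the actual wrapper.py code): the model is wrapped exactly once, the batch is moved to the primary device, and DataParallel scatters it across the available GPUs during the forward pass.

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(8, 2).to(device)       # stand-in for the ALBERT wrapper
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)       # wrap once, and only once

batch = torch.randn(4, 8).to(device)           # inputs go to the primary device
outputs = model(batch)                         # DataParallel handles the scatter
```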

Sadly, I don't have the time right now to dive deeper into this issue. If you get PET to work with multiple GPUs, feel free to create a pull request :)

MarianCodrinCretu commented 3 years ago

Thank you for the feedback!

rubbybbs commented 3 years ago

Commenting out lines 359-360 in wrapper.py helped in my case. The error is caused by calling self.model = torch.nn.DataParallel(self.model) twice, once during training and again during evaluation.
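
As an alternative to commenting the lines out, one could guard the wrapping so it can never happen twice. This is only an illustration of the idea, not the repository's code, and `wrap_data_parallel_once` is a hypothetical helper name:

```python
import torch

def wrap_data_parallel_once(model, n_gpu):
    """Hypothetical helper: wrap the model in DataParallel only if it is not
    wrapped already, so calling this from both the training path and the
    evaluation path cannot nest two DataParallel layers around the same model."""
    if n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
        model = torch.nn.DataParallel(model)
    return model
```

Used in place of the unconditional self.model = torch.nn.DataParallel(self.model) assignments, the second call simply becomes a no-op.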

timoschick commented 3 years ago

Ah, that makes sense! Thanks, @rubbybbs - feel free to write a pull request with this modification if you find the time :)