Open eswarthammana opened 1 year ago
I faced a similar issue. I added a condition like the one below in run_gen.py
(around line 75):

```python
outputs = model(input_ids=source_ids, attention_mask=source_mask,
                labels=target_ids, decoder_attention_mask=target_mask)
loss = outputs.loss
if args.n_gpu > 1:
    loss = loss.mean()  # average the per-GPU losses gathered by DataParallel
```

It now works for me.
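To see why the `loss.mean()` guard is needed, here is a minimal sketch (with a hand-made tensor standing in for what `DataParallel` actually gathers): with `n_gpu > 1`, the loss comes back as a vector with one entry per GPU, not a scalar.

```python
import torch

# Simulate what DataParallel hands back with 2 replicas: the per-GPU
# scalar losses are gathered along dim 0 into a 1-D tensor (this is also
# what the "will instead unsqueeze and return a vector" warning refers to).
n_gpu = 2
loss = torch.tensor([0.31, 0.27])  # one loss value per GPU (made-up numbers)

# The guard from the fix above: reduce to a scalar before backward()/item()
if n_gpu > 1:
    loss = loss.mean()

# loss is now a 0-dim tensor, so loss.item() and loss.backward() work
```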
Hi, I'm unable to finetune with multiple GPUs. Can @eswarthammana or @alibrahimzada tell me about any modifications required to the scripts for this?
Tx
Make sure you execute your script with torchrun rather than python3/python. I don't think there are other requirements for multi-GPU execution.
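A sketch of the launch command (assumption: 2 GPUs on the machine; all other run_gen.py flags stay as set in exp_with_args.sh):

```shell
# torchrun spawns one process per GPU and sets up the distributed
# environment variables that plain python3 does not.
torchrun --nproc_per_node=2 run_gen.py
```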
Hi @Sleepyhead01,
the change I tried is in exp_with_args.sh: at the end of the file, modify the ${GPU} value in CUDA_VISIBLE_DEVICES=${GPU} to 0,1. Through the script's arguments it accepts only a single integer, so you cannot pass more than one device id that way.
As @alibrahimzada mentioned, also modify the loss to loss.mean().
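A sketch of that change at the end of exp_with_args.sh (assumption: GPU is the shell variable expanded into CUDA_VISIBLE_DEVICES): hard-code a comma-separated device list instead of the single integer the script's arguments accept.

```shell
# Hard-coded device list; everything downstream sees two visible GPUs.
GPU="0,1"
CUDA_VISIBLE_DEVICES=${GPU} echo "training on GPUs ${GPU}"
```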
Training with multiple GPUs starts with this modification. However, eval_bleu_epoch gives the following error:
```
Traceback (most recent call last):
  File "CodeT5/run_gen.py", line 392, in <module>
    main()
  File "CodeT5/run_gen.py", line 319, in main
    result = eval_bleu_epoch(args, eval_data, eval_examples, model, tokenizer, 'dev', 'e%d' % cur_epoch)
  File "CodeT5/run_gen.py", line 109, in eval_bleu_epoch
    preds = model.generate(source_ids,
  File "anaconda3/envs/Old_R/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DataParallel' object has no attribute 'generate'
```
Any fix for this? Tx
@Sleepyhead01 you need to call model.module.generate(), because for n_gpu > 1 the model is wrapped in DataParallel and the underlying model is its .module attribute. To get the model back, call .module on it.
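A minimal sketch of that fix (the `unwrap` helper is hypothetical, not part of run_gen.py): DataParallel only proxies `forward()`, so custom methods like the HuggingFace `generate()` must be called on the wrapped model via `.module`.

```python
import torch.nn as nn

def unwrap(model):
    # DataParallel stores the original model as its .module attribute;
    # plain models pass through unchanged, so this is safe for n_gpu == 1.
    return model.module if isinstance(model, nn.DataParallel) else model

# In eval_bleu_epoch the call would then look roughly like:
# preds = unwrap(model).generate(source_ids, ...)
```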
Unfortunately the authors have not maintained these scripts with newer versions of torch.
Dear Team,
I tried to train the model with 2 GPUs (devices 0,1) and faced the following problem, which I had not faced with a single GPU. Could you please help me solve the issue?
Environment: Kaggle Accelerator: GPU T4 x 2
```
/opt/conda/lib/python3.7/site-packages/transformers/optimization.py:395: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
  FutureWarning,
Training:   0%|          | 0/3125 [00:00<?, ?it/s]
/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
[0] Train loss 0.258: 100%|██████████| 3125/3125 [29:17<00:00, 1.78it/s]
100%|██████████| 2000/2000 [00:07<00:00, 273.69it/s]
Eval ppl:   0%|          | 0/63 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/kaggle/working/CodeT5/run_gen.py", line 387, in <module>
    main()
  File "/kaggle/working/CodeT5/run_gen.py", line 265, in main
    eval_ppl = eval_ppl_epoch(args, eval_data, eval_examples, model, tokenizer)
  File "/kaggle/working/CodeT5/run_gen.py", line 75, in eval_ppl_epoch
    eval_loss += loss.item()
ValueError: only one element tensors can be converted to Python scalars
```
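This is the same DataParallel issue surfacing in evaluation: the gathered loss is a vector with one entry per GPU, so `loss.item()` raises the ValueError above. A sketch of the eval-side fix (assumption: it mirrors the training-loop guard inside run_gen.py's eval_ppl_epoch loop, with a hand-made tensor standing in for the real loss):

```python
import torch

eval_loss, n_gpu = 0.0, 2
loss = torch.tensor([1.2, 1.4])  # simulated per-GPU losses (made-up numbers)

# Reduce to a scalar before .item(), exactly as in the training loop.
if n_gpu > 1:
    loss = loss.mean()
eval_loss += loss.item()  # now a plain Python float, no ValueError
```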