huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

tensor size mismatch with larger gradient_accumulation_steps and less training data #25695

Open yyymeta opened 1 year ago

yyymeta commented 1 year ago

System Info

NVIDIA A100 80GB GPU

Who can help?

No response

Information

Tasks

Reproduction

It seems that when I have fewer training examples (1,000 or so) and use a larger gradient_accumulation_steps (32), I get a tensor size mismatch on the Adam gradient update:

\"/tmp/jetter.yrcudeja/torch/optim/optimizer.py\", line 33, in _use_grad\n    ret = func(self, *args, **kwargs)\n  File \"/tmp/jetter.yrcudeja/torch/optim/adamw.py\", line 173, in step\n    adamw(\n  File \"/tmp/jetter.yrcudeja/torch/optim/adamw.py\", line 323, in adamw\n    func(\n  File \"/tmp/jetter.yrcudeja/torch/optim/adamw.py\", line 502, in _multi_tensor_adamw\n    torch._foreach_add_(device_exp_avgs, device_grads, alpha=1 - beta1)\nRuntimeError: The size of tensor a (8192384) must match the size of tensor b (262156288) at non-singleton dimension 0\n", "errorTraits": null, "timestamp_us": 1692818123557766}
[4]:  File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[4]:    return _run_code(code, main_globals, None,
[4]:  File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 87, in _run_code
[4]:    exec(code, run_globals)
[4]:  File "/tmp/jetter.12kzp8qf/aml/comment/llama_finetune/train.py", line 150, in <module>
[4]:    train()
[4]:  File "/tmp/jetter.12kzp8qf/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
[4]:    return f(*args, **kwargs)
[4]:  File "/tmp/jetter.12kzp8qf/aml/comment/llama_finetune/train.py", line 124, in train
[4]:    trainer.train()
[4]:  File "/tmp/jetter.12kzp8qf/transformers/trainer.py", line 1664, in train
[4]:    return inner_training_loop(
[4]:  File "/tmp/jetter.12kzp8qf/transformers/trainer.py", line 1998, in _inner_training_loop
[4]:    self.optimizer.step()
[4]:  File "/tmp/jetter.12kzp8qf/torch/optim/lr_scheduler.py", line 69, in wrapper
[4]:    return wrapped(*args, **kwargs)
[4]:  File "/tmp/jetter.12kzp8qf/torch/optim/optimizer.py", line 280, in wrapper
[4]:    out = func(*args, **kwargs)
[4]:  File "/tmp/jetter.12kzp8qf/torch/optim/optimizer.py", line 33, in _use_grad
[4]:    ret = func(self, *args, **kwargs)
[4]:  File "/tmp/jetter.12kzp8qf/torch/optim/adamw.py", line 173, in step
[4]:    adamw(
[4]:  File "/tmp/jetter.12kzp8qf/torch/optim/adamw.py", line 323, in adamw
[4]:    func(
[4]:  File "/tmp/jetter.12kzp8qf/torch/optim/adamw.py", line 502, in _multi_tensor_adamw
[4]:    torch._foreach_add_(device_exp_avgs, device_grads, alpha=1 - beta1)
[4]:RuntimeError: The size of tensor a (8192384) must match the size of tensor b (262156288) at non-singleton dimension 0
[7]:ERROR:aiplatform.error_reporting.error_reporting:Exception Found: The size of tensor a (8192384) must match the size of tensor b (262156288) at non-singleton dimension 0
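
For context, here is a minimal sketch of the operation that fails (not my actual training code; the tensor sizes are shrunk stand-ins for the 8192384 vs 262156288 reported above):

```python
import torch

# Minimal sketch, not the real Trainer code path: the multi-tensor AdamW
# implementation updates the exp_avg state in place with torch._foreach_add_,
# so a state tensor whose size differs from its gradient fails at that call.
beta1 = 0.9
exp_avg = torch.zeros(8)     # stand-in for the (smaller) optimizer state
grad = torch.zeros(8 * 32)   # stand-in for a gradient 32x larger

try:
    torch._foreach_add_([exp_avg], [grad], alpha=1 - beta1)
except RuntimeError as err:
    print(err)  # size-mismatch error analogous to the one in the traceback
```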

Using gradient_accumulation_steps=1 fixes it, but that has some impact on model quality.

The command was:

/packages/torchx_python/python -m torch.distributed.run --rdzv_backend zeus --rdzv_id torchx-llama_finetune_train-k64xgt1h4dt52c --nnodes 4 --nproc_per_node 8 --tee 3 --role -m aml.comment.llama_finetune.train --local_dir /tmp/users --model_manifold_bucket pi_adv_problems --model_manifold_dir tree/dpa_llama --input_model_filename 7B-converted --output_model_filename yytest__v7_instagram_basic_5e-6 --data_path manifold://pi_adv_problems/tree/appreview_llama/data/v7/traininstagram_basic.json --eval_data_path manifold://pi_adv_problems/tree/appreview_llama/data/v7/evalinstagram_basic.json --data_task generic --prompt_temp normal --processed True --model_max_length 1024 --num_train_epochs 30 --per_device_train_batch_size 2 --per_device_eval_batch_size 8 --gradient_accumulation_steps 32 --evaluation_strategy steps --eval_steps 10 --save_strategy steps --save_steps 200 --save_total_limit 1 --learning_rate 5e-6 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 50 --fsdp full_shard auto_wrap --fsdp_transformer_layer_cls_to_wrap LlamaDecoderLayer --bf16 True --tf32 True
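
For reference, the training-related flags above map roughly onto the following TrainingArguments (a sketch of the configuration rather than my actual script; the output_dir here is illustrative):

```python
from transformers import TrainingArguments

# Rough mapping of the command-line flags above; a sketch, not the full script.
training_args = TrainingArguments(
    output_dir="yytest__v7_instagram_basic_5e-6",  # illustrative path
    num_train_epochs=30,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=32,
    evaluation_strategy="steps",
    eval_steps=10,
    save_strategy="steps",
    save_steps=200,
    save_total_limit=1,
    learning_rate=5e-6,
    weight_decay=0.0,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=50,
    fsdp="full_shard auto_wrap",
    fsdp_transformer_layer_cls_to_wrap="LlamaDecoderLayer",
    bf16=True,
    tf32=True,
)
```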

Expected behavior

Note that the above error shows:

yyy@yyy-mbp ~ % echo '262156288/8192384' | bc -l
32.00000000000000000000

so it seems that somehow only one gradient is obtained while maybe 32 are expected?
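
The same check in Python; note that 32 also happens to equal the world size of the launch above (4 nodes × 8 processes), so the sizes alone don't say whether the factor comes from the accumulation steps or from FSDP sharding:

```python
# Sanity check on the sizes from the traceback (equivalent to the bc line above).
print(262156288 / 8192384)  # 32.0, equal to gradient_accumulation_steps
print(4 * 8)                # 32, also the world size (--nnodes 4, --nproc_per_node 8)
```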

ArthurZucker commented 1 year ago

Hey! If you want help from our trainer expert @pacman100, we're gonna need to have a look at the training script, or at least have a reproducer.
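
Something along these lines would be enough as a skeleton (a sketch only: the tiny checkpoint name and the dummy dataset are placeholders, not from the original script, and it needs to be launched with torchrun on several GPUs so that FSDP is actually active):

```python
# Sketch of a minimal reproducer. Launch with torchrun so FSDP is active, e.g.:
#   torchrun --nproc_per_node 8 repro.py
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments


class DummyLMDataset(Dataset):
    """Small random dataset so the run has only a handful of optimizer steps."""

    def __init__(self, n=1000, seq_len=128, vocab_size=1000):
        # Keep token ids small so they are valid for whatever tiny checkpoint is used.
        self.data = torch.randint(0, vocab_size, (n, seq_len))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        ids = self.data[i]
        return {"input_ids": ids, "attention_mask": torch.ones_like(ids), "labels": ids.clone()}


def main():
    # Placeholder checkpoint; any small LLaMA-style model should do.
    model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-LlamaForCausalLM")
    args = TrainingArguments(
        output_dir="/tmp/repro",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=32,  # large accumulation, few examples
        fsdp="full_shard auto_wrap",
        fsdp_transformer_layer_cls_to_wrap="LlamaDecoderLayer",
        bf16=True,
        logging_steps=1,
    )
    Trainer(model=model, args=args, train_dataset=DummyLMDataset()).train()


if __name__ == "__main__":
    main()
```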

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

cattleyawlotus commented 7 months ago

Hi, I have encountered the same problem: fewer training examples and the Adam optimizer. May I ask whether you have resolved it, and if so, how?

farzadab commented 1 month ago

Same issue here. I get this error when doing FSDP training. There is no error when using a larger gradient_accumulation_steps without FSDP, and also no error when gradient_accumulation_steps is 1.

I'd love to know if either of you solved it.
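
In case it helps narrow things down, here is a rough diagnostic sketch (the helper name is mine; it assumes you can reach the optimizer object, e.g. from a breakpoint just before the failing step). It lists the parameters whose AdamW state no longer matches their gradient shape:

```python
# Rough diagnostic sketch, not a fix: print every parameter whose AdamW
# exp_avg state has a different shape than its current gradient, to confirm
# where the sharded-vs-unsharded mismatch suggested by the 32x factor occurs.
def report_state_grad_mismatches(optimizer):
    for group in optimizer.param_groups:
        for p in group["params"]:
            state = optimizer.state.get(p, {})
            exp_avg = state.get("exp_avg")
            if p.grad is not None and exp_avg is not None and p.grad.shape != exp_avg.shape:
                print(f"mismatch: grad {tuple(p.grad.shape)} vs exp_avg {tuple(exp_avg.shape)}")
```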

farzadab commented 1 month ago

Same issue reported here: https://discuss.huggingface.co/t/errors-when-using-gradient-accumulation-with-fsdp-peft-lora-sfttrainer/105006

ArthurZucker commented 1 week ago

Thanks, I will re-open this and add it to the list of tracked issues.