yyymeta opened this issue 1 year ago (status: Open)
Hey! If you want help from our trainer expert @pacman100, we'll need to have a look at the training script, or at least a reproducer.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi, I have encountered the same problem: fewer training examples and the Adam optimizer. May I ask if you have resolved it, and if so, how?
Same issue here. I get this error when doing FSDP training. There is no error when using a larger gradient_accumulation_steps without FSDP, and also no error when gradient_accumulation_steps is 1.
I'd love to know if either of you solved it.
Thanks, will re-open and add to the list of tracked issues
System Info
NVIDIA A100 80GB GPU
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
It seems that when I have fewer training examples (1000 or so) and use a larger gradient_accumulation_steps (32), I get a tensor size mismatch on the Adam gradient update:
Using gradient_accumulation_steps=1 fixes it, but that has some impact on model quality; a workaround that keeps the effective batch size constant is sketched below.
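One such workaround is to trade gradient_accumulation_steps against per_device_train_batch_size. A minimal sketch with the standard transformers TrainingArguments (the output_dir is a placeholder, and whether the larger per-device batch fits in 80GB at model_max_length=1024 is an assumption to verify):

```python
from transformers import TrainingArguments

# Effective batch = per_device_train_batch_size * gradient_accumulation_steps
# * world_size. With 4 nodes x 8 GPUs (world_size 32), the original run used
# 2 * 32 * 32 = 2048 samples per optimizer step.
args = TrainingArguments(
    output_dir="out",                  # placeholder path
    per_device_train_batch_size=8,     # raised from 2 (verify GPU memory)
    gradient_accumulation_steps=8,     # lowered from 32: 8 * 8 * 32 = 2048
    bf16=True,
    tf32=True,
    fsdp="full_shard auto_wrap",
    fsdp_transformer_layer_cls_to_wrap="LlamaDecoderLayer",
)
```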
The command was:
/packages/torchx_python/python -m torch.distributed.run \
  --rdzv_backend zeus --rdzv_id torchx-llama_finetune_train-k64xgt1h4dt52c \
  --nnodes 4 --nproc_per_node 8 --tee 3 --role \
  -m aml.comment.llama_finetune.train \
  --local_dir /tmp/users \
  --model_manifold_bucket pi_adv_problems \
  --model_manifold_dir tree/dpa_llama \
  --input_model_filename 7B-converted \
  --output_model_filename yytest__v7_instagram_basic_5e-6 \
  --data_path manifold://pi_adv_problems/tree/appreview_llama/data/v7/traininstagram_basic.json \
  --eval_data_path manifold://pi_adv_problems/tree/appreview_llama/data/v7/evalinstagram_basic.json \
  --data_task generic --prompt_temp normal --processed True \
  --model_max_length 1024 --num_train_epochs 30 \
  --per_device_train_batch_size 2 --per_device_eval_batch_size 8 \
  --gradient_accumulation_steps 32 \
  --evaluation_strategy steps --eval_steps 10 \
  --save_strategy steps --save_steps 200 --save_total_limit 1 \
  --learning_rate 5e-6 --weight_decay 0. --warmup_ratio 0.03 \
  --lr_scheduler_type cosine --logging_steps 50 \
  --fsdp full_shard auto_wrap \
  --fsdp_transformer_layer_cls_to_wrap LlamaDecoderLayer \
  --bf16 True --tf32 True
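Since the training script itself isn't public, here is a minimal reproducer sketch of what I believe the relevant setup looks like, using only the public Trainer API. The checkpoint directory and the toy dataset are stand-ins, and it would be launched with torch.distributed.run as above:

```python
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

class ToyDataset(Dataset):
    """About 1000 identical short examples, matching the
    'fewer training examples' condition that triggers the error."""
    def __init__(self, tokenizer, n=1000):
        ids = tokenizer("hello world", return_tensors="pt").input_ids[0]
        self.example = {"input_ids": ids, "labels": ids.clone()}
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        return self.example

model = AutoModelForCausalLM.from_pretrained("7B-converted")   # local LLaMA dir
tokenizer = AutoTokenizer.from_pretrained("7B-converted")

args = TrainingArguments(
    output_dir="out",                  # placeholder path
    num_train_epochs=30,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=32,    # the combination that fails under FSDP
    learning_rate=5e-6,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=50,
    bf16=True,
    tf32=True,
    fsdp="full_shard auto_wrap",
    fsdp_transformer_layer_cls_to_wrap="LlamaDecoderLayer",
)

Trainer(model=model, args=args, train_dataset=ToyDataset(tokenizer)).train()
```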
Expected behavior
Note that the two tensor sizes in the above error differ by a factor of exactly 32, which is the gradient_accumulation_steps value:

yyy@yyy-mbp ~ % echo '262156288/8192384' | bc -l
32.00000000000000000000

so somehow it seems only one gradient is obtained while maybe 32 are expected?
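Spelling out the arithmetic behind that guess (a quick sanity check, not output from the failing run): the larger tensor in the mismatch is exactly gradient_accumulation_steps times the smaller one.

```python
# Sizes taken from the mismatch error above.
big, small = 262156288, 8192384

ratio = big / small
print(f"{big} / {small} = {ratio}")   # 32.0
assert ratio == 32.0                  # exactly gradient_accumulation_steps
```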