huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Why is training time much longer than for the same task in Fairseq? #17075

Closed · ElderWanng closed this issue 2 years ago

ElderWanng commented 2 years ago

System Info

- `transformers` version: 4.19.0.dev0
- Platform: Linux-4.18.0-305.28.1.el8_4.x86_64-x86_64-with-glibc2.27
- Python version: 3.9.12
- Huggingface_hub version: 0.2.1
- PyTorch version (GPU?): 1.11.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Who can help?

No response

Information

Tasks

Reproduction

Device: Tesla V100-PCIE-32GB * 1

For HF's Trainer:

python run_summarization.py \
    --model_name_or_path facebook/bart-large \
    --do_train \
    --do_eval \
    --do_predict \
    --dataset_name xsum \
    --output_dir /scratch/tw2112/datas/HFPLAY2 \
    --overwrite_output_dir \
    --max_grad_norm 0.1 \
    --label_smoothing_factor 0.1 \
    --fp16 True \
    --learning_rate 3e-05 \
    --lr_scheduler_type polynomial \
    --greater_is_better True \
    --warmup_steps 500 \
    --num_train_epochs 1 \
    --max_source_length 1024 \
    --max_target_length 1024 \
    --val_max_target_length 80 \
    --gradient_accumulation_steps 1 \
    --weight_decay 0.01 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 24 \
    --num_beams 6 \
    --save_strategy steps \
    --save_steps 2000 \
    --evaluation_strategy steps \
    --eval_steps 2000 \
    --load_best_model_at_end True \
    --metric_for_best_model loss \
    --greater_is_better False

The estimated time for 1 epoch is about 3.3h

For FairSeq:

WARMUP_UPDATES=500      
LR=3e-05
MAX_TOKENS=2048
BART_PATH=/scratch/tw2112/codes/models/bart.large/model.pt

fairseq-train xsum \
   --restore-file $BART_PATH \
   --task translation \
   --source-lang source --target-lang target \
   --truncate-source \
   --batch-size 8 \
   --layernorm-embedding \
   --share-all-embeddings \
   --share-decoder-input-output-embed \
   --reset-optimizer --reset-dataloader --reset-meters \
   --required-batch-size-multiple 1 \
   --arch bart_large \
   --criterion label_smoothed_cross_entropy  \
   --label-smoothing 0.1 \
   --dropout 0.1 --attention-dropout 0.1 \
   --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
   --clip-norm 0.1 \
   --lr-scheduler polynomial_decay --lr $LR  --warmup-updates $WARMUP_UPDATES \
   --fp16 \
   --max-epoch 1 \
   --skip-invalid-size-inputs-valid-test \
   --find-unused-parameters;

The estimated time for 1 epoch is about 1.5h

Expected behavior

I checked the config namespace in fairseq's log and set the options according to the example in the Fairseq repo. Both commands train on XSUM for 1 epoch with batch=8 and gradient_accumulation_steps=1, yet HF's time per epoch is more than twice fairseq's. Did I do anything wrong?
ElderWanng commented 2 years ago

It seems I didn't set up fp16 training correctly: I should launch the training with torchrun --nproc_per_node=4 and set --sharded_ddp zero_dp_3.
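
The launch line would look something like this (illustrative sketch; keep the rest of the run_summarization.py arguments the same as in the single-GPU command above):

torchrun --nproc_per_node=4 run_summarization.py \
    --model_name_or_path facebook/bart-large \
    --do_train \
    --dataset_name xsum \
    --output_dir /scratch/tw2112/datas/HFPLAY2 \
    --per_device_train_batch_size 8 \
    --fp16 True \
    --sharded_ddp zero_dp_3
    # ...plus the remaining arguments from the single-GPU command above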

leminhyen2 commented 2 years ago

@ElderWanng Hey, I'm fairly new to machine learning. Can you tell me where you run torchrun --nproc_per_node=4 and --sharded_ddp zero_dp_3? Thank you

ElderWanng commented 2 years ago

> @ElderWanng Hey, I'm fairly new to machine learning. Can you tell me where you run torchrun --nproc_per_node=4 and --sharded_ddp zero_dp_3? Thank you

torchrun is the replacement CLI for 'python -m torch.distributed.launch' in the newest PyTorch (1.11), see https://pytorch.org/docs/stable/elastic/run.html. BTW, I'm using DeepSpeed now, it's faster; the method above is for fairscale. The start command is:

deepspeed --num_gpus=4 run_summarization.py \
    --model_name_or_path facebook/bart-large \
    --do_train --do_predict \
    --dataset_name xsum \
    --overwrite_output_dir \
    --learning_rate 3e-05 \
    --label_smoothing_factor 0.1 \
    --greater_is_better True \
    --warmup_steps 500 \
    --num_train_epochs 10 \
    --max_source_length 1024 \
    --max_target_length 1024 \
    --val_max_target_length 80 \
    --gradient_accumulation_steps 2 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 32 \
    --num_beams 6 \
    --save_strategy steps \
    --save_steps 750 \
    --evaluation_strategy steps \
    --eval_steps 750 \
    --load_best_model_at_end True \
    --metric_for_best_model loss \
    --greater_is_better False \
    --predict_with_generate \
    --fp16 True \
    --deepspeed deepspeed_config/ds_config_zero3.json

The config json can be found in https://github.com/huggingface/transformers/tree/main/tests/deepspeed
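
For reference, a minimal ds_config_zero3.json in the spirit of the ones in that folder could look roughly like this (a sketch, not the exact file from the repo; the "auto" values are resolved by the Trainer from the command-line arguments):

# Write a minimal ZeRO-3 config next to the training script (adjust as needed)
mkdir -p deepspeed_config
cat > deepspeed_config/ds_config_zero3.json <<'EOF'
{
  "fp16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
EOF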

leminhyen2 commented 2 years ago

This is awesome, thank you for sharing :) Do you think it's possible to include deepspeed or fairscale in the code, maybe as a parameter in the Trainer args? Or is the only option to write everything in a .py file and then run it with torchrun or deepspeed?

ElderWanng commented 2 years ago

> This is awesome, thank you for sharing :) Do you think it's possible to include deepspeed or fairscale in the code, maybe as a parameter in the Trainer args? Or is the only option to write everything in a .py file and then run it with torchrun or deepspeed?

In the HF Trainer I didn't see a way to do it (I didn't dive in too much, so maybe the answer is yes). But I wrote another version in pytorch-lightning, which makes 'launcher as arguments' possible, and I reproduced the same ROUGE score on XSUM with the same training time. I'm glad to share it if you want.

ElderWanng commented 2 years ago

> This is awesome, thank you for sharing :) Do you think it's possible to include deepspeed or fairscale in the code, maybe as a parameter in the Trainer args? Or is the only option to write everything in a .py file and then run it with torchrun or deepspeed?

Oh, BTW, fairscale is an extension of torch now. Even without activating fairscale, I still use torchrun as the launcher instead of python run_summarization.py. For 'launcher as arguments', the answer is yes, at least for fairscale: you can simply switch it on/off by changing --strategy. I think in 2022 sharded training should be the default training setting rather than naive DDP.

For DeepSpeed, I'm not sure whether the HF Trainer exposes clean CLI args for it; I just followed the tutorial at https://huggingface.co/docs/transformers/main_classes/trainer

leminhyen2 commented 2 years ago

Thank you so much for the insight! I'm currently fine-tuning on Colab, where the standard is to run code cell by cell, so I was looking for ways to integrate the speedup in code. But now that I look back, fairseq kind of just wraps most of the training code into its train.py file

Anyway, I'm glad that someone confirmed that Huggingface training can be as fast as Fairseq

ElderWanng commented 2 years ago

> Thank you so much for the insight! I'm currently fine-tuning on Colab, where the standard is to run code cell by cell, so I was looking for ways to integrate the speedup in code. But now that I look back, fairseq kind of just wraps most of the training code into its train.py file

> Anyway, I'm glad that someone confirmed that Huggingface training can be as fast as Fairseq

Now I understand why you have to use args in a pure Python environment. My advice is to try to activate fp16 training, that helps a lot. I'm not sure if Colab supports fp16 training. I used 4 x RTX8000 cards, 10 epochs, about 40 min/epoch.

leminhyen2 commented 2 years ago

FP16 is activated and I run the code on one V100 GPU. When I was training fairseq, I enabled FP16 too.

HuggingFace seemed to take a lot longer for one Transformer Base epoch (1 hour), while Fairseq took just around 20-30 min for one Transformer Big epoch. If only HuggingFace could improve the speed by default, it would totally obliterate Fairseq

ElderWanng commented 2 years ago

fairscale or deepspeed would bring it to about 40 min (still can't catch up with fairseq from two years ago)

leminhyen2 commented 2 years ago

So from your experience, even with all the optimizations, it's still not as fast as fairseq? How much slower would you say the most optimized HF training is compared with plain fairseq?

ElderWanng commented 2 years ago

I forget the exact numbers, but it is close enough. I think the sacrifice in training time is worth it compared to the inconvenience of fairseq.

leminhyen2 commented 2 years ago

Ah ok, good to hear. I'll apply all the optimizations you mentioned; hopefully it will get down to at least 40 min per epoch.

leminhyen2 commented 2 years ago

Hey, it's me again. I tried a few approaches, but surprisingly the most effective one is very simple: I simply increased the batch size via --per_device_train_batch_size. If you double the batch size, the training is literally twice as fast. One ML engineer told me a bigger batch size enables more efficient parallelization.

ElderWanng commented 2 years ago

> Hey, it's me again. I tried a few approaches, but surprisingly the most effective one is very simple: I simply increased the batch size via --per_device_train_batch_size. If you double the batch size, the training is literally twice as fast. One ML engineer told me a bigger batch size enables more efficient parallelization.

lol, fairscale or deepspeed saves GPU RAM by reducing unnecessary optimizer-state overhead, so you can fit a bigger batch into training. Seems I forgot to make it clear: turn on those advanced settings and then double the batch size. Fairseq uses a dynamic batch size, loading as many tokens as the currently available GPU RAM allows, but that code is hard to extract and apply to other frameworks (TBH, from an engineering standpoint it's hard to call fairseq a good framework). For me the best setting in the HF Trainer is DeepSpeed ZeRO stage 2 with no CPU offload, then adjusting the batch size to occupy all the GPU RAM.
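
A stage-2 config with offload left out might look roughly like this (again a sketch; the "auto" values are filled in by the Trainer, and the deepspeed_config/ path just mirrors the earlier command):

# Write a minimal ZeRO-2 config with no CPU offload (illustrative)
mkdir -p deepspeed_config
cat > deepspeed_config/ds_config_zero2.json <<'EOF'
{
  "fp16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
EOF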

Nana12345678910 commented 2 years ago

What do you want me to do?


leminhyen2 commented 2 years ago

> Hey, it's me again. I tried a few approaches, but surprisingly the most effective one is very simple: I simply increased the batch size via --per_device_train_batch_size. If you double the batch size, the training is literally twice as fast. One ML engineer told me a bigger batch size enables more efficient parallelization.

> lol, fairscale or deepspeed saves GPU RAM by reducing unnecessary optimizer-state overhead, so you can fit a bigger batch into training. Seems I forgot to make it clear: turn on those advanced settings and then double the batch size. Fairseq uses a dynamic batch size, loading as many tokens as the currently available GPU RAM allows, but that code is hard to extract and apply to other frameworks (TBH, from an engineering standpoint it's hard to call fairseq a good framework). For me the best setting in the HF Trainer is DeepSpeed ZeRO stage 2 with no CPU offload, then adjusting the batch size to occupy all the GPU RAM.

Oh, now I really see how fairscale or deepspeed can reduce GPU RAM. Also, there is one issue I saw in HuggingFace that I don't know if you also encountered: GPU RAM seemed to accumulate going from the first epoch to the second, which crashed the training. This never happened with fairseq; once I pointed fairseq at the data path, GPU usage stayed constant and never spiked. I wonder if you have any insight into this issue, or if you just YOLO'd with deepspeed and it solved everything lol

dumpmemory commented 2 years ago

You might set --eval_accumulation_steps=1.
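
If I remember the Trainer docs right, --eval_accumulation_steps controls how often the accumulated prediction tensors are moved from GPU to CPU during evaluation, so eval memory stops growing between steps. Roughly where it goes (the output_dir here is just a placeholder):

python run_summarization.py \
    --model_name_or_path facebook/bart-large \
    --do_eval \
    --dataset_name xsum \
    --output_dir /tmp/xsum_eval \
    --per_device_eval_batch_size 32 \
    --predict_with_generate \
    --fp16 True \
    --eval_accumulation_steps 1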

leminhyen2 commented 2 years ago

> You might set --eval_accumulation_steps=1.

Oh, this sounds very interesting. I also saw this parameter when training: gradient accumulation steps = 16.

Do you think it has something to do with the RAM accumulating across epochs too?

ElderWanng commented 2 years ago

> Hey, it's me again. I tried a few approaches, but surprisingly the most effective one is very simple: I simply increased the batch size via --per_device_train_batch_size. If you double the batch size, the training is literally twice as fast. One ML engineer told me a bigger batch size enables more efficient parallelization.

> lol, fairscale or deepspeed saves GPU RAM by reducing unnecessary optimizer-state overhead, so you can fit a bigger batch into training. Seems I forgot to make it clear: turn on those advanced settings and then double the batch size. Fairseq uses a dynamic batch size, loading as many tokens as the currently available GPU RAM allows, but that code is hard to extract and apply to other frameworks (TBH, from an engineering standpoint it's hard to call fairseq a good framework). For me the best setting in the HF Trainer is DeepSpeed ZeRO stage 2 with no CPU offload, then adjusting the batch size to occupy all the GPU RAM.

> Oh, now I really see how fairscale or deepspeed can reduce GPU RAM. Also, there is one issue I saw in HuggingFace that I don't know if you also encountered: GPU RAM seemed to accumulate going from the first epoch to the second, which crashed the training. This never happened with fairseq; once I pointed fairseq at the data path, GPU usage stayed constant and never spiked. I wonder if you have any insight into this issue, or if you just YOLO'd with deepspeed and it solved everything lol

There is a "context manager" in fairseq: when it catches a CUDA OOM error, it tries to recover and continue training. I didn't see similar logic in HF. That's what I meant about FS having many engineering optimizations. So I usually leave a 10% or so RAM margin by adjusting the batch size in case of OOM. In terms of exploiting GPU memory, no other framework is as good as FS.

ElderWanng commented 2 years ago

> You might set --eval_accumulation_steps=1.

> Oh, this sounds very interesting. I also saw this parameter when training: gradient accumulation steps = 16.

> Do you think it has something to do with the RAM accumulating across epochs too?

update_freq (gradient accumulation) is just a trick to simulate a multi-card setup. In the original paper they fine-tuned on an 8-card node, so this coefficient is 2. I only have a 4-card node, so I set it to 4. On a 1-card node, you should set it to 16. I think it has nothing to do with the RAM usage issue.
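
To spell out the arithmetic (with an assumed per-device batch size of 16, since the exact value depends on your setup): the effective batch size is per-device batch × number of GPUs × accumulation steps, so the accumulation factor scales inversely with the number of cards.

# Effective batch size = per-device batch * number of GPUs * gradient accumulation steps
PER_DEVICE_BATCH=16   # assumed value for illustration
echo "8 GPUs, accum 2:  $((PER_DEVICE_BATCH * 8 * 2)) samples per update"
echo "4 GPUs, accum 4:  $((PER_DEVICE_BATCH * 4 * 4)) samples per update"
echo "1 GPU,  accum 16: $((PER_DEVICE_BATCH * 1 * 16)) samples per update"
# All three print 256, i.e. the same effective batch per optimizer step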

leminhyen2 commented 2 years ago

> So I usually leave a 10% or so RAM margin by adjusting the batch size in case of OOM.

By this, you mean the batch size is hardcoded, right?

Also, I didn't realize gradient accumulation steps is --update-freq in Fairseq. Thank you!