Closed: ElderWanng closed this issue 2 years ago.
Seems I didn't set fp16 correctly: I should start the training with torchrun --nproc_per_node=4 and set --sharded_ddp zero_dp_3.
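For anyone new to this, here is a minimal sketch of that launch, assuming the run_summarization.py example script on a single 4-GPU node; model, dataset, and output paths are placeholders:

```bash
# torchrun spawns one process per GPU; --sharded_ddp is a HF Trainer argument
# that enables fairscale sharded DDP (ZeRO-DP stage 3 here).
torchrun --nproc_per_node=4 run_summarization.py \
  --model_name_or_path facebook/bart-large \
  --dataset_name xsum \
  --do_train \
  --output_dir ./bart-xsum-fairscale \
  --per_device_train_batch_size 16 \
  --sharded_ddp zero_dp_3 \
  --fp16
```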
@ElderWanng Hey, I'm fairly new to machine learning. Can you tell me where you run torchrun --nproc_per_node=4 and --sharded_ddp zero_dp_3? Thank you
torchrun is the replacement CLI for 'python -m torch.distributed.launch' in the latest PyTorch (1.11); see https://pytorch.org/docs/stable/elastic/run.html
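Roughly, the two launchers are interchangeable for this script (a sketch; the trailing flags are whatever you already pass):

```bash
# older launcher
python -m torch.distributed.launch --nproc_per_node=4 run_summarization.py ...

# torchrun equivalent in newer PyTorch
torchrun --nproc_per_node=4 run_summarization.py ...
```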
BTW, I'm using deepspeed now; it's faster. The method above is for fairscale.
The start command is:
```bash
deepspeed --num_gpus=4 run_summarization.py \
  --model_name_or_path facebook/bart-large \
  --do_train --do_predict \
  --dataset_name xsum \
  --overwrite_output_dir \
  --learning_rate 3e-05 \
  --label_smoothing_factor 0.1 \
  --warmup_steps 500 \
  --num_train_epochs 10 \
  --max_source_length 1024 \
  --max_target_length 1024 \
  --val_max_target_length 80 \
  --gradient_accumulation_steps 2 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 32 \
  --num_beams 6 \
  --save_strategy steps \
  --save_steps 750 \
  --evaluation_strategy steps \
  --eval_steps 750 \
  --load_best_model_at_end True \
  --metric_for_best_model loss \
  --greater_is_better False \
  --predict_with_generate \
  --fp16 True \
  --deepspeed deepspeed_config/ds_config_zero3.json
```
The config JSON can be found in https://github.com/huggingface/transformers/tree/main/tests/deepspeed
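If it helps, one way to grab that example config, assuming ds_config_zero3.json still sits under tests/deepspeed in the transformers repo:

```bash
# download the example ZeRO-3 config into the folder the command above expects
mkdir -p deepspeed_config
curl -L -o deepspeed_config/ds_config_zero3.json \
  https://raw.githubusercontent.com/huggingface/transformers/main/tests/deepspeed/ds_config_zero3.json
```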
This is awesome, thank you for sharing :) Do you think it's possible to include deepspeed or fairscale in the code, maybe as a parameter in Trainer args? Or is the only option to write everything in a .py file and then run it with torchrun or deepspeed?
In the HF Trainer I didn't see any workaround (I didn't dive in too deeply; maybe the answer is yes). But I wrote another version in pytorch-lightning, which makes 'launcher as arguments' possible, and I reproduced the same ROUGE score on XSUM with the same training time. If you want, I'm glad to share it.
Oh BTW, fairscale now works as an extension for torch. Even when fairscale is not activated, I still use torchrun as the launcher instead of plain python run_summarization.py. For 'launcher as arguments', the answer is yes, at least for fairscale: you can simply switch it on/off by changing --strategy (pytorch-lightning's flag). I think sharded training should be the default training setting in 2022, rather than naive DDP.
For Deepspeed, I'm not sure whether the HF Trainer maintains clean CLI args for this; I just followed the tutorial at https://huggingface.co/docs/transformers/main_classes/trainer
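For the HF example script itself, the analogous on/off toggle is the --sharded_ddp Trainer flag rather than pytorch-lightning's --strategy; a sketch (elided flags are whatever you already pass):

```bash
# naive DDP
torchrun --nproc_per_node=4 run_summarization.py ... --fp16

# fairscale sharded DDP: same launcher, one extra Trainer flag
torchrun --nproc_per_node=4 run_summarization.py ... --fp16 --sharded_ddp zero_dp_3
```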
Thank you so much for the insight! I'm currently fine-tuning on Colab and the standard is to run code by cell so I was looking for ways to integrate the speedup in code. But actually now that I look back, fairseq kinda just wrapped most of the training code into the train.py file
Anyway, I'm glad that someone confirmed that Huggingface training can be as fast as Fairseq
Now I understand why you have to pass args in a pure Python environment. My advice is to try to activate fp16 training; that helps a lot. I'm not sure whether Colab supports fp16 training. I used 4 x RTX 8000 cards, 10 epochs, about 40 min/epoch.
FP16 is activated and I run the code on one V100 GPU. When I was training fairseq, I enabled FP16 too.
HuggingFace seemed to take a lot longer for one Transformer Base epoch (1 hour), while Fairseq took just 20-30 min for one Transformer Big epoch. If only HuggingFace could improve the speed by default, it would totally obliterate Fairseq.
fairscale or deepspeed would bring it down to about 40 min (still can't catch up with fairseq from 2 years ago).
So from your experience, with all the optimizations you have, it's still not as fast as fairseq? How much slower would you say the most optimized HF training is compared with standard fairseq?
I forget the exact numbers, but it is close enough. I think the sacrifice in training time is worth it compared to the inconvenience of fairseq.
Ah ok, good to hear, I'll apply all the optimizations you mentioned. Hope it will go down to at least 40 min an epoch.
Hey, it's me again. I tried a few approaches, but surprisingly the most effective one is very simple: I simply increased the batch size in --per_device_train_batch_size. If you double the batch size, the training is literally twice as fast. One ML engineer told me a bigger batch size enables more efficient parallelization.
lol, fairscale or deepspeed will save your GPU RAM by reducing unnecessary optimizer-state overhead, so you can squeeze a bigger batch into training. Seems I forgot to make it clear: turn on those advanced settings and then double the batch size. Fairseq uses a dynamic batch size, loading more tokens based on the currently available GPU RAM, but that code is hard to extract and apply to other frameworks. (TBH, from an engineering perspective it's hard to call fairseq a good framework.) For me the best setting in the HF Trainer is DeepSpeed ZeRO stage 2, no CPU offload, then adjust the batch size to occupy all the GPU RAM.
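A sketch of that recipe applied to the command above (my numbers, not a verified config): double --per_device_train_batch_size and halve --gradient_accumulation_steps so the effective batch size stays the same, and point --deepspeed at a ZeRO stage-2 config without CPU offload (ds_config_zero2.json from the same tests/deepspeed folder is one option):

```bash
deepspeed --num_gpus=4 run_summarization.py \
  ... \
  --per_device_train_batch_size 32 \
  --gradient_accumulation_steps 1 \
  --fp16 True \
  --deepspeed deepspeed_config/ds_config_zero2.json
```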
What do you want me to do?
Oh, now I really see how fairscale or deepspeed can reduce GPU RAM. Also, there is one issue I saw in HuggingFace, but I don't know if you also encountered it: the GPU RAM seemed to accumulate when going from the first epoch to the second, which crashed the training. This never happened with fairseq; after I pointed fairseq at the data path, the GPU usage was always constant and never spiked. I wonder if you have any insight into this issue, or if you just YOLO'd with deepspeed and it solved everything lol
You might set --eval_accumulation_steps=1.
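For context, --eval_accumulation_steps is an existing Trainer argument: it moves prediction outputs from GPU to CPU every N eval steps instead of holding them all in GPU memory until evaluation ends, which helps with eval-time OOM. Added to the earlier command it would look like:

```bash
deepspeed --num_gpus=4 run_summarization.py \
  ... \
  --predict_with_generate \
  --eval_accumulation_steps 1
```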
Oh, this sounds very interesting. I also saw this parameter when training: Gradient accumulation steps = 16. Do you think it has something to do with the per-epoch RAM accumulation too?
there is a "context manager" is fairseq: when catch cuda OOM error, trying to restore training. I didn't see similar logic in HF. That's what I said about FS having many engineering optimizations. So usually I leave 10% or so RAM margin by adjusting batch size in case of OOM. In terms of exploiting video memory, no other framework is as good as FS.
update_freq is just a trick to simulate "more cards". In their original paper they fine-tuned on an 8-card node, so this coefficient was 2. I only have a 4-card node, so I set it to 4. On a 1-card node you should set it to 16. I think it has nothing to do with RAM usage.
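In other words (my reading of the numbers above): the effective batch per optimizer step is per-GPU batch x number of GPUs x update_freq (gradient_accumulation_steps in HF), so 8 cards x 2 = 4 cards x 4 = 1 card x 16 all accumulate the same 16 per-GPU batches per update; it only changes how often the optimizer steps, not peak memory.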
So usually I leave a 10% or so RAM margin by adjusting the batch size, in case of OOM.
By this, you mean the batch size is hardcoded, right?
Also, I didn't realize gradient accumulation steps is --update-freq in Fairseq. Thank you!
System Info

Who can help?
No response

Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Device: Tesla V100-PCIE-32GB * 1
For HF's Trainer: the estimated time for 1 epoch is about 3.3 h.
For FairSeq: the estimated time for 1 epoch is about 1.5 h.

Expected behavior