ChenDRAG opened 11 months ago
Hi @ChenDRAG, did you try running it with the multi_gpu.yaml
configuration instead? Maybe the memory optimisations introduced by ZeRO are degrading your GPU throughput...
The command would look like the following:
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml
Other than that, I suggest you try LoRA if you're having issues with either SFT or DPO, as it uses less memory and requires fewer resources to run; with 40GB of VRAM you'll be good to go with LoRA.
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_lora.yaml
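To illustrate why LoRA fits in far less VRAM than full fine-tuning, here is a rough back-of-envelope sketch. The hidden size, layer count, rank, and target modules below are illustrative assumptions for a LLaMA-like 7B model, not values taken from this repo's configs:

```python
# Rough illustration of why LoRA cuts memory: only the small adapter
# matrices are trained, so gradient and optimizer state shrink dramatically.
# Assumed numbers for a LLaMA-like 7B model: hidden size 4096, 32 layers,
# LoRA rank 16 applied to the q/k/v/o attention projections.

hidden = 4096
layers = 32
rank = 16
target_modules = 4  # q_proj, k_proj, v_proj, o_proj (assumption)

# Each adapted d x d projection gains two low-rank factors: d x r and r x d.
lora_params = layers * target_modules * (hidden * rank + rank * hidden)
total_params = 7_000_000_000  # nominal 7B base model

fraction = lora_params / total_params
print(f"LoRA adapter params: {lora_params / 1e6:.1f}M "
      f"({fraction:.2%} of the base model)")

# Adam keeps two fp32 states per trainable param (~8 bytes/param), so
# optimizer memory drops from tens of GB to well under 1 GB.
adam_bytes = 8 * lora_params
print(f"Adam state for adapters: ~{adam_bytes / 1e9:.2f} GB "
      f"vs ~{8 * total_params / 1e9:.0f} GB for full fine-tuning")
```

The frozen base weights still have to fit in memory (around 14 GB in bf16 for 7B), which is why LoRA works on a single 40GB card while full fine-tuning does not.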
@alvarobartt Thanks a lot for your kind help!
However, the instructions to reproduce the experiments in the scripts are:
# Full training with ZeRO-3 on 8 GPUs
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_{task}.py recipes/{model_name}/{task}/config_full.yaml
# LoRA training on a single GPU
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_{task}.py recipes/{model_name}/{task}/config_lora.yaml
# QLoRA 4-bit training on a single GPU
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_{task}.py recipes/{model_name}/{task}/config_lora.yaml --load_in_4bit=true
# LoRA training with ZeRO-3 on two or more GPUs
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes={num_gpus} scripts/run_{task}.py recipes/{model_name}/{task}/config_lora.yaml
I notice that whenever multiple GPUs are used, deepspeed_zero3 is suggested for acceleration, and I don't know why.
Could you explain the main difference between the deepspeed_zero3 and multi_gpu configurations? Is there any potential problem (drawback) if I use multi_gpu.yaml for distributed training?
p.s.
I tried
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml
and it still reports an OOM error on 8 x 46GB cards.
DeepSpeed ZeRO-3 will shard the model over several GPUs, which should resolve the OOM issues you see. Note we tested on A100 80GB GPUs, so you may need to tweak the hyperparameters to match your use case.
Also using Flash Attention may decrease the VRAM consumption while training, right? cc @edbeeching
Thanks for your help!
I thought different GPUs merely lead to different upper limits on batch size. Can you tell me which specific hyperparameters I may need to alter, other than batch size and accumulation steps, to get things working on other GPUs?
I would also like more info about this. Do you use DeepSpeed to increase the batch size? A 7B model fits nicely on 80GB GPUs without any model parallelism.
Hi @alvarobartt sorry for the delay. Yes we are using flash attn.
@tcapelle if you have lower GPU memory you can use lora (peft) to perform finetuning.
Thanks for the prompt response =). BTW outstanding preso at DL.ai @edbeeching !
What I'm curious about is why you use DeepSpeed ZeRO-3 when 80GB GPUs are available: is it faster, or is it to increase batch size? I have a node of 8x80GB.
Thanks @tcapelle. ZeRO-3 shards the optimizer state, gradients, and model weights across GPUs, so you should have more memory available. However, if you are tuning a 7B model you may not need to shard, as you will probably be running DDP across the 8 GPUs.
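As a rough illustration of the sharding math (using the ~16 bytes/param estimate for mixed-precision Adam from the ZeRO paper; all numbers are back-of-envelope assumptions, not measurements from this repo):

```python
# Per-GPU memory for the model states in mixed-precision Adam training:
# 2 bytes (bf16 weights) + 2 bytes (bf16 grads) + 12 bytes
# (fp32 master weights + Adam m and v) = ~16 bytes per parameter.
# DDP replicates all of this on every GPU; ZeRO-3 shards it.

params = 7e9
bytes_per_param = 2 + 2 + 12  # = 16

def per_gpu_gb(num_gpus, shard=True):
    """Model-state memory per GPU; ZeRO-3 shards, plain DDP replicates."""
    total = params * bytes_per_param
    return total / (num_gpus if shard else 1) / 1e9

print(f"DDP (no sharding): {per_gpu_gb(8, shard=False):.0f} GB per GPU")
print(f"ZeRO-3 over 8 GPUs: {per_gpu_gb(8):.0f} GB per GPU")

# DPO additionally holds a frozen reference model (~2 bytes/param in
# bf16), plus activations, on top of the trainable model's states.
ref_model_gb = params * 2 / 1e9
print(f"Frozen reference model adds ~{ref_model_gb:.0f} GB (unsharded)")
```

Under these assumptions a 7B full DPO run needs roughly 112 GB of model state per GPU without sharding, which explains the OOM on 46GB cards, while ZeRO-3 brings it down to around 14 GB per GPU before activations.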
Yes, but in the Readme:
Full fine-tuning on a multi-GPU machine with DeepSpeed ZeRO-3 (tested on an 8 x A100 (80GB) node)
I am curious why you chose to shard when big GPUs are available; maybe I am missing something.
This is so the config is compatible with a larger model, e.g. llama-2-70b. I think that for a 7b model no sharding will take place.
The DPO recipe with a 7b model and config_full gets me OOM, so I was wondering what I should reduce to keep the recipe consistent.
I am on 8xA100 80GB
Hi, thank you for your great work! I'd like to reproduce full-parameter fine-tuning with DPO training. However, I only have 10 Nvidia A40 GPUs (46 GB memory each).
I tried the command
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml
and it reported an OOM error, even when I set the batch size to 1.
I don't mind if the program runs a bit slower (e.g., with a smaller batch size and more gradient accumulation steps). However, I don't know if there is a way to successfully deploy the full-DPO code.
Can you help me, please?
Also, I'm wondering how large the performance gap is between LoRA and full-parameter fine-tuning.
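On the batch size vs. gradient accumulation trade-off mentioned above: the global batch size is per-device batch × number of GPUs × accumulation steps, so you can shrink the per-GPU batch without changing the recipe's effective batch. A quick sketch with hypothetical numbers (not the recipe's actual values):

```python
# Effective (global) batch size stays constant if you trade per-device
# batch size for gradient accumulation steps. Numbers are hypothetical.

def effective_batch(per_device, num_gpus, grad_accum):
    return per_device * num_gpus * grad_accum

baseline = effective_batch(per_device=8, num_gpus=8, grad_accum=1)  # memory-hungry
low_mem = effective_batch(per_device=1, num_gpus=8, grad_accum=8)   # OOM-friendly
print(baseline, low_mem)  # both 64
```

Keeping the effective batch fixed this way should leave the optimisation dynamics close to the original recipe, at the cost of slower wall-clock training.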