huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

How to perform full parameter finetuning without A100 GPUs #22

Open ChenDRAG opened 11 months ago

ChenDRAG commented 11 months ago

Hi, thank you for your great work! I'd like to reproduce the full-parameter DPO training. However, I only have 10 x NVIDIA A40 GPUs (46 GB of memory each).

I tried the command

CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml

and it reported an OOM error, even when I set the batch size to 1.

I don't mind if the program runs a bit slower (e.g., using a smaller batch size and more gradient accumulation steps). However, I don't know whether there is a way to successfully run the full DPO code.

Can you help me, please?

Also, I'm wondering how large the performance gap is between LoRA and full-parameter fine-tuning.

alvarobartt commented 11 months ago

Hi @ChenDRAG, did you try running it with the multi_gpu.yaml configuration instead? Maybe the memory optimisations introduced by ZeRO are degrading performance on your GPUs...

The command would look like the following:

CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml

Other than that, I suggest you try LoRA if you're having issues with either SFT or DPO, as it uses less memory and requires fewer resources to run; with 40GB of VRAM you'll be good to go with LoRA.

CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_lora.yaml
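
If LoRA alone still runs out of memory, the same recipe can also be run with 4-bit quantisation (QLoRA), as the repo's README does for single-GPU training; this is just the LoRA command above with one extra flag:

CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_lora.yaml --load_in_4bit=true
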
ChenDRAG commented 11 months ago

@alvarobartt Thanks a lot for your kind help! However, the instructions given in the repo to reproduce the experiments are:

# Full training with ZeRO-3 on 8 GPUs
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_{task}.py recipes/{model_name}/{task}/config_full.yaml

# LoRA training on a single GPU
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_{task}.py recipes/{model_name}/{task}/config_lora.yaml

# QLoRA 4-bit training on a single GPU
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_{task}.py recipes/{model_name}/{task}/config_lora.yaml --load_in_4bit=true

# LoRA training with ZeRO-3 on two or more GPUs
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes={num_gpus} scripts/run_{task}.py recipes/{model_name}/{task}/config_lora.yaml

I notice that whenever multiple GPUs are used, deepspeed_zero3 is the suggested configuration, and I don't know why.

Could you explain the main difference between the deepspeed_zero3 and multi_gpu configurations? Is there any potential problem (drawback) if I use multi_gpu.yaml for distributed training?

ChenDRAG commented 11 months ago

P.S. I tried CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml and it still reports an OOM error on 8 x 46 GB cards.

edbeeching commented 11 months ago

DeepSpeed ZeRO-3 will shard the model over several GPUs, which should resolve the OOM issues you see. Note that we tested on A100 (80GB) GPUs, so you may need to tweak the hyperparameters to match your use case.
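
For example, assuming the config fields can be overridden from the command line (as with --load_in_4bit in the README), a memory-saving variant of the full DPO command might look like the sketch below. Lowering per_device_train_batch_size and raising gradient_accumulation_steps keeps the effective batch size (per_device_train_batch_size x num_gpus x gradient_accumulation_steps) constant while using less VRAM, and gradient checkpointing trades extra compute for memory:

# Sketch only, not a tested recipe; the override values are illustrative
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml --per_device_train_batch_size=1 --gradient_accumulation_steps=8 --gradient_checkpointing=true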

alvarobartt commented 11 months ago

Also, using Flash Attention may decrease VRAM consumption while training, right? cc @edbeeching
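
If it isn't already enabled in the recipe config, a sketch of turning it on via a config override could look like the line below; the exact field name depends on the handbook / transformers version (older configs use use_flash_attention_2: true, newer ones attn_implementation: flash_attention_2), so check config_full.yaml first:

# Sketch only; the field name varies by version
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml --use_flash_attention_2=true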

ChenDRAG commented 11 months ago

DeepSpeed ZeRO-3 will shard the model over several GPUs, which should resolve the OOM issues you see. Note that we tested on A100 (80GB) GPUs, so you may need to tweak the hyperparameters to match your use case.

Thanks for your help!

I thought different GPUs merely led to different upper limits on the batch size. Can you tell me which specific hyperparameters I may need to alter, other than the batch size and gradient accumulation steps, to get things working on other GPUs?

tcapelle commented 8 months ago

I would also like more info about this. Do you use DeepSpeed to increase the batch size? A 7B model fits nicely on 80GB GPUs without any model parallelism.

edbeeching commented 8 months ago

Hi @alvarobartt, sorry for the delay. Yes, we are using flash-attn.

@tcapelle if you have less GPU memory, you can use LoRA (PEFT) to perform fine-tuning.

tcapelle commented 8 months ago

Thanks for the prompt response =). BTW, outstanding presentation at DL.ai, @edbeeching!

What I am curious about is why you use DeepSpeed ZeRO-3 when 80GB GPUs are available: is it faster, or is it to increase the batch size? I have a node of 8 x 80GB.

edbeeching commented 8 months ago

Thanks @tcapelle. ZeRO-3 shards the optimizer state, gradients, and model weights across GPUs, so you should have more memory available. However, if you are tuning a 7B model you may not need to shard, as you will probably be fine running plain DDP across the 8 GPUs.
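
To make the memory picture concrete, here is a rough back-of-the-envelope estimate (a sketch that ignores activations and the frozen reference model that DPO also keeps in memory):

# Full fine-tuning with AdamW in bf16 mixed precision needs roughly
#   2 B (bf16 weights) + 2 B (bf16 grads) + 12 B (fp32 master weights + Adam m and v) = ~16 B per parameter,
# so a 7B model carries on the order of 112 GB of weight/grad/optimizer state,
# which ZeRO-3 shards across the GPUs instead of replicating on each one.
python -c "print(f'{7e9 * 16 / 1e9:.0f} GB')"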

tcapelle commented 8 months ago

Yes, but in the Readme:

Full fine-tuning on a multi-GPU machine with DeepSpeed ZeRO-3 (tested on an 8 x A100 (80GB) node)

I am curious why you chose to shard when such big GPUs are available; maybe I am missing something.

edbeeching commented 8 months ago

This is so the config is compatible with larger models, e.g. Llama-2-70B. I think that for a 7B model no sharding will take place.

tcapelle commented 8 months ago

The DPO recipe with a 7B model and config_full gets me an OOM error, so I was wondering what I should reduce to keep the recipe consistent.

I am on an 8 x A100 (80GB) node.
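
Besides lowering the per-device batch size and compensating with gradient accumulation as sketched above, shortening the DPO sequence lengths often gives the biggest saving, since each step processes both the chosen and the rejected completion. A sketch, assuming the DPO config exposes max_length and max_prompt_length and accepts CLI overrides (values are illustrative):

# Sketch only; check the fields and defaults in config_full.yaml
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml --max_length=1024 --max_prompt_length=512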