huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

How to do QLoRA training with ZeRO-3 on two or more GPUs? #42

Open Di-Zayn opened 8 months ago

Di-Zayn commented 8 months ago

I added `--load_in_4bit=true` to the "LoRA training with ZeRO-3 on two or more GPUs" command to combine QLoRA with ZeRO-3, but the program failed with the following error:

RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param..all_gather_coalesced.. at 0x7f2ec8daf900>

The command is:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes=2 scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_lora.yaml --load_in_4bit=true

alvarobartt commented 8 months ago

Hi @Di-Zayn, note that you will also need to modify the configuration used for DeepSpeed ZeRO-3, as the one shared in the repo is suited for a VM with 8 x A100 80GB GPUs; to suit your setup you may need to add the flags required to load and train in lower precision.
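For reference, here is a minimal sketch of what an Accelerate config along the lines of recipes/accelerate_configs/deepspeed_zero3.yaml can look like when adapted to a 2-GPU machine; the field names follow `accelerate config`, but the values are assumptions and may not match the handbook's file exactly:

```yaml
# Sketch of an Accelerate DeepSpeed ZeRO-3 config for a 2-GPU machine.
# Values are assumptions; compare against recipes/accelerate_configs/deepspeed_zero3.yaml.
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true           # deepspeed.zero.Init(), where the traceback above originates
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16             # train in lower precision, as suggested above
num_machines: 1
num_processes: 2                  # match the number of GPUs / --num_processes
rdzv_backend: static
same_network: true
use_cpu: false
```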

Anyway, I'm not sure how to fine-tune that setup using NF4, but maybe https://www.deepspeed.ai/tutorials/MoQ-tutorial/#deepspeed-configuration-file is worth checking?

laphang commented 2 months ago

I'm getting this issue as well (trying QLoRA with ZeRO-3 and 4 GPUs, same error message). @Di-Zayn, were you able to solve it?

Serega6678 commented 2 months ago

I had similar problems, so I switched to the multi_gpu config, set the parameter to use just 2 GPUs, and everything worked fine: https://github.com/huggingface/alignment-handbook/blob/main/recipes/accelerate_configs/multi_gpu.yaml
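For reference, the multi_gpu.yaml config linked above is a plain data-parallel (DDP) Accelerate config, roughly along these lines; the values here are assumptions and may differ from the file in the repo:

```yaml
# Sketch of a plain multi-GPU (DDP) Accelerate config, no ZeRO sharding.
# Values are assumptions; compare against recipes/accelerate_configs/multi_gpu.yaml.
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2                  # the number of GPUs to use
rdzv_backend: static
same_network: true
use_cpu: false
```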

However, with the ZeRO config the starting loss was around 1.7, versus 1.4 with the multi_gpu config, both when using 1 and 2 GPUs.

I never bothered experimenting further with ZeRO, as I got the results I needed with the multi_gpu config.

laphang commented 2 months ago

I was keen on sharding the model across GPUs in order to fit larger models.

As an aside, the latest FSDP + QLoRA examples are working for me and cover my use case: https://github.com/huggingface/alignment-handbook/commit/606d2e954fd17999af40e6fb4f712055ca11b2f0
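For context, a rough sketch of what an FSDP-based Accelerate config for QLoRA can look like; the key names come from Accelerate's FSDP plugin, but the values are assumptions and may differ from the config added in the commit linked above:

```yaml
# Sketch of an FSDP Accelerate config for QLoRA on 2 GPUs.
# Keys follow Accelerate's FSDP plugin; values are assumptions and may differ
# from the config introduced in the commit linked above.
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true    # load weights on rank 0, then broadcast
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD      # shard params/grads/optimizer state across GPUs
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2                          # one process per GPU
use_cpu: false
```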