huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

Training with LoRA on multi-GPU gives constant loss #73

Open sids07 opened 7 months ago

sids07 commented 7 months ago

I am trying to train the Yi-34B model with a LoRA setup on multi-GPU, but I am getting a constant loss of around 2 throughout my SFT training over 4 epochs, and running inference with the trained model gives useless responses.
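A flat loss like this often means the adapter weights are not actually being updated. As a quick, hedged sanity check (assuming a PEFT-wrapped causal LM and one tokenized batch that includes labels; the helper name below is made up for illustration), you can confirm that some parameters are trainable and receive non-zero gradients:

```python
# Hypothetical debugging helper: `model` is assumed to be a PEFT-wrapped
# causal LM and `batch` a tokenized batch that includes labels.
def check_lora_is_training(model, batch):
    # PEFT models report how many parameters are trainable vs. frozen;
    # if this prints 0 trainable parameters, a constant loss is expected.
    model.print_trainable_parameters()

    # One forward/backward pass to see whether the adapters get gradients.
    outputs = model(**batch)
    outputs.loss.backward()

    for name, param in model.named_parameters():
        if param.requires_grad:
            grad_norm = 0.0 if param.grad is None else param.grad.norm().item()
            print(f"{name}: grad_norm={grad_norm:.6f}")
```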

DRXD1000 commented 7 months ago

Could you please provide the code used for training?

And some info about the GPUs you used.

sids07 commented 7 months ago

@DRXD1000 I am using the SFT trainer script from this repo: https://github.com/huggingface/alignment-handbook/blob/main/scripts/run_sft.py

Regarding GPUs, I am using 4×A100 (80 GB) GPUs from RunPod.

DRXD1000 commented 6 months ago

Hmm, I guess there is either a CUDA problem, if you are doing 4-bit or 8-bit training, or something wrong with your training data or script.

fusesid commented 6 months ago

@DRXD1000 I am sure there is no problem with the data or script: following the same script I have been able to fine-tune models up to 34B, but for Mixtral 8x7B I ran into CUDA out-of-memory errors, so I wanted to try LoRA. Also, the same script with LoRA works fine on a single GPU. And I am not doing 4-bit or 8-bit training, though I have enabled bfloat16.
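For reference, here is a minimal sketch of the kind of setup being described (LoRA adapters on a bf16 base model, no 4-bit/8-bit quantization) using trl's SFTTrainer. This is not the handbook's run_sft.py; the model, dataset, and hyperparameters are placeholders, and argument names differ between trl versions.

```python
# Illustrative sketch only, not the handbook's run_sft.py: model and dataset
# names are placeholders, and trl/peft argument names vary across versions.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "01-ai/Yi-34B"  # placeholder base model, as in the original issue
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA adapters only; the base weights stay frozen in bf16 (no 4-bit/8-bit).
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="yi-34b-lora-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=4,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=load_dataset("timdettmers/openassistant-guanaco", split="train"),
    dataset_text_field="text",  # placeholder dataset with a plain "text" column
    max_seq_length=2048,
    peft_config=peft_config,
    tokenizer=tokenizer,
)
trainer.train()
```

Run under the repo's usual accelerate launch multi-GPU setup, this is roughly the configuration in which the constant loss is being reported.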

DRXD1000 commented 6 months ago

Maybe you could try `with torch.autocast("cuda"): trainer.train()`.
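In code, that suggestion is roughly the following minimal sketch, assuming `trainer` is an already-constructed SFTTrainer and that bf16 is the intended autocast dtype:

```python
import torch

# Run the whole training loop under a CUDA autocast context so forward
# passes execute in bf16 mixed precision; `trainer` is assumed to exist.
with torch.autocast("cuda", dtype=torch.bfloat16):
    trainer.train()
```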

If this does not work, you could try the script from the Mixtral blog post on Hugging Face: https://huggingface.co/blog/mixtral#fine-tuning-with-%F0%9F%A4%97-trl
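Relative to the plain bf16 LoRA sketch above, the main difference in that blog post's approach is that the base model is loaded in 4-bit (QLoRA-style), which also avoids the CUDA out-of-memory errors mentioned for Mixtral. A hedged sketch of just that loading step (argument names may differ across transformers/bitsandbytes versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the frozen Mixtral base weights to 4-bit NF4 so they fit in GPU
# memory; LoRA adapters and the SFTTrainer setup stay as in the earlier sketch.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",  # spreads the quantized model across available GPUs
)
```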