philschmid / deep-learning-pytorch-huggingface

MIT License
624 stars 145 forks

OOM error in FSDP QLORA setup #60

Open ss8319 opened 1 week ago

ss8319 commented 1 week ago

Hi @philschmid Thanks for your work here.

I am testing out the training/scripts/run_fsdp_qlora.py

My setup is 4x NVIDIA RTX 4090 GPUs with 24GB memory each.

I did change the model from Llama 3 to Llama 3 Instruct, but I don't think that makes a difference. I get the OOM error at the quantisation step, before training even starts. I kept the same 4-bit quantisation setup. Answer.AI reports fine-tuning a 70B model with FSDP+QLoRA on 2x 24GB GPUs, whereas I have 4 GPUs in this case.

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB.
```