mikeizbicki / modulus-magnus-linguae


LLAMA Finetuning #26

Open irajmoradi opened 1 year ago

irajmoradi commented 1 year ago

Hello,

I have tried to finetune the 7B-parameter LLAMA model locally on the Lambda server, using this repo as a guide. I set the number of devices to 8 and have tried adjusting the hyperparameters, including the batch size, but I keep getting the same error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 3; 10.76 GiB total capacity; 10.21 GiB already allocated; 24.56 MiB free; 10.21 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This happens with both the LoRA fine-tuning method and the adapter fine-tuning method, which uses DeepSpeed and should allow multi-GPU processing. The GitHub tutorial for fine-tuning mentions using a GPU with 24 GB of VRAM, while the 2080s in the Lambda server only have 11 GB of GPU memory each. The smallest VRAM requirement I have seen is on this GitHub repo, which says that even with DeepSpeed and LoRA, a 3B-parameter model requires 9 GB of GPU memory and a 12B-parameter model requires 22 GB.
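As a rough back-of-the-envelope check (a sketch, not a measurement): assuming the weights are stored in 16-bit precision (2 bytes per parameter) and that each GPU holds a full copy of the model, the weights of a 7B model alone would already exceed the 10.76 GiB capacity reported in the error. That would be consistent with the error not going away when the batch size changes. The function name and the precision are assumptions for illustration, not something taken from the repo.

```python
GIB = 1024 ** 3  # bytes per GiB, the unit used in the CUDA error message


def weight_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the model weights, in GiB.

    bytes_per_param=2 assumes fp16/bf16 weights (an assumption);
    use 4 for fp32. Activations, gradients, and optimizer state
    are NOT included, so real usage is higher.
    """
    return n_params * bytes_per_param / GIB


GPU_CAPACITY_GIB = 10.76  # "total capacity" from the error message

for name, n_params in [("3B", 3e9), ("7B", 7e9), ("13B", 13e9)]:
    need = weight_memory_gib(n_params)
    verdict = "fits" if need < GPU_CAPACITY_GIB else "does not fit"
    print(f"{name}: {need:.1f} GiB for weights alone, {verdict} on one GPU")
```

Under these assumptions, only the 3B model's weights fit on a single 11 GB card, which is why some form of sharding or quantization would be needed for 7B.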

I've heard that FSDP could be used to split the model across multiple GPUs, but I haven't found a tutorial or GitHub repo that uses FSDP for finetuning.
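To get a feel for what sharding might buy, here is a minimal estimate of per-GPU memory if the training state were split evenly across the 8 GPUs, which is the idea behind FSDP. The numbers rest on loudly stated assumptions: about 16 bytes per parameter for full finetuning with mixed-precision Adam (a common rule of thumb), 2 bytes per parameter for frozen 16-bit weights as in LoRA, perfectly even sharding, and no activation memory. The helper name is hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB


def sharded_gib_per_gpu(n_params: float, bytes_per_param: float, n_gpus: int) -> float:
    """Per-GPU memory in GiB if the state is split evenly across n_gpus.

    Idealized: ignores activations, communication buffers, and fragmentation.
    """
    return n_params * bytes_per_param / n_gpus / GIB


N_PARAMS = 7e9            # 7B model
N_GPUS = 8                # devices used in the failed runs
GPU_CAPACITY_GIB = 10.76  # from the CUDA error message

# Assumption: ~16 bytes/param for full finetuning with mixed-precision Adam
# (fp16 weights + fp16 grads + fp32 master weights and two Adam moments).
full_finetune = sharded_gib_per_gpu(N_PARAMS, 16, N_GPUS)

# Assumption: LoRA keeps the base weights frozen in 16-bit precision;
# the small adapter weights and their optimizer state are ignored here.
frozen_base = sharded_gib_per_gpu(N_PARAMS, 2, N_GPUS)

print(f"full finetune, sharded: {full_finetune:.1f} GiB per GPU")
print(f"frozen 16-bit base, sharded: {frozen_base:.1f} GiB per GPU")
print(f"GPU capacity: {GPU_CAPACITY_GIB} GiB")
```

Under these assumptions, full finetuning would still not fit even with ideal sharding, while a sharded frozen base model would leave several GiB per GPU for activations and adapters.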

mikeizbicki commented 1 year ago

For now, I would try finetuning on the CPU only, without the GPU. Normally the CPU is much slower, but because the Lambda server has 80 CPUs, it's usually a bit faster to use the CPUs than a single GPU.

The finetuning will probably still be really slow. We'll just use this to get a basic pipeline setup before we move to a faster setup.

Alternatively, the QCL has a machine with 4 V100 GPUs (which I believe have 24GB of RAM each and are what a lot of the tutorials target). My guess is that they're not currently being used, so you could talk to Prof. Park about using this machine.

irajmoradi commented 1 year ago

I got finetuning working on just the CPU; however, it looks like it will take around a week* per fine-tune on 7B, assuming nothing goes wrong. I also just sent an email to Prof. Park about getting access to the QCL machine so that finetuning can go faster and so that we could maybe even do 13B finetuning.

*With the hyperparameters that were set in the file itself. I am going to do a test run with fewer iterations this weekend so I can test whether the function's output works and how to deploy the fine-tuned model.