unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Full Finetune with Unsloth #1021

Open · user074 opened this issue 2 months ago

user074 commented 2 months ago

I am just curious whether the current version of Unsloth supports full finetuning. I am currently experimenting with training a TinyLlama model on a 24GB VRAM GPU. Using Unsloth to just load the model, without LoRA or anything else, takes only about 10GB of VRAM, but when I load it with transformers' AutoModelForCausalLM it is close to 24GB. It seems that Unsloth works well for a full finetune even when the model is just loaded with FastLanguageModel?

I know the current version claims not to support full finetuning yet, but I wonder whether just loading with FastLanguageModel and training still amounts to a full finetune.

Basically, I just load the model and tokenizer with FastLanguageModel.from_pretrained, then use the model directly in SFTTrainer, and the memory usage is significantly lower.
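
For reference, a minimal sketch of that kind of setup (the model name, dataset, and trainer arguments here are placeholders rather than the original reporter's script, and the exact SFTTrainer kwargs depend on the installed trl version):

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load in full precision (no 4-bit quantization) so every weight stays trainable.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    max_seq_length = 2048,
    dtype = None,          # auto-detect bf16 / fp16
    load_in_4bit = False,
)

# Note: FastLanguageModel.get_peft_model(...) is deliberately NOT called,
# so no LoRA adapters are attached and the base weights are what gets trained.
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,   # assumed: a dataset with a "text" column
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 1,
        output_dir = "outputs",
    ),
)
trainer.train()
```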

danielhanchen commented 1 month ago

In theory it works, but some weights will not be trained - i.e. the RMS LayerNorm weights and the weights for the MLP layers. You could skip .get_peft_model, and I guess it would partially work.

fzyzcjy commented 3 weeks ago

@danielhanchen Hi, may I ask whether Unsloth still lacks full finetuning support today? Since Unsloth is fast and memory-efficient, it would be great to have it supported. Thanks!

fzyzcjy commented 3 weeks ago

I made a quick experiment, shown below. It seems the layer norm weights are never changed, while the other parameters are.

[screenshot: parameter comparison before/after training, showing the layernorm weights unchanged]
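
A rough sketch of that kind of check (assuming `model` and `trainer` as in the snippet earlier in the thread; this is an illustration, not the exact script behind the screenshot):

```python
import torch

# Snapshot every parameter before training, then diff after training to see
# which tensors actually received updates.
before = {name: p.detach().clone() for name, p in model.named_parameters()}

trainer.train()

for name, p in model.named_parameters():
    changed = not torch.equal(before[name], p.detach())
    # Layernorm weights staying identical would indicate their gradient is skipped.
    if "norm" in name.lower():
        print(f"{name}: {'changed' if changed else 'UNCHANGED'}")
```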

fzyzcjy commented 3 weeks ago

@danielhanchen I am happy to open a PR to make the layernorm work (if it is the only missing piece)! IMHO full finetuning is needed quite often, and with small models like qwen2.5-0.5B or qwen2.5-math-1.5B it is possible to do a full finetune on cards like an RTX 4090.

danielhanchen commented 2 weeks ago

@fzyzcjy The layernorm weight gradients should be a bit complex to implement, I guess - i.e. I think they're just the sum of the gradients along the correct axis (maybe axis = 1?), i.e. add up all the rows.

But one has to first derive the gradient for the weights, i.e. dC/dW, which I currently skip; instead I only compute dC/dX.
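
For reference, a minimal sketch of what that weight gradient looks like for a standard RMSNorm (y = w * x / rms(x)). This is plain PyTorch for illustration, not Unsloth's kernel code, and the function name is made up:

```python
import torch

def rmsnorm_weight_grad(dY, X, eps = 1e-6):
    # X and dY have shape (num_rows, hidden_dim), with batch and sequence
    # dimensions already flattened into rows.
    # Forward pass was: Y = W * X_hat, where X_hat = X / sqrt(mean(X**2) + eps).
    X_hat = X * torch.rsqrt(X.pow(2).mean(dim = -1, keepdim = True) + eps)
    # Every row shares the same weight vector W, so dC/dW is the elementwise
    # product dY * X_hat summed over the row axis ("add up all the rows").
    return (dY * X_hat).sum(dim = 0)

# Quick check against autograd:
X  = torch.randn(8, 16, dtype = torch.float64)
W  = torch.randn(16,    dtype = torch.float64, requires_grad = True)
dY = torch.randn(8, 16, dtype = torch.float64)
X_hat = X * torch.rsqrt(X.pow(2).mean(dim = -1, keepdim = True) + 1e-6)
(W * X_hat).backward(dY)
assert torch.allclose(W.grad, rmsnorm_weight_grad(dY, X))
```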

fzyzcjy commented 2 weeks ago

@danielhanchen No worries, if it is implementable I will try to do it.

But the first (and most important) question: are there any other missing pieces needed to make Unsloth do full finetuning?

(Wait a few minutes and I will create a new issue with more details.)

fzyzcjy commented 2 weeks ago

My thoughts here: https://github.com/unslothai/unsloth/issues/1176