unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

QDoRA: a scalable and memory-efficient method to close the gap between parameter-efficient finetuning and full finetuning #373

Open · sorasoras opened this issue 4 months ago

sorasoras commented 4 months ago

https://www.answer.ai/posts/2024-04-26-fsdp-qdora-llama3.html

That looks awesome!

kristaller486 commented 4 months ago

Really awesome

danielhanchen commented 4 months ago

Oh yep saw it! We might instead be adding LoRA+, which has similar results. Technically DoRA is already enabled in Unsloth, but just not optimized

sorasoras commented 4 months ago

> Oh yep saw it! We might instead be adding LoRA+, which has similar results. Technically DoRA is already enabled in Unsloth, but just not optimized

DoRA + QLoRA works already? I thought you could implement this on top of the current implementation of DoRA.

danielhanchen commented 4 months ago

@sorasoras yes but it's not that optimized - only somewhat

adamo1139 commented 4 months ago

@danielhanchen

What are the current limitations of QDoRA in Unsloth? I can't get it to work with FA2. It seems to work without FA2, but only at very low context lengths. Should FA2 with QDoRA be supported by the current version of Unsloth or not?

Here's a traceback of the FA2 failure in case you want to take a look.

```
Unsloth cannot patch MLP layers with our manual autograd engine since either LoRA adapters are not enabled or a bias term (like in Qwen) is used.
Unsloth cannot patch Attention layers with our manual autograd engine since either LoRA adapters are not enabled or a bias term (like in Qwen) is used.
Unsloth cannot patch O projection layer with our manual autograd engine since either LoRA adapters are not enabled or a bias term (like in Qwen) is used.
Unsloth 2024.4 patched 60 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
trainable params: 249,630,720 || all params: 34,638,547,968 || trainable%: 0.7206731651413778
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 107,714 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 8
\        /    Total batch size = 8 | Total steps = 13,464
 "-____-"     Number of trainable parameters = 249,630,720
  0%|          | 0/13464 [00:00
    sft_trainer.train()
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 361, in train
    output = super().train(*args, **kwargs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "", line 361, in _fast_inner_training_loop
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/transformers/trainer.py", line 3138, in training_step
    loss = self.compute_loss(model, inputs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/transformers/trainer.py", line 3161, in compute_loss
    outputs = model(**inputs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/unsloth/models/llama.py", line 882, in PeftModelForCausalLM_fast_forward
    return self.base_model(
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
    return self.model.forward(*args, **kwargs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/unsloth/models/llama.py", line 813, in _CausalLM_fast_forward
    outputs = self.model(
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/unsloth/models/llama.py", line 680, in LlamaModel_fast_forward
    layer_outputs = decoder_layer(
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/unsloth/models/llama.py", line 433, in LlamaDecoderLayer_fast_forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/unsloth/models/llama.py", line 359, in LlamaAttention_fast_forward
    A = flash_attn_func(Q, K, V, causal = True)
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 831, in flash_attn_func
    return FlashAttnFunc.apply(
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 511, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_forward(
  File "/media/adamo/82142F79142F6EFB/ProgramData/Anaconda3/envs/unsloth4/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 51, in _flash_attn_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
RuntimeError: FlashAttention only support fp16 and bf16 data type
  0%|          | 0/13464 [00:00
```
DreamGenX commented 4 months ago

My understanding is that LoRA+ and DoRA are relatively orthogonal and likely stack.

danielhanchen commented 4 months ago

@adamo1139 QLoRA + DoRA should work, but it's not optimized. Simply turn it on with use_dora = True. Unsure about that exact error message, but if you use our Colab notebooks and just set that flag, it should work.
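
For anyone landing here, a minimal sketch of what "just set that flag" might look like (model name and hyperparameters are placeholders, and I'm assuming `use_dora` is forwarded through `FastLanguageModel.get_peft_model` to PEFT's `LoraConfig`):

```python
from unsloth import FastLanguageModel

# Load a 4-bit base model (QLoRA-style quantization); model name is just an example.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Attach adapters; use_dora = True is the flag mentioned above (unoptimized DoRA on top of QLoRA).
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
    use_dora = True,
)
```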

@DreamGenX Oh DoRA gets rid of the alpha scaling for LoRA entirely, and makes it learnable
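
For readers unfamiliar with the mechanics: roughly, DoRA re-parameterizes the adapted weight into a direction (the normalized LoRA-updated matrix) times a learnable magnitude, which is what takes the place of the fixed alpha scaling. A back-of-the-envelope sketch in plain PyTorch (illustrative only; tensor names and the normalization axis are my own, not Unsloth's or PEFT's internals):

```python
import torch

d_out, d_in, rank = 64, 32, 4
W0 = torch.randn(d_out, d_in)                  # frozen pretrained weight
A  = torch.randn(rank, d_in) * 0.01            # LoRA "A" factor (trainable)
B  = torch.zeros(d_out, rank)                  # LoRA "B" factor (zero-init, trainable)
m  = W0.norm(dim=1, keepdim=True).clone()      # learnable magnitude, initialized from W0's norms

def dora_weight(W0, A, B, m):
    W = W0 + B @ A                              # low-rank update, as in LoRA
    direction = W / W.norm(dim=1, keepdim=True) # unit-norm direction component
    return m * direction                        # learnable magnitude replaces the fixed alpha scale

x = torch.randn(8, d_in)
y = x @ dora_weight(W0, A, B, m).T              # forward pass with the DoRA-composed weight
```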

julianstastny commented 2 months ago

@danielhanchen @adamo1139 I hit the same error in my codebase, but only when I used DoRA; I could also confirm that it works in the Colab.

After further investigation, I found that the difference is that in the Colab, flash-attn is not installed (xformers is installed, though).

I then uninstalled flash-attn from the environment where I usually run my code, and this got rid of the error. I wonder if removing flash-attn might lead to significant slowdowns, though?
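
If anyone else wants to check which attention backends are present in their environment, here's a small inspection snippet (my own sketch; `flash_attn` and `xformers` are just the import names of the packages discussed above):

```python
# Check whether flash-attn / xformers are importable in the current environment;
# per the comments above, the DoRA error only showed up when flash-attn was installed.
import importlib.util

for pkg in ("flash_attn", "xformers"):
    spec = importlib.util.find_spec(pkg)
    print(f"{pkg}: {'installed' if spec else 'not installed'}")

# Workaround reported in this thread: `pip uninstall flash-attn` and let Unsloth
# fall back to xformers.
```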

danielhanchen commented 1 month ago

No slowdowns at all!