unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Unexpected OOM issue with LoRA fine tune of Llama-3 (possibly perf hit as well) #1060

Open devzzzero opened 3 days ago

devzzzero commented 3 days ago

Hi. Something (possibly not unsloth) changed between July and now: I am getting an unexpected OOM error trying to do a LoRA finetune that worked fine before. I looked at #338, but nothing there immediately came to mind.

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
==((====))==  Unsloth: Fast Llama patching release 2024.5
   \\   /|    GPU: NVIDIA GeForce RTX 3090. Max memory: 23.683 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0. CUDA = 8.6. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, [...]
  0%|          | 0/769 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ai/LLM/PEFT/./astro.py", line 32, in <module>
    doit(sys.argv[1], sys.argv[2])
  File "/home/ai/LLM/PEFT/./astro.py", line 24, in doit
    AA.trainer.train()
  File "/home/ai/MiniConda3/envs/unsloth-env/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 361, in train
    output = super().train(*args, **kwargs)
  File "/home/ai/MiniConda3/envs/unsloth-env/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
  File "<string>", line 357, in _fast_inner_training_loop
  File "/home/ai/MiniConda3/envs/unsloth-env/lib/python3.10/site-packages/transformers/trainer.py", line 3307, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ai/MiniConda3/envs/unsloth-env/lib/python3.10/site-packages/transformers/trainer.py", line 3338, in compute_loss
    outputs = model(**inputs)
  File "/home/ai/MiniConda3/envs/unsloth-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ai/MiniConda3/envs/unsloth-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ai/MiniConda3/envs/unsloth-env/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
  File "/home/ai/MiniConda3/envs/unsloth-env/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/ai/MiniConda3/envs/unsloth-env/lib/python3.10/site-packages/accelerate/utils/operations.py", line 789, in convert_to_fp32
    return recursively_apply(_convert_to_fp32, tensor, test_type=_is_fp16_bf16_tensor)
  File "/home/ai/MiniConda3/envs/unsloth-env/lib/python3.10/site-packages/accelerate/utils/operations.py", line 118, in recursively_apply
    {
  File "/home/ai/MiniConda3/envs/unsloth-env/lib/python3.10/site-packages/accelerate/utils/operations.py", line 119, in <dictcomp>
    k: recursively_apply(
  File "/home/ai/MiniConda3/envs/unsloth-env/lib/python3.10/site-packages/accelerate/utils/operations.py", line 126, in recursively_apply
    return func(data, *args, **kwargs)
  File "/home/ai/MiniConda3/envs/unsloth-env/lib/python3.10/site-packages/accelerate/utils/operations.py", line 781, in _convert_to_fp32
    return tensor.float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.83 GiB. GPU 

The suspicious part is this bit of the stack trace:

  File "/home/ai/MiniConda3/envs/unsloth-env/lib/python3.10/site-packages/accelerate/utils/operations.py", line 789, in convert_to_fp32
    return recursively_apply(_convert_to_fp32, tensor, test_type=_is_fp16_bf16_tensor)

This is doing something completely unexpected: converting a bf16 tensor to fp32 (though this may be a red herring). I tried installing accelerate==0.34.1 and accelerate==0.30.0, but neither made any difference.
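
As a sanity check that the 7.83 GiB allocation could simply be this fp32 upcast of the logits, here is a rough back-of-envelope sketch. Only the Llama-3 vocab size of 128256 is a known quantity; the batch/sequence values below are guesses on my part, not taken from my script:

# Rough size of the float32 copy that convert_to_fp32 would allocate for
# a logits tensor of shape [batch, seq_len, vocab]. Llama-3's vocab is
# 128256; the batch/seq numbers below are guesses for illustration.
VOCAB_SIZE = 128_256
BYTES_PER_FP32 = 4

def logits_fp32_gib(batch_size, seq_len):
    return batch_size * seq_len * VOCAB_SIZE * BYTES_PER_FP32 / 2**30

print(logits_fp32_gib(2, 8192))  # ~7.83 GiB, matching the failed allocation
print(logits_fp32_gib(4, 4096))  # also ~7.83 GiB

So a single bf16-to-fp32 copy of the logits at a long sequence length is already in the right ballpark for the allocation that fails.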

This same procedure worked back in July, but something changed under the hood (possibly not unsloth's fault), so now I am stuck and unable to do a LoRA finetune of the llama-3-8b model.

I am currently forced to use SFTTrainer with

            per_device_train_batch_size = 1,
            gradient_accumulation_steps = 1,

which is pathetic!
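
For context, this is roughly the shape of the trainer setup I am running, following the old unsloth colab. It is a simplified sketch, not my exact astro.py; `model`, `tokenizer`, `dataset`, and `max_seq_length` stand in for the usual FastLanguageModel.from_pretrained / get_peft_model steps:

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,                  # FastLanguageModel with LoRA adapters attached
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    args = TrainingArguments(
        per_device_train_batch_size = 1,   # was 4 back in July
        gradient_accumulation_steps = 1,   # was 4 back in July
        bf16 = True,                       # the 3090 supports bf16
        output_dir = "outputs",
        report_to = "wandb",
    ),
)
trainer.train()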

Back in July, the LoRA finetune (on approximately the same size dataset) finished in under 3 hours on an RTX 3090, with per_device_train_batch_size=4 and gradient_accumulation_steps=4.

I am pretty much following the old unsloth colab from the 2024-05 release. From a cursory glance at the current colab (for Llama-3.1, not Llama-3), the instructions seem very similar.

I also tried the September-2024 tag of unsloth, and it barfs in pretty much the same way.

Can anyone send me pointers on where to start diagnosing this? With the current speed of roughly 14 seconds per iteration in the training loop, I am looking at a good 12 hours for something that took 3 hours just 2 months ago!
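
In case it helps with diagnosing, here is the kind of environment and GPU memory dump I can post, so the working July setup can be diffed against this one (just a quick sketch):

# Print the installed version stack plus current GPU memory stats.
import importlib.metadata as md
import torch

for pkg in ["unsloth", "torch", "transformers", "trl", "accelerate",
            "peft", "bitsandbytes", "xformers"]:
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")

print("CUDA runtime:", torch.version.cuda)
print("Total GPU memory (GiB):",
      torch.cuda.get_device_properties(0).total_memory / 2**30)
print("Allocated (GiB):", torch.cuda.memory_allocated() / 2**30)
print("Reserved  (GiB):", torch.cuda.memory_reserved() / 2**30)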

Help! Thank you!

devzzzero commented 3 days ago

It's running now with per_device_train_batch_size = 1 :-( ETA ~15 hours

danielhanchen commented 3 days ago

Sorry on the delay - will check ASAP and get back to you!

devzzzero commented 3 days ago

Thank you.