unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Inference speed is so slow #818

Open rumanxyz opened 1 month ago

rumanxyz commented 1 month ago

Hey,

I used Unsloth for faster finetuning of Gemma 2 9B, with the default configuration suggested by Unsloth.

Here's the public Colab for it: https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing

The only difference is that I used the unsloth/gemma-2-9b-it-bnb-4bit model instead of unsloth/gemma-2-9b as suggested by Unsloth.

I have a total of 2028 validation samples. During training, validation took ~30 minutes after every 300 steps, which is what we had configured.

But now, when I run the model on the 2028-sample set, it shows an estimated 15 hours on the same VM.

I used a validation batch size of 1, both during training and at inference time.

I'm using an A100 40GB machine.

Here's the code I'm using to load the model (P.S. these weights are checkpoints saved during training; I'm using the one with the lowest validation and training loss):

from unsloth import FastLanguageModel
import torch
max_seq_length = 8000 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True 

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "gemma2_exp1/checkpoint-5100", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
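
A hedged aside (not from the notebook): the first generate() call after loading often includes one-off setup and kernel compilation, so running a short throwaway generation before timing anything keeps that overhead out of the benchmark numbers.

warmup = tokenizer(["Hello"], return_tensors = "pt").to("cuda")
_ = model.generate(**warmup, max_new_tokens = 8, use_cache = True)  # discard warm-up output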

Function to run model inference:

import time

def get_model_infer(prompt):
    # Tokenize the prompt and time a single generate() call
    inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")
    start_time = time.time()
    outputs = model.generate(**inputs, max_new_tokens = 1500, use_cache = True)
    total_time = round(time.time() - start_time, 2)

    token_len = inputs["input_ids"].shape[1]  # number of prompt tokens

    return token_len, total_time, tokenizer.batch_decode(outputs)[0]

I'm calling get_model_infer() to get the model's generated output for each prompt and appending the response to a list.
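
Since get_model_infer() generates one prompt at a time, a large share of the 15-hour estimate is likely the 2028 separate generate() calls. Below is a minimal batched-generation sketch of my own (not from the notebook): batch size 8 is an arbitrary starting point for a 40 GB A100, and it assumes the tokenizer defines a pad token (Gemma's does).

tokenizer.padding_side = "left"  # left-pad so generation continues from the end of each prompt

def generate_batch(prompts, batch_size = 8):
    results = []
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i : i + batch_size]
        inputs = tokenizer(batch, return_tensors = "pt", padding = True).to("cuda")
        outputs = model.generate(**inputs, max_new_tokens = 1500, use_cache = True)
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens = True))
    return results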

Here's the code used during training:

1. To load the pre-trained model

from unsloth import FastLanguageModel
import torch
max_seq_length = 8000 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2-9b-it-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

This is the output of the above code snippet:

WARNING:xformers:WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.3.0+cu121 with CUDA 1201 (you have 2.2.0+cu121)
    Python 3.10.14 (you have 3.10.14)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth: Fast Gemma2 patching release 2024.7
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.381 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
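
Worth flagging (an observation, not advice from the thread): the warning above says this xFormers build targets PyTorch 2.3.0 while PyTorch 2.2.0 is installed, so memory-efficient attention is disabled, which by itself can slow both training and generation. A quick way to confirm what is actually installed before reinstalling a matching pair:

import torch
import xformers

# Compare these against each other; the xformers release notes list which
# PyTorch version each build was compiled for.
print("PyTorch :", torch.__version__)
print("xformers:", xformers.__version__)
print("CUDA available:", torch.cuda.is_available())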

2. To set the LoRA adapter

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
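
As a quick sanity check (my suggestion, assuming the object returned above is a standard PEFT model, which exposes print_trainable_parameters()): with r = 16 on the seven target modules, only a small fraction of the 9B parameters should be reported as trainable.

model.print_trainable_parameters()  # reports trainable params vs. total params and the percentage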


3. SFT trainer config

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

epochs = 10
batch_size = 2
steps = int(df_train.shape[0] / batch_size) + 1
total_steps = steps * epochs

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = data_train,
    eval_dataset = data_eval,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 4,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = batch_size,
        per_device_eval_batch_size = 1,
        gradient_accumulation_steps = 2,
        warmup_steps = 5,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "gemma2_exp1",
        num_train_epochs = epochs,
        max_steps = total_steps,
        save_steps = 300,
        save_total_limit = 2,
        evaluation_strategy = "steps",
        eval_steps = 300,
        do_eval = True,
    ),
)



I'm new to Unsloth; please help me.

DaddyCodesAlot commented 1 month ago

Looking at your code, you're training for 10 epochs, which means you're going through your entire dataset 10 times. Your batch size per step is 2, so you're only looking at 2 examples per step, and with gradient accumulation set to 2 you're effectively processing about 4 examples per training step. That means you're going to run quite a few steps for the finetuning alone.
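
As a rough sketch of that arithmetic (my own, not from the thread; n_train is a placeholder because df_train's size isn't shown in the issue). Note also that the script's steps = int(df_train.shape[0] / batch_size) + 1 divides by the batch size only, not by gradient_accumulation_steps, so the max_steps it passes to TrainingArguments requests roughly twice the optimizer steps a true 10-epoch run would take.

n_train = 10_000                               # placeholder: actual training-set size
effective_batch = 2 * 2                        # per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = n_train // effective_batch   # optimizer steps per epoch
requested_max_steps = (n_train // 2 + 1) * 10  # what the script passes as max_steps
print(10 * steps_per_epoch, "steps for 10 true epochs vs", requested_max_steps, "requested")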

Also, your validation batch size is set to 1 and you have 2028 examples, so you're effectively running 2028 evaluation steps every 300 training steps, which is going to take quite some time.
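
One hedged way to cut that evaluation cost (my sketch; the subset size and eval batch size below are arbitrary starting points, and it assumes data_eval is a Hugging Face Dataset so .select() exists): evaluate on a smaller sample with a larger eval batch size, keeping the rest of the TrainingArguments as in the original config.

small_eval = data_eval.shuffle(seed = 3407).select(range(256))  # arbitrary 256-sample eval subset

args = TrainingArguments(
    output_dir = "gemma2_exp1",
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 2,
    per_device_eval_batch_size = 8,   # batch eval passes instead of running them one by one
    evaluation_strategy = "steps",
    eval_steps = 600,                 # evaluate half as often
    # ... remaining arguments unchanged from the original config
)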

Also, I'm not sure a 9-billion-parameter model can pick up much from such a huge dataset with such a low batch size, especially with a rank of 16. Maybe scale your dataset down, increase the rank, or use a larger model like Gemma 2 27B. Also, 10 epochs is overkill for finetuning a LoRA; most people are typically content with just 3.

danielhanchen commented 1 month ago

Thanks for the detailed help @DaddyCodesAlot :) - agreed on using larger models like Gemma-2, and yes, 10 epochs is way overkill!! Appreciate it!

On the slow generation part - @rumanxyz apologies for the slow response - have you tried FastLanguageModel.for_inference(model)?

rumanxyz commented 1 month ago

Yes, I'm using that.

Here's the code I'm using to load the model:

from unsloth import FastLanguageModel
import torch
max_seq_length = 8000 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True 

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "gemma2_exp1/checkpoint-4800", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

@DaddyCodesAlot and yes, this is overkill, but the focus here is just to benchmark the TPS on different machines.
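
Since the goal is tokens-per-second benchmarking, here is a small helper on top of the get_model_infer() above (my sketch, not from the thread): prompts is a hypothetical list of prompt strings, and the generated-token count is approximated by re-tokenizing the decoded output and subtracting the prompt length.

def benchmark_tps(prompts):
    total_new_tokens, total_seconds = 0, 0.0
    for prompt in prompts:
        prompt_len, seconds, text = get_model_infer(prompt)
        out_len = len(tokenizer(text)["input_ids"])      # prompt + generated tokens (approximate)
        total_new_tokens += max(out_len - prompt_len, 0)
        total_seconds += seconds
    return total_new_tokens / total_seconds

# Example usage:
# print(f"~{benchmark_tps(prompts):.1f} generated tokens/sec")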