Open rumanxyz opened 1 month ago
Looking at your code, you're training for 10 epochs, which means you go through your entire dataset 10 times. Your per-device batch size is 2 and your gradient accumulation is set to 2, so each optimizer step processes only 4 examples. That means you're going to run quite a few steps for finetuning alone.
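To make that arithmetic concrete, here is a rough sketch of how the step count falls out of those settings. The training set size is not given in this thread, so `num_examples = 20_000` is a purely hypothetical placeholder, a single GPU is assumed, and the exact rounding depends on the trainer.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Return (examples per optimizer step, steps per epoch, total steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

# Hypothetical dataset size; batch size, accumulation and epochs are from the thread.
effective_batch, per_epoch, total = optimizer_steps(20_000, 2, 2, 10)
print(effective_batch, per_epoch, total)
```

Raising the per-device batch size or the accumulation steps shrinks the step count proportionally.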
Also, your validation batch size is set to 1 and you have 2028 examples, so you're effectively running 2028 evaluation steps every 300 training steps, which is going to take quite some time.
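A back-of-the-envelope sketch of that evaluation overhead, assuming the ~30 minutes per validation pass reported in this thread and a hypothetical total of 50,000 training steps (the real number depends on the training set size, which is not given):

```python
def eval_overhead(total_steps, eval_steps, eval_examples, eval_batch_size, minutes_per_eval):
    """Estimate how many validation passes run and how long they take in total."""
    eval_rounds = total_steps // eval_steps
    batches_per_round = -(-eval_examples // eval_batch_size)  # ceiling division
    total_hours = eval_rounds * minutes_per_eval / 60
    return eval_rounds, batches_per_round, total_hours

rounds, batches, hours = eval_overhead(
    total_steps=50_000,  # hypothetical, not from the thread
    eval_steps=300,
    eval_examples=2028,
    eval_batch_size=1,
    minutes_per_eval=30,
)
print(rounds, batches, hours)
```

Evaluating less often (a larger `eval_steps`) or with a larger eval batch size are the two obvious levers here.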
Also, I'm not sure a 9 billion parameter model can pick up much from such a huge dataset with such a low batch size, especially with a rank of 16. Maybe scale your dataset down, increase the rank, or use a larger model like Gemma 2 27B. Also, 10 epochs is overkill for finetuning a LoRA; most people are content with just 3.
Thanks for the detailed help @DaddyCodesAlot :) - agreed on using larger models like Gemma-2 and ye 10 epochs is wayyyy overkill!! Appreciate it!
On the slow generation part - @rumanxyz apologies for the slow response - have you tried FastLanguageModel.for_inference(model)?
Yes, I'm using that.
Here's the code I'm using to load the model:
from unsloth import FastLanguageModel
import torch
max_seq_length = 8000 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "gemma2_exp1/checkpoint-4800", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
@DaddyCodesAlot and yes, this is overkill, but the focus here is just to benchmark the TPS on different machines.
Hey,
I used unsloth for faster finetuning of Gemma 2 9B, with the default configuration suggested by unsloth.
Here's the public Colab for it: https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing
The only difference is that I used the unsloth/gemma-2-9b-it-bnb-4bit model instead of unsloth/gemma-2-9b, as suggested by unsloth.
I have 2028 validation samples in total. During training, validation took ~30 minutes after every 300 steps, which is the interval we had configured.
But now, when I run the model on the same 2028-sample set, the estimated time is 15 hrs on the same VM.
I used a validation batch size of 1, both during training and at inference time.
I'm using an A100 40GB machine.
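For reference, the two timings work out as below. One likely contributor to the gap (an assumption, not something measured in this thread) is that validation during training needs a single forward pass per sample to compute the loss, while generation runs one forward pass per new token.

```python
def seconds_per_sample(total_minutes, num_samples):
    """Average wall-clock seconds spent on each sample."""
    return total_minutes * 60 / num_samples

num_samples = 2028
eval_s = seconds_per_sample(30, num_samples)       # validation pass during training
gen_s = seconds_per_sample(15 * 60, num_samples)   # estimated generation run
print(f"{eval_s:.2f} s/sample vs {gen_s:.2f} s/sample ({gen_s / eval_s:.0f}x slower)")
```

If the per-token explanation holds, the ratio should track the number of tokens generated per sample, so capping the generation length or batching prompts would be the first things to try.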
Here's the code I'm using to load the model (PS: these weights are checkpoints saved during training; I'm using the one with the lowest validation and training loss).
Function to run the model inference:
I'm calling get_model_infer() to get the model's generation output for a given prompt and appending the response to a list.
Here's the code used during training:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2-9b-it-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
WARNING:xformers:WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.3.0+cu121 with CUDA 1201 (you have 2.2.0+cu121)
    Python 3.10.14 (you have 3.10.14)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MOREDETAILS=1 for more details
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth: Fast Gemma2 patching release 2024.7
\ /|  GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.381 GB. Platform = Linux.
O^O/ \/ \  Pytorch: 2.2.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\ /  Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = False]
"-____-"  Free Apache license: http://github.com/unslothai/unsloth
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none", # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
)
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

epochs = 10
batch_size = 2
steps = int(df_train.shape[0] / batch_size) + 1
total_steps = steps * epochs

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = data_train,
    eval_dataset = data_eval,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 4,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = batch_size,
        per_device_eval_batch_size = 1,
        gradient_accumulation_steps = 2,
        warmup_steps = 5,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "gemma2_exp1",
        num_train_epochs=epochs,
        max_steps=total_steps,
        save_steps=300,
        save_total_limit=2,
        evaluation_strategy="steps",
        eval_steps=300,
        do_eval=True,
    ),
)
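One thing worth checking in the arguments above: in transformers' `TrainingArguments`, a positive `max_steps` overrides `num_train_epochs`, and each step is an optimizer step that consumes `per_device_train_batch_size * gradient_accumulation_steps` examples. Because `steps` is computed from `batch_size` alone, the run may cover roughly twice as many epochs as intended. A sketch of the arithmetic, with a hypothetical training set size (the real one is not given in this thread) and a single GPU assumed:

```python
def effective_epochs(num_examples, batch_size, grad_accum, epochs):
    """Return (max_steps, passes over the data) for the formula used in the script."""
    # Same formula as in the training script: gradient accumulation is not included.
    steps = int(num_examples / batch_size) + 1
    max_steps = steps * epochs
    # Each optimizer step consumes batch_size * grad_accum examples on one GPU.
    examples_seen = max_steps * batch_size * grad_accum
    return max_steps, examples_seen / num_examples

# 20_000 is a hypothetical dataset size; the other values are from the script.
max_steps, passes = effective_epochs(20_000, 2, 2, 10)
print(max_steps, round(passes, 2))
```

If that is not intended, dropping `max_steps` and relying on `num_train_epochs`, or dividing `steps` by the accumulation factor, would bring the run back to the configured number of epochs.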