unslothai / unsloth

Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Why is GPU RAM not increasing when increasing batch size? #714

Open adivoj opened 3 days ago

adivoj commented 3 days ago

Hi,

I just came from axolotl and I'm impressed! I get 10x faster training on phi3-mini-4k, and I can run the adapter in vLLM (I couldn't with axolotl; vLLM says the lm_head module is not supported).

Question 1: Why doesn't increasing the batch size increase GPU RAM usage? I tried a batch size of 50 and then 100, and it stayed the same. On an A100 40GB it started at 15GB and then dropped to 13GB. Changing the LoRA parameters does have an effect, but it looks like the batch size doesn't affect the speed either?

Question 2: Since my training examples are at most 768 tokens, would SFTTrainer's packing = True make training any faster if I also set max_seq_length to 768 * 5?

My config:

```python
max_seq_length = 768
dtype = None
load_in_4bit = True

model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16, lora_dropout = 0, bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407, use_rslora = False, loftq_config = None,
)

trainer = SFTTrainer(
    model = model, tokenizer = tokenizer,
    train_dataset = dataset["train"], dataset_text_field = "text",
    max_seq_length = max_seq_length, dataset_num_proc = 2,
    callbacks = [CustomTrainerCallback], packing = False,
    args = TrainingArguments(
        num_train_epochs = 1,
        per_device_train_batch_size = 100,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 20,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
```
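
One way to check whether a larger batch size really changes peak GPU memory is to read PyTorch's CUDA allocator counters around the training run. A minimal sketch, assuming the `trainer` built above:

```python
import torch

# Clear the allocator's peak-memory counters before the run.
torch.cuda.reset_peak_memory_stats()

trainer.train()

# Peak memory actually allocated to tensors during training (GB).
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
# Peak memory reserved by the caching allocator (closer to what nvidia-smi shows).
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 1024**3:.2f} GB")
```

nvidia-smi roughly tracks the reserved figure, so the allocated counter is the more direct signal of whether the batch size is having an effect.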

danielhanchen commented 3 days ago

Thanks for the kind words :)

  1. Batch size does, and should, increase VRAM; however, in some cases, since the sequence lengths here are tiny, our RAM offloading "masks" it, so you won't see any increase in VRAM.
  2. Packing should make it faster; see the sketch below.
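
For reference, a minimal sketch of the packing change from Question 2, reusing the names from the config above (`training_args` here is a placeholder for the same `TrainingArguments` shown earlier):

```python
# Packing concatenates short examples into one sequence of up to
# max_seq_length tokens, so each batch carries far fewer padding tokens.
max_seq_length = 768 * 5   # room for roughly five ~768-token examples per packed sequence

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset["text" if False else "train"],  # same split as above
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    packing = True,          # was False in the original config
    args = training_args,    # placeholder for the TrainingArguments shown earlier
)
```

Note that packed sequences are about 5x longer, so the per-device batch size may need to come down to keep memory in check.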