unslothai / unsloth

Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Why is GPU RAM not increasing when increasing batch size? #714

Open adivoj opened 3 days ago

adivoj commented 3 days ago

Hi,

I just came from axolotl and I'm impressed! I get 10x faster training on phi3-mini-4k, and I can run the adapter in vLLM (I couldn't with axolotl; vLLM says the lm_head module is not supported).

Question 1: Why doesn't increasing the batch size increase GPU RAM usage? I tried a batch size of 50 and then 100, and it stayed the same. On an A100 40GB it started at 15GB and then dropped to 13GB. Changing the LoRA parameters does have an effect, but it looks like the batch size doesn't affect the speed either?

Question 2: Since my training examples are at most 768 tokens, would SFTTrainer's packing = True make training any faster if I also set max_seq_length to 768 * 5?

My config:

```python
max_seq_length = 768
dtype = None
load_in_4bit = True

model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16, lora_dropout = 0, bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407, use_rslora = False, loftq_config = None,
)

trainer = SFTTrainer(
    model = model, tokenizer = tokenizer,
    train_dataset = dataset["train"], dataset_text_field = "text",
    max_seq_length = max_seq_length, dataset_num_proc = 2,
    callbacks = [CustomTrainerCallback], packing = False,
    args = TrainingArguments(
        num_train_epochs = 1,
        per_device_train_batch_size = 100,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 20,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
```
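
One way to check whether a larger batch size really changes peak GPU memory is to read PyTorch's CUDA allocator counters around the training run. A minimal sketch, assuming the `trainer` built above:

```python
import torch

# Clear the allocator's peak-memory counters before the run.
torch.cuda.reset_peak_memory_stats()

trainer.train()

# Peak memory actually allocated to tensors during training (GB).
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
# Peak memory reserved by the caching allocator (closer to what nvidia-smi shows).
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 1024**3:.2f} GB")
```

nvidia-smi roughly tracks the reserved figure, so the allocated counter is the more direct signal of whether the batch size is having an effect.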

danielhanchen commented 3 days ago

Thanks for the kind words :)

  1. Batch size does, and should, increase VRAM; however, in some cases, since the sequence lengths here are tiny, our RAM offloading "masks" it, so you won't see any increase in VRAM.
  2. Packing should make it faster; see the sketch below.
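
For reference, a minimal sketch of the packing change from Question 2, reusing the names from the config above (`training_args` here is a placeholder for the same `TrainingArguments` shown earlier):

```python
# Packing concatenates short examples into one sequence of up to
# max_seq_length tokens, so each batch carries far fewer padding tokens.
max_seq_length = 768 * 5   # room for roughly five ~768-token examples per packed sequence

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset["text" if False else "train"],  # same split as above
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    packing = True,          # was False in the original config
    args = training_args,    # placeholder for the TrainingArguments shown earlier
)
```

Note that packed sequences are about 5x longer, so the per-device batch size may need to come down to keep memory in check.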