Hi,
I just came from axolotl and I'm impressed! I get a 10x speedup on phi3-mini-4k, and I can run the resulting adapter in vLLM (I couldn't with the axolotl one; vLLM says the lm_head module is not supported).

Question 1: Why doesn't increasing the batch size increase GPU RAM usage? I tried a batch size of 50 and then 100, and usage stayed the same. On an A100 40GB it started at 15GB and then dropped to 13GB. Changing the LoRA parameters does have an effect, but it looks like the batch size has no effect, not even on speed?

Question 2: Since the longest example in my training data is 768 tokens, would SFTTrainer's packing = True make training any faster if I also set max_seq_length to 768 * 5?
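To make Question 2 concrete, this is the change I have in mind (a sketch only, not something I've run; training_args here is just a stand-in for the TrainingArguments in my config below):

from trl import SFTTrainer

# Same model / tokenizer / dataset as in my config below; only these two settings change.
trainer_packed = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset['train'],
    dataset_text_field = "text",
    max_seq_length = 768 * 5,   # room to pack several <= 768-token examples into one sequence
    packing = True,             # concatenate short examples instead of padding each one
    args = training_args,       # stand-in for the TrainingArguments shown below
)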
My config:
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

max_seq_length = 768
dtype = None          # auto-detect (float16 / bfloat16)
load_in_4bit = True

# model, tokenizer = FastLanguageModel.from_pretrained(...)  # phi3-mini-4k load step omitted in the post

model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset['train'],
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    callbacks = [CustomTrainerCallback],
    packing = False,
    args = TrainingArguments(
        num_train_epochs = 1,
        per_device_train_batch_size = 100,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 20,   # note: max_steps overrides num_train_epochs when both are set
    ),
)

Answer: Batch size does, and should, increase VRAM usage. However, in some cases, since the sequence lengths are small, our RAM offloading "masks" it, so you won't see any increase in VRAM.
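If you want to check what the batch size is actually doing to memory, one way is to log PyTorch's peak allocation around a short run. A minimal sketch with plain PyTorch (not Unsloth-specific; nvidia-smi will report more than this, since it also counts the CUDA context and the caching allocator's reserved memory):

import torch

torch.cuda.reset_peak_memory_stats()
trainer.train()   # repeat with per_device_train_batch_size = 50, then 100

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated VRAM: {peak_gb:.2f} GB")

With use_gradient_checkpointing = "unsloth" enabled (as in the config above), activations are offloaded off the GPU, which is presumably the offloading referred to in the answer, so the peak may barely move even as the batch size grows.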