unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Different batch sizes (1, 2, 4), same training speed #1155

Open fzyzcjy opened 1 month ago

fzyzcjy commented 1 month ago

Hi, thanks for the library! When using Unsloth to SFT Llama 3.2 1B on a 4090D, I noticed something interesting: changing the batch size from 1 to 4 does not speed up training.

Three configurations, each keeping the effective batch size at 16 (a setup sketch follows at the end of this comment):

            per_device_train_batch_size=2, gradient_accumulation_steps=8
            per_device_train_batch_size=4, gradient_accumulation_steps=4
            per_device_train_batch_size=1, gradient_accumulation_steps=16

Speed: 100 steps (i.e. 100×16 samples) take 70s, 70s, and 75s respectively.

Usually a batch size as small as 1 leads to lower throughput, so I am opening this issue in case something in Unsloth can be optimized further, i.e. to make the batch_size=2 or 4 cases faster than the batch_size=1 case.
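For reference, a minimal sketch of how such a comparison can be set up with Unsloth and TRL's SFTTrainer; the model id, dummy dataset, and LoRA settings below are illustrative assumptions, not the exact script used for the timings above:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset

# Load the base model with Unsloth (model id and settings are assumptions).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Dummy SFT data, just so the sketch runs end to end.
dataset = Dataset.from_dict({"text": ["Hello world."] * 1600})

# Each pair keeps the effective batch size at 16 samples per optimizer step.
configs = [(2, 8), (4, 4), (1, 16)]

for bsz, accum in configs:
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=2048,
        args=TrainingArguments(
            per_device_train_batch_size=bsz,
            gradient_accumulation_steps=accum,
            max_steps=100,
            logging_steps=10,
            output_dir="outputs",
        ),
    )
    trainer.train()  # time these 100 steps for each configuration
```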

danielhanchen commented 1 month ago

@fzyzcjy It's entirely possible that the padding of the larger batches is what's slowing things down - one trick is to set group_by_length = True to reduce padding
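For context, group_by_length is the standard transformers TrainingArguments flag that buckets samples of similar length into the same batch, reducing padding; a minimal sketch of where it goes (the other values are illustrative):

```python
from transformers import TrainingArguments

# Same kind of training arguments as above, with length grouping switched on.
args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    group_by_length=True,  # bucket similar-length samples together to cut padding
    output_dir="outputs",
)
```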

fzyzcjy commented 1 month ago

@danielhanchen I see, thank you!

Btw, I wonder whether group_by_length will hurt model performance, e.g. because the gradient of each step is then computed from samples of similar lengths.

danielhanchen commented 1 month ago

Yes, it will sadly hurt the training process.

fzyzcjy commented 1 month ago

I see. Thank you for the explanation!

(Btw I am happy to PR to implement https://github.com/unslothai/unsloth/issues/1021, feel free to ping me if needed)