unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0
15.24k stars 1.02k forks

regarding inference speed #517

Open Shuaib11-Github opened 3 months ago

Shuaib11-Github commented 3 months ago

Hello, I have fine-tuned a Phi-3 model using Unsloth and everything works fine, but the issue is inference time. The Colab notebook mentions 2x faster inference, but when I compared the original (untuned) model against the fine-tuned one on the Alpaca dataset example, the original model produced faster inference.

Can you share any insights on why it is slower than the original model during inference, even though 2x faster inference is advertised?

danielhanchen commented 3 months ago

@Shuaib11-Github Oh yes you asked in Discord!

  1. Unsloth inference makes LoRA / QLoRA 2x faster. You benchmarked HF without any adapters. Best to merge the adapters first, then benchmark.
  2. Your HF model output has fewer tokens than Unsloth's, so a fairer comparison is the time taken divided by the number of generated tokens.
  3. Phi-3 mini in 4bit might actually be slower than pure 16bit, since memory bandwidth is less of an issue.
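Point 2 can be sketched as a small per-token normalization helper. This is a sketch only: `seconds_per_token` and `timed_generate` are hypothetical names, not part of Unsloth or Transformers, and `timed_generate` assumes a Hugging Face-style model/tokenizer pair.

```python
import time

def seconds_per_token(elapsed, total_len, prompt_len):
    """Normalize wall-clock time by the number of newly generated tokens."""
    n_new = max(total_len - prompt_len, 1)  # guard against zero new tokens
    return elapsed / n_new

def timed_generate(model, tokenizer, prompt, max_new_tokens=128):
    # Hypothetical harness: times one generate() call and normalizes it,
    # so models that emit different output lengths stay comparable.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    return seconds_per_token(elapsed, out.shape[-1], inputs["input_ids"].shape[-1])
```

Comparing seconds-per-token instead of raw wall-clock time removes the bias from one model simply generating fewer tokens.
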
Shuaib11-Github commented 3 months ago

The Unsloth model is still slower than the original model.

Please check the Colab link and suggest what changes to make:

https://colab.research.google.com/drive/1LLWoaQrH8KFkQlE4ONwwtC4tC1-1It2X?usp=sharing

danielhanchen commented 3 months ago

@Shuaib11-Github Oh yes, I checked and responded on Discord: [screenshot]

Unsloth 16bit is 2x faster than HF inference, and 4bit is ~1.42x faster than HF using your exact notebook, also tested with a new prompt, "Write 1 to infinity.", for a fair comparison. Also, you forgot to use FastLanguageModel.for_inference(model) for Unsloth inference. Another issue is that you need to run generation twice for warmup, so it's a bit slower at the start.
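The warmup point generalizes to any benchmark: discard the first run (which pays one-off costs such as kernel compilation and cache setup) and time only subsequent runs. A minimal sketch, where `timed_runs` is a hypothetical helper name:

```python
import time

def timed_runs(fn, n_runs=3, warmup=1):
    """Call fn `warmup` times untimed, then return per-call timings for n_runs."""
    for _ in range(warmup):
        fn()  # warmup call: absorbs one-off startup costs, result discarded
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings

# Usage sketch (not runnable here): enable Unsloth's fast path first with
#   FastLanguageModel.for_inference(model)
# then benchmark with e.g.
#   timed_runs(lambda: model.generate(**inputs, max_new_tokens=128))
```
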

danielhanchen commented 3 months ago

@Shuaib11-Github I made 2 reproducible notebooks using your exact example.

  1. Fast Unsloth 16bit version (2x faster), takes 5.94s / 3.33s / 2.6s: https://colab.research.google.com/drive/1C9DDEtZD1zKVSh3zG1dIflP5GXoT8s-e?usp=sharing
  2. Slow HF 16bit version, takes 7.77s / 5.11s / 4.19s: https://colab.research.google.com/drive/1NUWR7waGzCbnoGxfokFqhxMIpn8Zyka-?usp=sharing

Both have warmup periods, which is normal. Unsloth is 1.31x faster for run 1, 1.54x faster for run 2, and 1.62x faster for run 3.

It'll be much faster and approaches 2x with longer sequences, and also when you load in 4bit (these timings are 16bit).

Try the notebooks yourself to confirm if my timings are correct.
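The per-run speedups can be recomputed directly from the raw timings in the two notebooks; a quick sketch (rounding here differs slightly from the figures quoted above):

```python
# Timings from the two notebooks (seconds per run).
hf_times = [7.77, 5.11, 4.19]       # plain HF 16bit
unsloth_times = [5.94, 3.33, 2.6]   # Unsloth 16bit

# Per-run speedup ratio: HF time / Unsloth time.
speedups = [hf / u for hf, u in zip(hf_times, unsloth_times)]
print([round(s, 2) for s in speedups])  # [1.31, 1.53, 1.61], close to the quoted 1.31x/1.54x/1.62x
```

The ratio grows from run 1 to run 3 because both setups amortize their warmup costs, but Unsloth's steady-state decode is faster.
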