Open Shuaib11-Github opened 3 months ago
@Shuaib11-Github Oh yes you asked in Discord!
The Unsloth model is still slower than the original model.
Please check the Colab link and suggest what I should change:
https://colab.research.google.com/drive/1LLWoaQrH8KFkQlE4ONwwtC4tC1-1It2X?usp=sharing
@Shuaib11-Github Oh yes I checked and responded on Discord:
Unsloth 16bit is 2x faster than HF inference, and 4bit is ~1.42x faster than HF,
using your exact notebook and also a new prompt "Write 1 to infinity." for a fair comparison.
Also, you forgot to call FastLanguageModel.for_inference(model)
before running Unsloth inference.
Another issue is that you need to run generation twice for warmup, so it's a bit slower at the start.
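To make the comparison fair, the timing loop should discard the warmup call. A minimal sketch of that pattern, using a stand-in workload so it runs anywhere (in the actual notebook the workload would be `model.generate(...)` after calling `FastLanguageModel.for_inference(model)`; the helper name and settings here are my own, not from the notebooks):

```python
import time

def benchmark(fn, *, warmup=1, runs=3):
    """Time fn() after discarding warmup calls. The first call pays
    one-off compilation/caching costs, so timing it would skew the
    HF-vs-Unsloth comparison."""
    for _ in range(warmup):
        fn()  # warmup: run but do not record
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings

# In the real notebook, roughly:
#   FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference path
#   benchmark(lambda: model.generate(**inputs))
# Here a cheap stand-in workload keeps the sketch self-contained:
timings = benchmark(lambda: sum(range(100_000)))
print(len(timings))  # one timing per measured run
```

The same helper can then time both the HF and the Unsloth model on identical prompts.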
@Shuaib11-Github I made 2 reproducible notebooks using your exact example.
Both have warmup periods, which is normal. Unsloth is 1.31x faster for run 1, 1.54x faster for run 2, and 1.62x faster for run 3.
It'll be much faster and approaches 2x with longer sequences, and also when you load in 4bit (these timings are at 16bit).
Try the notebooks yourself to confirm if my timings are correct.
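For clarity, each per-run speedup above is just the ratio of HF wall-clock time to Unsloth wall-clock time for that run. A sketch of the arithmetic (the times below are hypothetical values chosen to reproduce the ratios in this thread; the real numbers come from timing generation in the two notebooks):

```python
# Hypothetical per-run wall-clock times in seconds (illustrative only).
hf_times      = [10.48, 10.01, 10.206]
unsloth_times = [ 8.0,   6.5,   6.3]

# Speedup for each run = HF time / Unsloth time.
speedups = [hf / us for hf, us in zip(hf_times, unsloth_times)]
for i, s in enumerate(speedups, start=1):
    print(f"Unsloth is {s:.2f}x faster for run {i}")
```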
Hello, I have fine-tuned the Phi-3 model using Unsloth. Everything works fine, but the issue is inference time. The Colab notebook mentions 2x faster inference, but when I compared the original (untuned) model with the fine-tuned model, the original model produced faster inference on the Alpaca dataset example.
Can you share any insight into why it is slower than the original model during inference, even though it is advertised as 2x faster?