mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] Phi-2 slow performance #1781

Closed · dusty-nv closed this 5 months ago

dusty-nv commented 8 months ago

🐛 Bug

I have been able to compile the Phi-2 model with `mlc_chat compile`; however, even though it is the smallest model profiled, it vastly underperforms the other models:

| Model | Tokens/sec |
| --- | --- |
| phi-2-q4f16_1 | 30 |
| stablelm-3b-4e1t-q4f16_1 | 66 |
| stablelm-3b-4e1t-q4f16_ft | 94 |
| Llama-2-7b-chat-hf-q4f16_1 | 36 |
| Llama-2-7b-chat-hf-q4f16_ft | 45 |
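
For context, numbers like these are typically collected with the `mlc_chat` Python API of that era. The sketch below is not taken from the issue; the model names are placeholders matching the table above, and you would point them at your own compiled model directories and libraries:

```python
# Sketch: measuring decode throughput with the mlc_chat Python API
# (API as of early 2024; model names below are placeholders -- point
# them at your compiled model artifacts).
from mlc_chat import ChatModule

MODELS = [
    "phi-2-q4f16_1",
    "stablelm-3b-4e1t-q4f16_1",
    "Llama-2-7b-chat-hf-q4f16_1",
]

for name in MODELS:
    # "cuda" also covers Jetson AGX Orin, which runs the CUDA backend.
    cm = ChatModule(model=name, device="cuda")
    # benchmark_generate runs a single prefill + decode pass without
    # conversation-template overhead, which is what the table measures.
    cm.benchmark_generate(prompt="Explain GPU memory bandwidth.",
                          generate_length=256)
    print(name, "->", cm.stats())  # reports prefill/decode tokens/sec
    cm.reset_chat()
```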

Perhaps part of it is the missing q4f16_ft quantization support (see https://github.com/mlc-ai/mlc-llm/issues/1780). However, even when comparing its performance to other models using q4f16_1 quantization, something seems awry for it to be that slow. Is there something inherent in the Phi-2 architecture that prevents it from being optimized, or is some other issue occurring?

(This was on a Jetson AGX Orin, by the way. Do you see similar performance on a 4090 relative to the other models?)

Environment

Hzfengsy commented 8 months ago

It's quite interesting to see such a huge gap between q4f16_1 and q4f16_ft on AGX Orin. cc @vinx13

louis030195 commented 5 months ago

Thanks for sharing!

It would be amazing to see performance numbers for different models across different hardware somewhere, especially newer models like Llama 3. I'd be interested to know how Llama 3 performs on Jetson Orin.