It's quite interesting to see such a huge gap between `q4f16_1` and `q4f16_ft` on AGX Orin. cc @vinx13

Thanks for sharing!
It would be great to see performance numbers for different models on different hardware collected somewhere, especially newer models like Llama 3. I'd be interested to know how Llama 3 performs on Jetson Orin.
🐛 Bug
I have been able to compile the Phi-2 model with `mlc_chat compile` (a sketch of the build commands follows the list below). However, even as the smallest model profiled, it vastly underperforms the other models:

- phi-2-q4f16_1
- stablelm-3b-4e1t-q4f16_1
- stablelm-3b-4e1t-q4f16_ft
- Llama-2-7b-chat-hf-q4f16_1
- Llama-2-7b-chat-hf-q4f16_ft
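For reference, a typical build flow with the `mlc_chat` CLI looks roughly like the following. The paths, output names, and flag spellings here are illustrative assumptions rather than the verbatim command line used on the Orin, so check them against the docs for the pinned commit:

```bash
# Illustrative sketch of the mlc_chat build flow
# (paths and flags are assumptions, not copied from the actual run)

# 1) Quantize the HuggingFace checkpoint to q4f16_1
mlc_chat convert_weight ./dist/models/phi-2 \
    --quantization q4f16_1 \
    -o ./dist/phi-2-q4f16_1-MLC

# 2) Generate the runtime chat config for the quantized weights
mlc_chat gen_config ./dist/models/phi-2 \
    --quantization q4f16_1 \
    --conv-template phi-2 \
    -o ./dist/phi-2-q4f16_1-MLC

# 3) Compile the model library for the local CUDA GPU
mlc_chat compile ./dist/phi-2-q4f16_1-MLC/mlc-chat-config.json \
    --device cuda \
    -o ./dist/libs/phi-2-q4f16_1-cuda.so
```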
Perhaps part of it is that it's missing `q4f16_ft` quantization support (see https://github.com/mlc-ai/mlc-llm/issues/1780). However, even when comparing its performance to other models using `q4f16_1` quantization, it seems like something is awry for it to be that slow. Is there something inherent in the Phi-2 LLM architecture that prevents it from being optimized, or is some other issue occurring?

(This was on Jetson AGX Orin, by the way; do you see similar performance on a 4090 relative to other models?)
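For anyone who wants to reproduce the comparison, a minimal way to collect per-model prefill/decode throughput is the Python `ChatModule` API. The model directory, library path, prompt, and generation length below are placeholders, not values from the original benchmark:

```python
from mlc_chat import ChatModule

# Placeholder model dir / library path -- adjust to the local dist/ layout.
cm = ChatModule(
    model="dist/phi-2-q4f16_1-MLC",
    model_lib_path="dist/libs/phi-2-q4f16_1-cuda.so",
    device="cuda",
)

# Generate a fixed-length completion so decode throughput is comparable
# across models, then print the runtime's prefill/decode tok/s stats.
cm.benchmark_generate(prompt="Describe the Jetson AGX Orin.", generate_length=256)
print(cm.stats())
```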
Environment

- How you installed MLC-LLM (conda, source): source
- How you installed TVM-Unity (pip, source): source
- TVM Unity Hash Tag (`python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"`, applicable if you compile models): f8b2ff1bfea46ddbab905267f782c3fd9482a470