It's quite interesting to see such a huge gap between `q4f16_1` and `q4f16_ft` on AGX Orin. cc @vinx13

Thanks for sharing!
It would be great to see performance numbers for different models on different hardware collected somewhere, especially newer models like Llama 3. I'd be interested to know how Llama 3 performs on Jetson Orin.
🐛 Bug
I have been able to compile the Phi-2 model with `mlc_chat compile` (a sketch of the build commands follows the list below). However, even as the smallest model profiled, it vastly underperforms the other models:

- phi-2-q4f16_1
- stablelm-3b-4e1t-q4f16_1
- stablelm-3b-4e1t-q4f16_ft
- Llama-2-7b-chat-hf-q4f16_1
- Llama-2-7b-chat-hf-q4f16_ft
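For reference, a typical build flow with the `mlc_chat` CLI looks roughly like the following. The paths, output names, and flag spellings here are illustrative assumptions rather than the verbatim command line used on the Orin, so check them against the docs for the pinned commit:

```bash
# Illustrative sketch of the mlc_chat build flow
# (paths and flags are assumptions, not copied from the actual run)

# 1) Quantize the HuggingFace checkpoint to q4f16_1
mlc_chat convert_weight ./dist/models/phi-2 \
    --quantization q4f16_1 \
    -o ./dist/phi-2-q4f16_1-MLC

# 2) Generate the runtime chat config for the quantized weights
mlc_chat gen_config ./dist/models/phi-2 \
    --quantization q4f16_1 \
    --conv-template phi-2 \
    -o ./dist/phi-2-q4f16_1-MLC

# 3) Compile the model library for the local CUDA GPU
mlc_chat compile ./dist/phi-2-q4f16_1-MLC/mlc-chat-config.json \
    --device cuda \
    -o ./dist/libs/phi-2-q4f16_1-cuda.so
```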
Perhaps part of it is that it's missing `q4f16_ft` quantization support (see https://github.com/mlc-ai/mlc-llm/issues/1780). However, even when comparing its performance to other models using `q4f16_1` quantization, it seems like something is awry for it to be that slow. Is there something inherent in the Phi-2 LLM architecture that prevents it from being optimized, or is some other issue occurring?

(This was on Jetson AGX Orin, by the way; do you see similar performance on a 4090 relative to other models?)
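For anyone who wants to reproduce the comparison, a minimal way to collect per-model prefill/decode throughput is the Python `ChatModule` API. The model directory, library path, prompt, and generation length below are placeholders, not values from the original benchmark:

```python
from mlc_chat import ChatModule

# Placeholder model dir / library path -- adjust to the local dist/ layout.
cm = ChatModule(
    model="dist/phi-2-q4f16_1-MLC",
    model_lib_path="dist/libs/phi-2-q4f16_1-cuda.so",
    device="cuda",
)

# Generate a fixed-length completion so decode throughput is comparable
# across models, then print the runtime's prefill/decode tok/s stats.
cm.benchmark_generate(prompt="Describe the Jetson AGX Orin.", generate_length=256)
print(cm.stats())
```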
Environment

- How you installed MLC-LLM (conda, source): source
- How you installed TVM-Unity (pip, source): source
- TVM Unity Hash Tag (`python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"`, applicable if you compile models): f8b2ff1bfea46ddbab905267f782c3fd9482a470