> Haven't tested `q4f16_0` quantization yet.
Similar pattern found in `q4f16_0` quantization: SLM GPU usage is ~83%, legacy GPU usage is ~89%. Notably, SLM barely reaches 6.5 tokens/s with `q4f16_0`, while legacy reaches 11.5 tokens/s (prefill 42 tokens + decode 300 tokens).
This might need a double check before deprecating the legacy workflow (#1886).
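
For reference, a quick back-of-the-envelope check of the relative decode regression implied by the numbers above (plain Python; the two throughput values are the ones reported in this comment):

```python
# Decode throughput reported above for q4f16_0 on the same workload
# (42-token prefill + 300-token decode on an 8 Gen 3 device).
legacy_tok_s = 11.5  # legacy mlc_llm.build flow
slm_tok_s = 6.5      # new SLM flow

regression = 1.0 - slm_tok_s / legacy_tok_s
print(f"SLM decode is ~{regression:.0%} slower than legacy for q4f16_0")
# -> SLM decode is ~43% slower than legacy for q4f16_0
```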
I can confirm that I'm experiencing this as well.
The SLM flow uses the PagedAttention kernel, which causes the performance regression since that kernel is not tuned for Android.
The attention regression should be fixed by #1915. In the meantime, given that fixes for all the known gaps are ready, we will proceed with the deprecation so the follow-up steps can move forward. @srkreddy1238, it would be great if you could also help bring some of the `q4f16_0` optimizations to the new flow.
Thank you @sbwww for reporting; we'd love to continue working together on improving the new flow.
We find `q4f16_0` more convenient for Adreno (though we initially tried improving `q4f16_1`). The `q4f16_0`-compatible dlight schedules (GEMV, MatMul) have now been improved, but we fell short of the earlier performance because of the PagedAttention regression. Let me give https://github.com/mlc-ai/mlc-llm/pull/1915 a try.
I feel the heat with so many options here; I will start sending out the internal optimizations for Adreno soon.
Love to see these land, @srkreddy1238! I know there can be slight setbacks due to the migration, but hopefully the new engine will offer a path toward more useful things like speculative decoding and more.
Here are the PRs for the Adreno improvements in the SLM flow:
- https://github.com/mlc-ai/mlc-llm/pull/2215: Enable OpenCL host-pointer usage for Android builds
- https://github.com/mlc-ai/mlc-llm/pull/2214: Restore the CLI utility to work with Android targets (no Python here)
- https://github.com/mlc-ai/mlc-llm/pull/2216: Thread limit update for Adreno OpenCL
- https://github.com/apache/tvm/pull/16929: Enable host-pointer (memory-mapped) based data copy
- https://github.com/mlc-ai/relax/pull/319: Schedule improvements for the q4f16_0 scheme
Put together, these changes can push decode performance up to 40% above the current baseline on Snapdragon 8 Gen 3.
Requesting review.
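
To illustrate what the host-pointer (memory-mapped) PRs above are about, here is a minimal, generic sketch of the zero-copy idea using pyopencl. This is not the TVM/MLC-LLM implementation; it only shows the underlying OpenCL pattern of allocating a buffer the driver can back with host-visible memory and mapping it instead of issuing an explicit copy.

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

nbytes = 1 << 20
# Ask the driver for a buffer it may back with host-visible memory.
# On unified-memory GPUs such as Adreno this avoids a separate staging copy.
buf = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, size=nbytes)

# Map the buffer into the host address space and write through the mapping
# instead of doing an explicit enqueue_copy (i.e., a zero-copy upload).
host_view, _evt = cl.enqueue_map_buffer(
    queue, buf, cl.map_flags.WRITE, 0, (nbytes,), np.uint8)
host_view[:] = 0
del host_view  # dropping the array releases the mapping
```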
❓ General Questions
I tried both workflows, `mlc_llm.build` (legacy) and `mlc_chat compile` (SLM), to compile and deploy the Llama2 7B `q4f16_1` model on a Qualcomm 8 Gen 3 device. With the same input (42 tokens prefilled), the decoding speed diverges between legacy and SLM: SLM is ~10% slower than legacy.
I profiled the GPU usage with https://kite.mi.com/ as follows. It seems that the model compiled with SLM (~80%) has lower GPU usage than legacy (~87%) at the decoding stage.
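
For anyone trying to reproduce the comparison, decode throughput here is just decoded tokens over wall-clock decode time. A small, runtime-agnostic helper might look like the sketch below; the `decode_step` callable is a placeholder for one decode call of whichever build is being measured, not an MLC-LLM API.

```python
import time
from typing import Callable

def decode_tokens_per_sec(decode_step: Callable[[], None],
                          num_tokens: int = 300) -> float:
    """Time `num_tokens` autoregressive decode steps and return tokens/s.

    `decode_step` stands in for one decode call of the build under test
    (legacy or SLM); it is not an MLC-LLM API.
    """
    start = time.perf_counter()
    for _ in range(num_tokens):
        decode_step()
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed
```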
Possibly related to
I assume that #1536 contributes to the faster speed of the legacy workflow. Haven't tested `q4f16_0` quantization yet.