mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Question][Android] Lower speed and GPU usage with SLM than legacy workflow on Android Adreno GPU #1896

Closed: sbwww closed this issue 8 months ago

sbwww commented 8 months ago

❓ General Questions

I tried both workflows, mlc_llm.build (legacy) and mlc_chat compile (SLM), to compile and deploy the Llama2 7B q4f16_1 model on a Qualcomm Snapdragon 8 Gen 3 device.
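For concreteness, here is a minimal sketch of the two invocations, driven from Python. The flag names are assumed from the docs of that era, so check the `--help` of your installed version:

```python
import subprocess

# Legacy flow: python -m mlc_llm.build
# (flags assumed from the legacy docs; verify against your version)
subprocess.run(
    ["python3", "-m", "mlc_llm.build",
     "--model", "Llama-2-7b-chat-hf",
     "--quantization", "q4f16_1",
     "--target", "android"],
    check=True,
)

# SLM flow: mlc_chat compile on the generated mlc-chat-config.json
# (paths and flags assumed for illustration)
subprocess.run(
    ["mlc_chat", "compile",
     "dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json",
     "--device", "android",
     "-o", "dist/libs/Llama-2-7b-chat-hf-q4f16_1-android.tar"],
    check=True,
)
```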

With the same input (42 tokens prefilled), the decoding speed diverges between the two: SLM is ~10% slower than legacy.

I profiled GPU usage with https://kite.mi.com/ as follows. The model compiled with SLM shows lower GPU usage (~80%) than legacy (~87%) during the decode stage.

|        | legacy               | SLM                 |
|--------|----------------------|---------------------|
| decode | 425 tokens, 10.1 t/s | 254 tokens, 9.0 t/s |
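As a quick sanity check on the ~10% figure from the table above:

```python
legacy_tps = 10.1  # decode throughput, tokens/s (legacy flow)
slm_tps = 9.0      # decode throughput, tokens/s (SLM flow)

slowdown = 1 - slm_tps / legacy_tps
print(f"SLM is {slowdown:.1%} slower than legacy")  # -> ~10.9%
```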

Possibly related: I assume that #1536 contributes to the faster speed of the legacy workflow. I haven't tested q4f16_0 quantization yet.

sbwww commented 8 months ago

> Haven't tested q4f16_0 quantization yet.

A similar pattern shows up with q4f16_0 quantization: SLM GPU usage is ~83%, legacy is ~89%.

Notably, SLM barely reaches 6.5 t/s with q4f16_0, while legacy reaches 11.5 t/s (42 tokens prefilled + 300 tokens decoded).


This might need a double-check before deprecating the legacy workflow (#1886).

neobaud commented 8 months ago

I can confirm I am seeing this as well.

spectrometerHBH commented 8 months ago

The SLM flow uses the PagedAttention kernel, which causes the performance regression since it has not been tuned for Android.

https://github.com/mlc-ai/mlc-llm/pull/1915
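For readers unfamiliar with the term, here is a rough Python schematic (not the actual MLC TIR kernel) of the block-table indirection that paged attention adds, which is the part that needs per-backend tuning:

```python
import numpy as np

# Schematic of a paged KV-cache lookup (illustration only, not the MLC kernel).
# Keys/values live in fixed-size pages drawn from a shared pool; a per-sequence
# block table maps logical token positions to physical pages. That extra
# gather/indirection is what a dense attention kernel does not have.

PAGE_SIZE = 16   # tokens per page
NUM_PAGES = 64   # physical pages in the pool
HEAD_DIM = 128

k_pool = np.zeros((NUM_PAGES, PAGE_SIZE, HEAD_DIM), dtype=np.float16)

def gather_keys(block_table, seq_len):
    """Collect the keys for one sequence of length seq_len."""
    keys = []
    for pos in range(seq_len):
        page = block_table[pos // PAGE_SIZE]  # logical -> physical page
        slot = pos % PAGE_SIZE                # offset within the page
        keys.append(k_pool[page, slot])
    return np.stack(keys)

# e.g. a 42-token prefill spread over 3 (possibly non-contiguous) pages:
print(gather_keys(block_table=[7, 3, 11], seq_len=42).shape)  # (42, 128)
```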

tqchen commented 8 months ago

The attention regression should be fixed by #1915. In the meantime, given that fixes for all the known gaps are ready, we will proceed with the deprecation so the follow-up steps can move forward. @srkreddy1238, it would be great if you could also help bring some of the q4f16_0 optimizations to the new flow.

tqchen commented 8 months ago

Thank you @sbwww for reporting, and we would love to continue working together on improving the new flow.

srkreddy1238 commented 7 months ago

We find q4f16_0 more convenient for Adreno (though we initially tried improving q4f16_1). The q4f16_0-compatible dlight schedules (GEMV, MatMul) have now been improved, but we fell short of the earlier performance because of the PagedAttention regression. Let me give https://github.com/mlc-ai/mlc-llm/pull/1915 a try.
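For context, here is a plain-NumPy schematic of the dequantize + GEMV that these schedules compute. This shows the arithmetic only; the real q4f16_0 packing and layout differ:

```python
import numpy as np

# Group-wise 4-bit dequantize + GEMV (illustration only): each group of
# GROUP weights shares one fp16 scale; codes 0..15 are recentered around 8.

GROUP = 32  # elements sharing one scale

def dequant_gemv(q, scales, x):
    """q: (out, in) uint8 with 4-bit codes 0..15; scales: (out, in//GROUP) fp16."""
    w = q.astype(np.float16) - 8.0           # recenter the 4-bit codes
    w *= np.repeat(scales, GROUP, axis=1)    # apply the per-group scale
    return w @ x                             # the GEMV itself

q = np.random.randint(0, 16, size=(8, 64), dtype=np.uint8)
scales = np.random.rand(8, 64 // GROUP).astype(np.float16)
x = np.random.rand(64).astype(np.float16)
print(dequant_gemv(q, scales, x).shape)  # (8,)
```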

There is a lot in flight with so many options here; I will start sending out our internal Adreno optimizations soon.

tqchen commented 7 months ago

Love to see these land, @srkreddy1238! I know there can be slight setbacks during the migration, but hopefully the new engine offers a path toward more useful features such as speculative decoding.

srkreddy1238 commented 6 months ago

Here are the PRs for the Adreno improvements in the SLM flow.

- https://github.com/mlc-ai/mlc-llm/pull/2215: enable OpenCL host pointer usage for Android builds
- https://github.com/mlc-ai/mlc-llm/pull/2214: restore the CLI utility to work with Android targets (no Python here)
- https://github.com/mlc-ai/mlc-llm/pull/2216: thread limit update for Adreno OpenCL
- https://github.com/apache/tvm/pull/16929: enable host-pointer (memory-mapped) based data copy
- https://github.com/mlc-ai/relax/pull/319: schedule improvements for the q4f16_0 schema

All these changes put together can push decode performance up by as much as 40% over the current baseline on Snapdragon 8 Gen 3.
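To illustrate the host-pointer idea behind the first and fourth PRs: Adreno shares physical memory between CPU and GPU, so a buffer created with `CL_MEM_ALLOC_HOST_PTR` can be filled through a mapped pointer instead of a staged copy. A pyopencl sketch of the pattern (illustrative only; the actual change lives in TVM's OpenCL runtime, not application code):

```python
import numpy as np
import pyopencl as cl

# ALLOC_HOST_PTR + map/unmap on a unified-memory GPU (illustration only).
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

n = 1 << 20
buf = cl.Buffer(ctx,
                cl.mem_flags.READ_WRITE | cl.mem_flags.ALLOC_HOST_PTR,
                size=n * 4)

# Map the buffer into host address space and write through the pointer;
# on shared-memory devices like Adreno this avoids an extra staging copy.
mapped, _evt = cl.enqueue_map_buffer(queue, buf, cl.map_flags.WRITE,
                                     0, (n,), np.float32)
mapped[:] = np.arange(n, dtype=np.float32)
mapped.base.release(queue)  # unmap; the data is now visible to the device
```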

Requesting review.