mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Feature Request] Do you have any plan to support CPU backend on Android devices? #1106

Open xukui1203 opened 9 months ago

xukui1203 commented 9 months ago

🚀 Feature

I know there is an OpenCL backend on the Android platform, but on many Android devices the GPU is already occupied by other subsystems, such as the display. So we need to use the CPU to run the LLM.

tqchen commented 9 months ago

As of now, our focus has been on GPUs and possibly NPUs.

CPU can in theory be supported, since TVM has CPU backends, so we also welcome contributions in that direction.
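For anyone who wants to experiment, here is a minimal sketch of the underlying TVM mechanism: cross-compiling a single kernel for an Android aarch64 CPU. This is not MLC-LLM's build pipeline, just the basic TVM path; it assumes a TVM build with LLVM enabled and the `TVM_NDK_CC` environment variable pointing at the NDK clang wrapper:

```python
# Hedged sketch (not the official MLC-LLM pipeline): cross-compile one TVM
# kernel for an Android aarch64 CPU via the LLVM backend and the Android NDK.
# Requires TVM built with LLVM support and TVM_NDK_CC set to the NDK clang.
import tvm
from tvm import te
from tvm.contrib import ndk

n = 1024
A = te.placeholder((n, n), name="A", dtype="float32")
B = te.placeholder((n, n), name="B", dtype="float32")
k = te.reduce_axis((0, n), name="k")
C = te.compute((n, n), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
# Standard TVM target string for an Android CPU; +neon enables SIMD.
target = tvm.target.Target("llvm -mtriple=aarch64-linux-android -mattr=+neon")
mod = tvm.build(s, [A, B, C], target=target)
# ndk.create_shared invokes the NDK toolchain to produce an Android .so.
mod.export_library("matmul_cpu.so", fcompile=ndk.create_shared)
```

The resulting `.so` can be loaded on-device through the TVM runtime. Actual LLM performance would hinge on scheduling and quantized kernels, which is where most of the work lies.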

xukui1203 commented 9 months ago

Thanks for your reply. I tried it, and the CPU backend does work on the Android platform, but it is really very slow. We are now trying the Arm Compute Library (ACL) to see whether it gets better performance.

Nick-infinity commented 8 months ago

What performance are you getting on the TVM CPU and TVM GPU backends? If your Arm Compute Library implementation is ready, could you please share its performance as well?

xukui1203 commented 8 months ago

For Vicuna-7B, it is about 8 tokens/s on the GPU, but the CPU needs about 50 s to decode one token (roughly 400× slower). We still cannot bring up the Arm Compute Library.

FabianSchuetze commented 8 months ago

This is a great question indeed. Also, thanks for this wonderful repo.

Do you know how I can choose which accelerator is used, @tqchen? I tried to follow the code, but could not see where the NPU or GPU is selected as the accelerator.

junrushao commented 8 months ago

We are using GPUs on Android. CPUs, as indicated in this thread, are likely too slow to support an LLM meaningfully.
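For reference, the accelerator is selected via a device argument in the Python API. A minimal sketch, assuming a current mlc_llm install (older releases exposed the same argument on mlc_chat.ChatModule); the model id below is just an example:

```python
# Minimal sketch of device selection with the mlc_llm Python API.
# Assumes a current mlc_llm package; model id is an example, not prescriptive.
from mlc_llm import MLCEngine

engine = MLCEngine(
    "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",  # example prebuilt model
    device="opencl",  # e.g. "cuda", "metal", "vulkan", "opencl", or "auto"
)
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "hello"}],
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)
engine.terminate()
```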

FabianSchuetze commented 8 months ago

Thanks, @junrushao, for the comment. What about accelerators other than GPUs? NPUs or DSPs come to mind.

xukui1203 commented 8 months ago

We used the Hexagon DSP backend to test TinyLlama-1.1B (q4f16_0). It is too slow: it needs 20 seconds to decode one token. There is also the HTP on the Qualcomm DSP, but it does not support fp16; I think that if we could use the HTP, it might get better performance.
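On the fp16 limitation: one common workaround is a quantization scheme that keeps all arithmetic in fp32, such as the q4f32 family of MLC-LLM quantization modes. A minimal numpy sketch of the idea (the group size and layout here are illustrative assumptions, not MLC-LLM's exact storage format):

```python
# Illustrative group-wise 4-bit quantization with fp32 scales (a "q4f32"-style
# layout), which avoids fp16 arithmetic entirely. Group size of 32 is an
# assumption for illustration only.
import numpy as np

def quantize_q4_f32(w: np.ndarray, group_size: int = 32):
    w = w.reshape(-1, group_size)
    # 4-bit signed range is [-8, 7]; scale each group by its max magnitude.
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # All math stays in fp32; no fp16 hardware support is needed.
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_q4_f32(w)
w_hat = dequantize(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```

Because both the scales and the dequantized values are fp32, nothing in this path requires fp16 support on the accelerator.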

liangzelang commented 3 months ago

> We used the Hexagon DSP backend to test TinyLlama-1.1B (q4f16_0). It is too slow: it needs 20 seconds to decode one token.

Was the Hexagon DSP test you mention run with MLC-LLM or with Qualcomm tools (like QNN/SNPE)? I am also very interested in your ARM CPU test. Could you share more details, such as test code or technical articles?

xukui1203 commented 3 months ago

> Was the Hexagon DSP test you mention run with MLC-LLM or with Qualcomm tools (like QNN/SNPE)?

We used MLC-LLM to run the LLM on the Hexagon DSP, and the performance is better now. It is easy to run the LLM on the CPU, but I am sorry, I did not save the patch after the test.

liangzelang commented 2 months ago

ok, thx.

junwenZhang commented 5 days ago

@xukui1203 I have great interest in your CPU/DSP research. Could you share your test code, technical articles, etc.?

Yemaoxin commented 5 days ago

mllm used the Hexagon NPU to achieve 1000 tokens/sec prefill: https://github.com/UbiquitousLearning/mllm. Maybe it can work.