mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Feature Request] run the LLM model on the Qualcomm Hexagon NPU in Android OS #1689

Open taeyeonlee opened 7 months ago

taeyeonlee commented 7 months ago

🚀 Feature

Hello, is it possible to run an LLM (Llama 2 7B, quantized) on the Qualcomm Hexagon NPU in Android OS? How can the model be run on the Hexagon NPU in Android OS?

Motivation

Qualcomm claims that the Hexagon NPU's performance is up to 98% faster.

Alternatives

Additional context

Hzfengsy commented 7 months ago

I tried a bit but failed, since Hexagon is not open to developers. To be specific:

  1. 32-bit RTOS with 4GB memory limitation. (Qualcomm can use tricks to support more memory, but we cannot)
  2. No public HMX API and Docs
  3. No optimization docs for HVX.

We can leave this issue open, but it would be hard to support.

taeyeonlee commented 7 months ago

@Hzfengsy I'll ask Qualcomm for the info.

ningpengtao-coder commented 7 months ago

@Hzfengsy @taeyeonlee Would you consider indirect support through Android NNAPI instead of low-level API support? NNAPI automatically switches between the CPU, GPU, and NPU, although many optimization methods may not be usable. At present, the resources of mobile devices are limited, so full utilization needs to be considered (running multiple models on the phone, such as LLM, ASR, TTS, etc.).

Hzfengsy commented 7 months ago

@ningpengtao-coder Thanks for your suggestion. That's a good approach to running models on Android. However, I (as well as the team) do not have extra bandwidth to support NNAPI in TVM and MLC-LLM. Love to see the power of community in this interesting area :)

FdyCN commented 7 months ago

@Hzfengsy I'm a little bit confused, because TVM does have a Hexagon backend codegen, and mlc-llm is based on TVM Unity. So why can't mlc-llm lower to Hexagon target code? Is there anything unsupported along the path "Relax --> TIR --> Hexagon target code"?
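For context, here is a minimal sketch (not from this thread) of what the end of that lowering path looks like with TVM's public Hexagon target. It assumes a TVM build with the Hexagon LLVM toolchain; the function name `vec_add` and the `v68` architecture string are illustrative choices, and running the result additionally requires the Hexagon SDK and RPC launcher.

```python
# Hypothetical sketch: lowering a trivial TIR function to the Hexagon target.
import tvm
from tvm.script import tir as T

@T.prim_func
def vec_add(A: T.Buffer((1024,), "float16"),
            B: T.Buffer((1024,), "float16"),
            C: T.Buffer((1024,), "float16")):
    for i in range(1024):
        with T.block("C"):
            vi = T.axis.spatial(1024, i)
            C[vi] = A[vi] + B[vi]

# Relax -> TIR lowering ends in the same place: an IRModule built for a
# Hexagon target string such as v68 (HVX-capable).
target = tvm.target.hexagon("v68")
lib = tvm.build(vec_add, target=target)  # emits Hexagon object code via LLVM
```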

shifeiwen commented 7 months ago

@FdyCN The problem seems to be that the HTP backend has many limitations, including the amount of memory that can be requested and the memory speed. However, Qualcomm has claimed in some videos that it can get a 7B model to 20 tok/s. I have made some attempts to run a single-layer transformer on the QNN HTP backend and the time exceeds 100 ms. I don't know how Qualcomm achieved it, because on mobile 20 tok/s on HTP would be a good result, since it would free up the GPU.
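As a rough sanity check on that 20 tok/s figure, here is a back-of-the-envelope calculation. It rests on my own assumptions (decode is weight-bandwidth bound, 4-bit weight-only quantization, every weight read once per token); it is not from this thread or from Qualcomm's material.

```python
# Assumption: each decoded token reads every quantized weight once from memory.
params = 7e9             # Llama 2 7B parameters
bytes_per_param = 0.5    # ~4-bit weight-only quantization
weights_gb = params * bytes_per_param / 1e9   # ~3.5 GB of weights
tokens_per_s = 20                             # the advertised decode rate
bandwidth_gb_s = weights_gb * tokens_per_s    # ~70 GB/s of weight traffic
print(f"needs roughly {bandwidth_gb_s:.0f} GB/s effective memory bandwidth")
```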

shifeiwen commented 7 months ago

I tried to run a 1.1B Llama on the Hexagon backend before and it was very slow, because I did not use CPU scheduling and only added HVX compilation flags when compiling with LLVM, but I think those flags did not actually affect codegen.
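That matches how HVX generally behaves: the vector units are only engaged when the generated loops are actually vectorized, so a target flag by itself changes little. Below is a hypothetical sketch (my assumption, not shifeiwen's code) of vectorizing a TIR loop before building for Hexagon; the function name and split factor are illustrative.

```python
# Hypothetical sketch: HVX lanes are only used when the TIR loop is vectorized;
# -mattr/target flags alone do not restructure the loops.
import tvm
from tvm.script import tir as T

@T.prim_func
def mul_scalar(A: T.Buffer((4096,), "float16"), B: T.Buffer((4096,), "float16")):
    for i in range(4096):
        with T.block("B"):
            vi = T.axis.spatial(4096, i)
            B[vi] = A[vi] * T.float16(2.0)

sch = tvm.tir.Schedule(mul_scalar)
i, = sch.get_loops(sch.get_block("B"))
io, ii = sch.split(i, factors=[None, 64])  # 64 x fp16 = 128 bytes, one HVX vector
sch.vectorize(ii)                          # emit vector loads/stores for HVX
lib = tvm.build(sch.mod, target=tvm.target.hexagon("v68"))
```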

FdyCN commented 7 months ago

> I tried to run a 1.1B Llama on the Hexagon backend before and it was very slow, because I did not use CPU scheduling and only added HVX compilation flags when compiling with LLVM, but I think those flags did not actually affect codegen.

@shifeiwen Thank you so much for the reply; I haven't tested TVM Hexagon codegen performance. According to your experiment, it seems that HVX auto-tuning cannot produce high-performance kernels? So mlc-llm on an HVX-only backend can work, but slowly? Am I right?

tqchen commented 7 months ago

This is something we would ideally like to enable, and indeed we need to overcome some of the hurdles mentioned. We can keep this issue open to track the status; getting things into a runnable state is a good first step.

shifeiwen commented 7 months ago

@FdyCN Yes, there are currently some ways to get mlc running on the Hexagon backend, but in my tests it is very slow. Each token of 1.1B Llama takes more than 60 s (there is a lot of optimization work I did not do, such as better CPU scheduling or actually exploiting HVX features). PS: loading the 1.1B TinyLlama model takes 10 minutes, and the memory speed of the DSP is very slow. I wanted to use some shared-memory methods, but that was not completed.

FdyCN commented 7 months ago

> @FdyCN Yes, there are currently some ways to get mlc running on the Hexagon backend, but in my tests it is very slow. Each token of 1.1B Llama takes more than 60 s (there is a lot of optimization work I did not do, such as better CPU scheduling or actually exploiting HVX features). PS: loading the 1.1B TinyLlama model takes 10 minutes, and the memory speed of the DSP is very slow. I wanted to use some shared-memory methods, but that was not completed.

@shifeiwen Thank you for the reply; your test results are really helpful for me. I think deploying an LLM on HVX through TVM is maybe not the best choice currently.
Could you please share your optimization tests later? Really appreciate it!

hmartinez82 commented 4 months ago

Did the whole situation with public APIs and docs change? What about the Neural Processing SDK / AI Engine Direct SDK (they are actually the same SDK download now)?

https://developer.qualcomm.com/software/qualcomm-neural-processing-sdk and https://developer.qualcomm.com/software/qualcomm-ai-engine-direct-sdk

MrRace commented 3 months ago

wish+1008611

pro9code commented 3 months ago

For anyone interested in doing this, look at https://developer.qualcomm.com/downloads/halide-hvx-training?referrer=node/6116

Yemaoxin commented 1 month ago

This is a good request. If anyone has a better way to use the NPU for LLM inference, please share some ideas.

fwtan commented 1 week ago

In case this is of interest, we provide an example for deploying TinyLlaMA-1.1B-Chat on Qualcomm Hexagon NPU (SM8650): https://github.com/saic-fi/MobileQuant/tree/main/capp. However, our solution is pretty ad-hoc compared to MLC-LLM.