Hi,
I tried to run a 7B INT4 LLM on the NPU, but the performance was poor: only about 2 tokens/s. One possible reason I found is that the NPU keeps loading weights (e.g., the linear-layer weights in setWeights) throughout the entire inference. Is this due to an NPU hardware limitation that prevents loading all of the model's weights at once?
Do you have any suggestions for optimizing LLM inference on the NPU? Any comments or advice are appreciated, thanks!
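For what it's worth, a back-of-envelope calculation suggests that re-streaming the weights for every generated token could by itself explain a rate of roughly 2 tokens/s. The sketch below assumes 7B parameters at INT4 (0.5 bytes per parameter) and a hypothetical host-to-NPU transfer bandwidth of ~7 GB/s; the bandwidth figure is an assumption, not a measured value.

```python
# Back-of-envelope: if all weights must be transferred to the NPU for
# every generated token, transfer bandwidth alone caps the token rate.
params = 7e9                    # 7B parameters
bytes_per_param = 0.5           # INT4 quantization
weight_bytes = params * bytes_per_param    # 3.5 GB per full pass
bandwidth = 7e9                 # bytes/s, assumed host-to-NPU link speed
tokens_per_s = bandwidth / weight_bytes
print(f"{tokens_per_s:.1f} tokens/s")      # → 2.0 tokens/s
```

If this estimate is in the right ballpark, the bottleneck is the repeated weight transfer rather than NPU compute, which is why keeping weights resident on the device would matter so much.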
You are making an excellent point. We are working on the NPU driver to enable remote tensors and minimize such data transfers. I'll keep this issue up to date.