intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library
Apache License 2.0

Enable graph mode for LLM inference #89

Open · xduzhangjiayu opened this issue 1 month ago

xduzhangjiayu commented 1 month ago

Hi, I have read "examples\NPU compilation tutorial.ipynb" about graph mode and eager mode, which helped me a lot. I was wondering if I could use graph mode in LLM inference to reduce the weight copying between CPU and NPU, so I simply changed the return value of the function `horizontal_fusion_linear` to `return fx_model.to('npu')`. After converting the model, inference fails with: `AttributeError: 'Tensor' object has no attribute 'is_contiguous'`. It seems this operation cannot be performed on the NPU? If I want to use graph mode in LLM inference, is the above change correct?
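
For reference, here is roughly what my change looks like. This is only a sketch: the real `horizontal_fusion_linear` does more than this, the fusion body is elided, and only the modified return statement matters here.

```python
import torch
import torch.fx


def horizontal_fusion_linear(model: torch.nn.Module) -> torch.nn.Module:
    fx_model = torch.fx.symbolic_trace(model)
    # ... horizontal fusion of parallel nn.Linear nodes happens here ...
    fx_model.recompile()
    # original: return fx_model
    return fx_model.to("npu")  # my change: move the whole fused graph to the NPU (graph mode)
```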

Any comment or advice is appreciated, thanks !

alessandropalla commented 1 month ago

Hi, we are working toward that as well. For example, please look at https://github.com/intel/intel-npu-acceleration-library/pull/84 for a tentative implementation of it for the Phi3MLP layer. We are also waiting for the OpenVINO remote tensors feature, which would bring near performance parity between graph and kernel mode.
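
For illustration, the kind of usage #84 is aiming at looks roughly like the sketch below: offload a single Phi3MLP block to the NPU. This is tentative; the `.to("npu")` entry point and exact behavior may differ from what eventually gets merged.

```python
import torch
import intel_npu_acceleration_library  # noqa: F401  (assumed to enable the "npu" device)
from transformers.models.phi3.configuration_phi3 import Phi3Config
from transformers.models.phi3.modeling_phi3 import Phi3MLP

config = Phi3Config()                 # default Phi-3 sizes, no weight download needed
mlp = Phi3MLP(config).to("npu")       # offload the single MLP block (graph mode)

x = torch.randn(1, 128, config.hidden_size)
y = mlp(x)
print(y.shape)                        # -> torch.Size([1, 128, 3072])
```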

xduzhangjiayu commented 1 month ago

Thanks very much for the reply!

xduzhangjiayu commented 1 month ago

By the way, I've just tried the graph-mode implementation with TinyLlama-1.1B INT4, only changing "Phi3MLP" to "LlamaMLP". I found that the inference speed was not improved (already done a warm-up), and during inference the NPU memory usage goes up to 4 GB, which I think should be impossible in this case. Do you have any comment on this?
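
Roughly what I tried is sketched below. Module paths follow the Hugging Face LlamaForCausalLM layout, dtype handling is simplified (my run used INT4), and whether `.to("npu")` accepts LlamaMLP in place of Phi3MLP is exactly the open question here.

```python
import torch
import intel_npu_acceleration_library  # noqa: F401
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

for layer in model.model.layers:      # 22 decoder layers in TinyLlama-1.1B
    layer.mlp = layer.mlp.to("npu")   # offload only the LlamaMLP of each layer

inputs = tokenizer("Hello, NPU!", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```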

Finally, thanks a lot for your reply and patience. I've learned a lot from this project!

alessandropalla commented 1 month ago

I think it depends on the implementation. We found that using the vanilla `.to` method doesn't produce quantized models with the right acceleration for the NPU, and we are working on it. The memory increase is due to this, plus the fact that the MLP is compiled once for the first inference and again for the n+1 inference, because the input has a different shape. This is why kernel mode is so important for LLM inference, and why remote tensors, which allow the weights to already be allocated on the NPU, are a crucial performance step in that direction.
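
To illustrate the shape difference (an illustrative sketch, not library code; the 2048 hidden size and 128-token prompt are just example numbers for TinyLlama-1.1B):

```python
import torch

hidden_size = 2048                                # TinyLlama-1.1B hidden size
prefill_input = torch.randn(1, 128, hidden_size)  # 1st inference: the whole 128-token prompt
decode_input = torch.randn(1, 1, hidden_size)     # n+1 inference: one new token at a time

# A graph compiled for (1, 128, hidden_size) cannot serve (1, 1, hidden_size),
# so the MLP is compiled again and its weights are copied to the NPU a second
# time, which matches the memory growth reported above.
print(prefill_input.shape, decode_input.shape)
```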

alessandropalla commented 1 month ago

You are welcome, I'm happy to help

xduzhangjiayu commented 1 month ago

Sorry, I don't understand the "different shape" you mentioned. I think the dimensions and weights of the MLP layer are the same during inference? So after the warm-up, can the weights already be allocated on the NPU?

xduzhangjiayu commented 4 weeks ago

Hi, it seems the current version can optimize a separate Phi3MLP layer using `.to("npu")`. I was curious whether we can use `.to("npu")` only for the MLP layers when running inference on an entire LLM, to speed it up. Would there be any limitations to implementing this idea (e.g. the OpenVINO backend or the NPU hardware)?

alessandropalla commented 4 weeks ago

We are working on this by using remote tensors (WIP PR here: https://github.com/intel/intel-npu-acceleration-library/pull/97). That would help remove all the overhead. The end goal is to use `.to('npu')` to move tensors and models to the NPU, just like you do with CUDA.
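
The intended usage once remote tensors land would look something like the sketch below. This is the target workflow, not something guaranteed to work in the current release, and the model name is only an example.

```python
import torch
import intel_npu_acceleration_library  # noqa: F401  (assumed to register the "npu" device)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model = AutoModelForCausalLM.from_pretrained(model_id).to("npu")  # analogous to .to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello", return_tensors="pt").to("npu")
out = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```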

xduzhangjiayu commented 4 weeks ago

That would be great if we can load the entire model onto the NPU by using remote tensors. Thanks for the reply!