Open ayf7 opened 2 months ago
@WeldonWangwang Please help check whether there is additional info needed to answer or investigate. Thanks!
Hi @ayf7, we are working to reproduce and identify the issue.
BTW, I would like to know whether you need to run the model's nodes separately on two devices to improve performance. Note that manually setting the affinities of different nodes in an LLM may split the model into multiple subgraphs. HETERO can automatically split the model across two devices when GPU memory cannot fully hold it: use HETERO:GPU,CPU and set MODEL_DISTRIBUTION_POLICY to PIPELINE_PARALLEL.
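A minimal sketch of that configuration, assuming the OpenVINO Python API (the IR path is a placeholder, and the string-valued property form is used for brevity):

```python
# Hypothetical sketch: ask HETERO to pipeline-split an LLM across GPU and CPU
# when GPU memory cannot hold the whole model. Paths are placeholders.
import openvino as ov

core = ov.Core()
model = core.read_model("llama3.xml")  # placeholder IR path

compiled = core.compile_model(
    model,
    "HETERO:GPU,CPU",
    {"MODEL_DISTRIBUTION_POLICY": "PIPELINE_PARALLEL"},
)
```

With this policy, the HETERO plugin decides the split point itself, rather than relying on per-node affinities set by hand.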
Hi @WeldonWangwang ,
Thanks for the response. So actually I am doing research on heterogeneous computing, and this is why I am experimenting with the HETERO plugin in particular - looking to see how different operations can be assigned to different devices (in my case, CPU, iGPU, and NPU) to help with parallelization and speedup.
@ayf7 Could you please try the step below?
ov::hint::model_distribution_policy(ov::hint::ModelDistributionPolicy::PIPELINE_PARALLEL)
which uses HETERO to split the model. If so, is this option suitable for your use case? @ayf7 BTW, it would be great if you could try the above configuration on the OpenVINO 2024.3 release. Thanks!
OpenVINO Version
2024.2.0
Operating System
Other (Please specify in description)
Device used for inference
HETERO
Framework
PyTorch
Model used
Llama3 (w/ custom modifications)
Issue description
I am currently trying to compile a customized Llama3 model heterogeneously between CPU and GPU. More specifically, I am using a modified Llama3-8B model and trying to push some operations in the last layer onto GPU, with the rest on CPU.
Here is a snippet of the driver code I am using for the heterogeneous configuration:
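(The snippet itself is not shown inline; a minimal sketch of such a driver setup, assuming the OpenVINO Python API, might look like the following. The IR path and the layer-name check are placeholders, not the author's actual code.)

```python
# Hypothetical sketch: assign the last decoder layer's ops to GPU and
# everything else to CPU via rt_info "affinity", then compile with HETERO.
import openvino as ov

core = ov.Core()
model = core.read_model("llama3_modified.xml")  # placeholder IR path

for op in model.get_ops():
    # The "affinity" rt_info entry tells the HETERO plugin which device
    # should execute the op; the name check is a placeholder for the real
    # selection logic over last-layer operations.
    device = "GPU" if "layers.31" in op.get_friendly_name() else "CPU"
    op.get_rt_info()["affinity"] = device

compiled = core.compile_model(model, "HETERO:CPU,GPU")
```

Splitting by hand like this creates a CPU/GPU boundary at every affinity change, which is where the subgraph-transition errors described below can arise.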
The model successfully compiles. However, upon inference, this results in the following error listed below [OUTPUT 1].
Using heterogeneous visualization, I was able to visualize the subgraphs. I believe the underlying issue comes from the following:
(What's interesting is nowhere in my OpenVINO IR format is there a tensor input of [1, 128, 8, 1, 128].)
I suspected it had something to do with the CPU -> GPU boundary, so I reassigned those operations (the Unsqueeze ops above), as well as some others in this layer. This time it passed; however, there were more issues further along. [OUTPUT 2]
After compilation, I printed the outputs of the compiled model and noticed that the output shapes are incorrect: they were somehow converted to dynamic tensor/output shapes.
OS: Ubuntu 22.04 LTS
XML file (bin not supported) (https://github.com/openvinotoolkit/openvino/assets/113222263/2c6736eb-e221-4370-a1a0-6359b1ba3833) model-xml.zip
Step-by-step reproduction
The .bin file for the IR model is too large to attach; I am happy to email the files if requested. I've provided the .xml file above if you want to see the layout.
Relevant log output
Issue submission checklist