openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

[Bug]: Manual Heterogeneous configuration incorrectly changing the shape #25418

Status: Open · ayf7 opened 2 months ago

ayf7 commented 2 months ago

OpenVINO Version

2024.2.0

Operating System

Other (Please specify in description)

Device used for inference

HETERO

Framework

PyTorch

Model used

Llama3 (w/ custom modifications)

Issue description

I am currently trying to compile a customized Llama3 model heterogeneously between CPU and GPU. More specifically, I am using a modified Llama3-8B model and trying to push some operations in the last layer onto the GPU, with the rest on the CPU.

Here is a snippet of the driver code I am using for the heterogeneous configuration:

    # Assign every node in the last decoder layer (layers.31) to GPU
    # and all remaining nodes to CPU via the "affinity" runtime attribute.
    for node in llama_backend.get_ops():
        name = node.get_friendly_name()
        if "layers.31" in name:
            node.get_rt_info()["affinity"] = "GPU"
        else:
            node.get_rt_info()["affinity"] = "CPU"

The model compiles successfully. However, inference fails with the error listed below as [OUTPUT 1].
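
For context, the driver then compiles and runs the partitioned model roughly as follows. This is a sketch rather than my exact code: `llama_backend` is the `ov.Model` from the snippet above, and `example_inputs` is a hypothetical dict standing in for the real token and KV-cache tensors.

    import openvino as ov

    core = ov.Core()

    # Compile the manually partitioned model; the per-node "affinity"
    # rt_info set above determines the CPU/GPU subgraph split.
    compiled_model = core.compile_model(llama_backend, "HETERO:GPU,CPU")

    # example_inputs is a hypothetical {input_name: numpy.ndarray} dict.
    results = compiled_model(example_inputs)  # [OUTPUT 1] is raised here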

Using heterogeneous visualization, I was able to inspect the subgraphs. I believe the underlying issue comes from the following: [graph_1: subgraph visualization attachment]

(What's interesting is that nowhere in my OpenVINO IR is there a tensor input of shape [1, 128, 8, 1, 128].)

I suspected that it had something to do with the CPU -> GPU boundary, so I assigned those operations above (Unsqueeze), as well as some others in this layer, to GPU too. This time compilation and inference passed; however, there were more issues further along. [OUTPUT 2]

[graph_2: subgraph visualization attachment]

After compilation, I printed out the outputs of the compiled model and noticed that the output shapes are not correct: one of the outputs (cache_v_31_out) somehow acquired a dynamic shape. [OUTPUT 3]
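
(The listing in [OUTPUT 3] below comes from printing the compiled model's output ports, roughly like this:)

    for output in compiled_model.outputs:
        print(output)  # e.g. <ConstOutput: names[logits] shape[1,1,128256] type: f32>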


OS: Ubuntu 22.04 LTS

XML file (.bin attachments are not supported): model-xml.zip (https://github.com/openvinotoolkit/openvino/assets/113222263/2c6736eb-e221-4370-a1a0-6359b1ba3833)

Step-by-step reproduction

The .bin file of the IR model is too large to attach; I am happy to email the files if requested.

I have provided the .xml file above in case you want to inspect the layout.

Relevant log output

===== OUTPUT 1 =====

[GPU] The tensor size is not equal to model, can't set input tensor with index: 5, because model input (shape=[1,128,8,1,128]) and tensor (shape=[0,0,8,1,128]) are incompatible

===== OUTPUT 2 =====

(CPU) Can't set the input tensor with index: 0, because the model input (shape=[1,1,4096]) and the tensor (shape=(0.0.4096)) are incompatible

===== OUTPUT 3 =====
outputs[
<ConstOutput: names[logits] shape[1,1,128256] type: f32>,
<ConstOutput: names[cache_k_0_out] shape[1,128,8,128] type: f32>,
...
<ConstOutput: names[cache_v_28_out] shape[1,128,8,128] type: f32>,
<ConstOutput: names[cache_v_29_out] shape[1,128,8,128] type: f32>,
<ConstOutput: names[cache_v_30_out] shape[1,128,8,128] type: f32>,
<ConstOutput: names[cache_v_31_out] shape[1,1..129,8,128] type: f32>
]>


wenjiew commented 2 months ago

@WeldonWangwang Please help check whether there is additional info needed to answer or investigate. Thanks!

WeldonWangwang commented 2 months ago

Hi @ayf7, we are working to reproduce and identify the issue.

BTW, I would like to know whether you need to run the model's nodes separately on two devices to improve performance. It should be noted that manually setting the affinities of different nodes in an LLM may divide the model into many subgraphs. HETERO can split the model across two devices automatically when GPU memory cannot hold the full model: compile with HETERO:GPU,CPU and set MODEL_DISTRIBUTION_POLICY to PIPELINE_PARALLEL, as sketched below.
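
For illustration, a minimal Python sketch of that configuration. The property spellings follow the 2024.x Python API and the set-valued policy mirrors the C++ API; treat the exact form as an assumption to check against your release.

    import openvino as ov
    import openvino.properties.hint as hints

    core = ov.Core()
    model = core.read_model("model.xml")  # .bin expected alongside

    # No manual per-node affinities: HETERO divides the model between
    # GPU and CPU using the pipeline-parallel distribution policy.
    compiled_model = core.compile_model(
        model,
        "HETERO:GPU,CPU",
        {hints.model_distribution_policy: {hints.ModelDistributionPolicy.PIPELINE_PARALLEL}},
    )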

ayf7 commented 2 months ago

Hi @WeldonWangwang ,

Thanks for the response. I am doing research on heterogeneous computing, which is why I am experimenting with the HETERO plugin in particular: I want to see how different operations can be assigned to different devices (in my case CPU, iGPU, and NPU) to help with parallelization and speedup.

wangleis commented 1 month ago

@ayf7 Could you please try the steps below?

  1. Does this custom Llama3 model run correctly on CPU alone?
  2. Does this custom Llama3 model run correctly on HETERO:GPU,CPU with ov::hint::model_distribution_policy(ov::hint::ModelDistributionPolicy::PIPELINE_PARALLEL), which lets HETERO separate the model automatically? If yes, is this option suitable for your use case? (A Python sketch of both steps follows this list.)
  3. If option 2 is not suitable for your use case, could you please share more detail on why HETERO PIPELINE_PARALLEL doesn't work for you?
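
Building on the earlier sketch, steps 1 and 2 could be checked roughly like this; `example_inputs` is again a hypothetical input dict and the tolerances are arbitrary placeholders.

    import numpy as np
    import openvino as ov
    import openvino.properties.hint as hints

    core = ov.Core()
    model = core.read_model("model.xml")  # .bin expected alongside

    # Step 1: reference run on CPU alone.
    cpu_compiled = core.compile_model(model, "CPU")
    ref = cpu_compiled(example_inputs)

    # Step 2: let HETERO split the model across GPU and CPU automatically.
    hetero_compiled = core.compile_model(
        model,
        "HETERO:GPU,CPU",
        {hints.model_distribution_policy: {hints.ModelDistributionPolicy.PIPELINE_PARALLEL}},
    )
    out = hetero_compiled(example_inputs)

    # Compare the first output (the logits) between the two runs.
    np.testing.assert_allclose(
        ref[cpu_compiled.outputs[0]],
        out[hetero_compiled.outputs[0]],
        rtol=1e-3, atol=1e-3,
    )
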
wenjiew commented 1 month ago

@ayf7 BTW, it would be great if you could try the above configuration with the OpenVINO 2024.3 release. Thanks!