openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

[Docs]: NPU Plugin high level design diagram #27512

Open junruizh2021 opened 3 days ago

junruizh2021 commented 3 days ago

Documentation link

https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_npu/README.md

Description

I have some questions about the high-level architecture diagram in the README.md that shows the OpenVINO NPU design:

I think it should also use Level Zero interfaces to load pre-compiled models, similar to the execution part on the right side of the diagram.

In reality, compilation and execution sometimes operate sequentially: the OpenVINO NPU plugin can load OpenVINO IR models, compile them, and pass them to the NPU driver for execution, or it can directly load pre-compiled blob models. I noticed that Level Zero's ze_graph can load pre-compiled models. Is this one of the points the architecture diagram is trying to convey?

Based on the code provided, we can see that ze_graph supports loading pre-compiled models through the ZE_GRAPH_FORMAT_NATIVE format:

typedef enum _ze_graph_format_t
{
    ZE_GRAPH_FORMAT_NATIVE = 0x1,                   ///< Format is pre-compiled blob (elf, flatbuffers)
    ZE_GRAPH_FORMAT_NGRAPH_LITE = 0x2,              ///< Format is ngraph lite IR

} ze_graph_format_t;

And the graph descriptor allows loading both pre-compiled blobs and IR models:

typedef struct _ze_graph_desc_t
{
    ze_structure_type_graph_ext_t stype;            ///< [in] type of this structure
    void* pNext;                                    ///< [in,out][optional] must be null or a pointer to an extension-specific structure
    ze_graph_format_t format;                       ///< [in] Graph format passed in with input
    size_t inputSize;                               ///< [in] Size of input buffer in bytes
    const uint8_t* pInput;                          ///< [in] Pointer to input buffer
    const char* pBuildFlags;                        ///< [in][optional] Null terminated string containing build flags. Options:
                                                    ///< - '--inputs_precisions="<arg>:<precision> <arg2>:<precision> ..."'
                                                    ///<   '--outputs_precisions="<arg>:<precision> <arg2>:<precision> ..."'
                                                    ///<   - Set input and output arguments precision. Supported precisions:
                                                    ///<     FP64, FP32, FP16, BF16, U64, U32, U16, U8, U4, I64, I32, I16, I8, I4, BIN
                                                    ///< - '--inputs_layouts="<arg>:<layout> <arg2>:<layout> ..."'
                                                    ///<   '--outputs_layouts="<arg>:<layout> <arg2>:<layout> ..."'
                                                    ///<   - Set input and output arguments layout. Supported layouts:
                                                    ///<     NCHW, NHWC, NCDHW, NDHWC, OIHW, C, CHW, HW, NC, CN
                                                    ///< - '--config PARAM="VALUE" PARAM2="VALUE" ...'
                                                    ///<   - compile options string passed directly to compiler
} ze_graph_desc_t;

This suggests that Level Zero provides interfaces for both compilation and execution phases, though the architecture diagram may be simplifying the relationship between these components.
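
To make the two paths concrete, here is a minimal sketch (not code from the plugin) of how the descriptor above could be filled for each format. It assumes a Level Zero context and device already exist and that the graph extension dispatch table (ze_graph_dditable_ext_t) has been retrieved from the driver; the stype constant name and the pfnCreate signature are assumptions based on the public ze_graph_ext.h, and error handling is omitted:

// Hypothetical helper: create a graph either from a pre-compiled blob
// (ZE_GRAPH_FORMAT_NATIVE) or from serialized IR (ZE_GRAPH_FORMAT_NGRAPH_LITE).
static ze_graph_handle_t create_graph(ze_graph_dditable_ext_t* ddi,
                                      ze_context_handle_t ctx,
                                      ze_device_handle_t dev,
                                      const uint8_t* data, size_t size,
                                      bool precompiled)
{
    ze_graph_desc_t desc = {};
    desc.stype = ZE_STRUCTURE_TYPE_GRAPH_DESC_PROPERTIES; // constant name assumed; check ze_graph_ext.h
    desc.pNext = nullptr;
    // A pre-compiled blob is parsed directly; serialized IR is compiled by
    // the compiler inside the driver.
    desc.format = precompiled ? ZE_GRAPH_FORMAT_NATIVE : ZE_GRAPH_FORMAT_NGRAPH_LITE;
    desc.inputSize = size;
    desc.pInput = data;
    // Build flags only apply to the compile path; "input" is a placeholder
    // argument name, shown only to illustrate the documented flag syntax.
    desc.pBuildFlags = precompiled ? nullptr : "--inputs_precisions=\"input:FP16\"";

    ze_graph_handle_t graph = nullptr;
    ddi->pfnCreate(ctx, dev, &desc, &graph); // signature per public ze_graph_ext.h
    return graph;
}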


mlyashko commented 3 days ago

@pereanub, could you please comment?

PatrikStepan commented 2 days ago

Hello! You are correct, the CompilerAdapter also uses the Level Zero API and the Level Zero graph extension API to interact with the driver: [architecture diagram attached in the original comment].

As you also found, the CompilerAdapter uses pfnCreate2 with ZE_GRAPH_FORMAT_NGRAPH_LITE when compiling a model, and pfnCreate2 with ZE_GRAPH_FORMAT_NATIVE when importing a precompiled model.
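
For context, the dispatch table that exposes pfnCreate and pfnCreate2 (the one assumed in the earlier sketch) is obtained from the driver at runtime. A minimal sketch, assuming the extension name string and table type match the ze_graph_ext.h header shipped with the NPU driver:

#include <level_zero/ze_api.h>
// ze_graph_ext.h is shipped with the NPU driver / level-zero-npu-extensions.
#include "ze_graph_ext.h"

// Sketch: retrieve the graph extension dispatch table from a Level Zero
// driver handle. The extension name string below is an assumption; the
// exact (possibly versioned) value is defined in ze_graph_ext.h.
static ze_graph_dditable_ext_t* get_graph_ddi_table(ze_driver_handle_t drv)
{
    ze_graph_dditable_ext_t* ddi = nullptr;
    ze_result_t res = zeDriverGetExtensionFunctionAddress(
        drv, "ZE_extension_graph", reinterpret_cast<void**>(&ddi));
    return (res == ZE_RESULT_SUCCESS) ? ddi : nullptr;
}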

The confusion in the diagram is caused by the name of our backend (LevelZero). This is the plugin component that binds an OpenVINO infer request to Level Zero primitives, such as command queues and command lists, and executes the model on the device using these primitives.

Historically, the NPU plugin supported multiple backends. Among them, the one capable of interacting with a Level Zero driver was called "LevelZero". Since we currently support only Level Zero drivers, we could simplify this naming in the future and update the diagram as well. We will try to avoid such confusion in the future. Thank you for your feedback!

junruizh2021 commented 2 days ago

@PatrikStepan Thanks so much for your reply. This means that if I run blob-format files directly with OpenVINO and the NPU plugin, such as the blob files in Intel/sd-1.5-controlnet-scribble-quantized, they can run directly. If I use OpenVINO IR model files, then the NPU compiler needs to perform serialization and deserialization operations first.

Is this interpretation correct?

Additionally, I have two questions to verify with you:

  1. Are the prebuilt ELF files in the NPU plugin open source? They seem to contain some non-linear operators.
  2. Does the blob file generated by the NPU driver directly include the ELF files?

PatrikStepan commented 1 day ago

Yes, your interpretation is correct. When you use (import) a precompiled model (blob), it can be parsed and executed by the driver directly. When you use an IR, the flow is the following: [flow diagram attached in the original comment].
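
At the OpenVINO API level, the two flows look roughly like the following minimal sketch; the file names, the "NPU" device string, and the export step are illustrative, not plugin internals:

#include <fstream>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;

    // IR flow: the plugin serializes the IR, the compiler inside the
    // driver compiles it, and the result is an executable graph.
    ov::CompiledModel compiled = core.compile_model("model.xml", "NPU");

    // The compiled blob can be exported and cached for later runs.
    std::ofstream out("model.blob", std::ios::binary);
    compiled.export_model(out);
    out.close();

    // Blob flow: the pre-compiled blob is handed to the driver directly
    // (ZE_GRAPH_FORMAT_NATIVE underneath); no compilation takes place.
    std::ifstream in("model.blob", std::ios::binary);
    ov::CompiledModel imported = core.import_model(in, "NPU");

    ov::InferRequest request = imported.create_infer_request();
    // ... set input tensors and call request.infer() ...
    return 0;
}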

> Are the prebuilt ELF files in the NPU plugin open source? They seem to contain some non-linear operators.

https://github.com/openvinotoolkit/npu_plugin/tree/develop is a public snapshot of the NPU Compiler, not of the NPU plugin. Yes, the name is extremely confusing, because the same repository used to contain the real plugin source code as well. That repository will soon be renamed to npu_compiler. But those SW kernels are part of the compiler (and thus the driver), not the plugin.

The blob file generated by the NPU driver includes only the prebuilt kernels used by that model. The compiler library released inside the driver contains all prebuilt kernels.

junruizh2021 commented 1 day ago

@PatrikStepan Clear explanation. So the ELF kernels a model uses will definitely be included in the blob file generated by the compiler.

But as SW kernel files, the ELF binaries can only be pre-built into the compiler by the NPU compiler developers, right?

If I, as a user or third-party developer, need to add a new SW kernel to the NPU compiler, is there a way to do this?

PatrikStepan commented 13 hours ago

> But as SW kernel files, the ELF binaries can only be pre-built into the compiler by the NPU compiler developers, right?

Correct. While we can create public snapshots of our compiler, we still need internal NPU tools to build SW kernels, and unfortunately we are not ready to publish those tools as open source. This is one of the reasons why those kernels were published as prebuilt binaries. Unfortunately, there is currently no way for you to add a new SW kernel to the compiler.