microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime

onnxruntime_genai.onnxruntime_genai.OrtException when running Phi-3-Vision ONNX model #849

Status: Open · JehanJaye opened this issue 2 weeks ago

JehanJaye commented 2 weeks ago

Describe the bug

Running python3 phi3v.py -m cuda-int4-rtn-block-32 fails with the following error:

Loading model...
Traceback (most recent call last):
  File "phi3v.py", line 66, in <module>
    run(args)
  File "phi3v.py", line 16, in run
    model = og.Model(args.model_path)
onnxruntime_genai.onnxruntime_genai.OrtException: Load model from cuda-int4-rtn-block-32/phi-3-v-128k-instruct-text.onnx failed:This is an invalid model. In Node, ("/model/layers.0/attn/GroupQueryAttention", GroupQueryAttention, "com.microsoft", -1) : ("/model/layers.0/attn/qkv_proj/MatMul/output_0": tensor(float16),"","","past_key_values.0.key": tensor(float16),"past_key_values.0.value": tensor(float16),"/model/attn_mask_reformat/attn_mask_subgraph/Sub/Cast/output_0": tensor(int32),"/model/attn_mask_reformat/attn_mask_subgraph/Gather/Cast/output_0": tensor(int32),"cos_cache": tensor(float16),"sin_cache": tensor(float16),) -> ("/model/layers.0/attn/GroupQueryAttention/output_0": tensor(float16),"present.0.key": tensor(float16),"present.0.value": tensor(float16),) , Error Node (/model/layers.0/attn/GroupQueryAttention) has input size 9 not in range [min=7, max=7].

To Reproduce

Quantized Phi-3-vision model in ONNX format on the Jetson ORIN

  1. Set up ONNX Runtime 1.16.3 for JetPack 5.1.1 with CUDA 11.4 (prebuilt tarball):
     $ wget http://jetson.webredirect.org:8000/jp5/cu114/onnxruntime-gpu-1.16.3.tar.gz
     $ mkdir ort
     $ tar -xvf onnxruntime-gpu-1.16.3.tar.gz -C ort
     $ mv ort/include/onnxruntime/onnxruntime_c_api.h ort/include/
     $ rm -rf ort/include/onnxruntime/
  2. Clone the onnxruntime-genai repository and check out commit 940bc
  3. Build it:
     $ python3 build.py --use_cuda --cuda_home /usr/local/cuda-11.4 --skip_tests --skip_csharp --parallel
  4. Install the generated wheel (a quick import check is sketched after this list):
     $ pip3 install *.whl
  5. Install the Hugging Face CLI:
     $ pip3 install huggingface-hub[cli]
  6. Download the Phi-3-vision ONNX model:
     $ huggingface-cli download microsoft/Phi-3-vision-128k-instruct-onnx-cuda --include cuda-int4-rtn-block-32/* --local-dir .
  7. Fetch the example script:
     $ wget https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3v.py
  8. Run inference:
     $ python3 phi3v.py -m cuda-int4-rtn-block-32
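As a post-install sanity check after step 4, a minimal sketch (an editor's addition, not from the original report; it assumes the built wheel's distribution name is onnxruntime-genai, which may differ for CUDA builds):

from importlib.metadata import version
import onnxruntime_genai as og

# Confirm the freshly built wheel imports cleanly and report its version.
print("onnxruntime-genai:", version("onnxruntime-genai"))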

Platform: Jetson Orin

Additional context

onnxruntime-genai was built from source without encountering any CUDA-related problems; the error above only appears when loading the model. I would appreciate any assistance in diagnosing and correcting this problem.

kunal-vaishnavi commented 2 weeks ago

Can you upgrade your version of ONNX Runtime? The GroupQueryAttention op was updated to support more inputs, and ONNX Runtime v1.16.3 does not have that change.
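To see the mismatch directly, here is a minimal diagnostic sketch (not from the thread; it assumes the onnx Python package is installed and reuses the model path from the traceback):

import onnx
import onnxruntime

# The installed runtime: 1.16.3 accepts exactly 7 inputs for GroupQueryAttention.
print("onnxruntime:", onnxruntime.__version__)

# Load only the graph structure; skip the external weight data.
model = onnx.load("cuda-int4-rtn-block-32/phi-3-v-128k-instruct-text.onnx",
                  load_external_data=False)
for node in model.graph.node:
    if node.op_type == "GroupQueryAttention":
        # The exported model passes 9 inputs (two of them empty optional slots).
        print(node.name, "has", len(node.input), "inputs")
        break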

JehanJaye commented 2 weeks ago

Thanks! Since I am using JetPack 5.1.1 with CUDA 11.4, I couldn't find a newer pre-compiled onnxruntime-gpu tarball for gpu-linux-aarch64 than that one.

Is there any alternative to compiling and building a supported onnxruntime-gpu tarball from source? I use that extracted tarball as ort_home to build onnxruntime-genai.

However, I did find a newer gpu aarch64 build of onnxruntime published as a .whl. When building onnxruntime-genai from source (build.py), is there any option other than pointing ort_home at an onnxruntime-gpu directory?

kunal-vaishnavi commented 2 weeks ago

ONNX Runtime GenAI requires the shared libraries and the C API header file from ONNX Runtime. To get the shared libraries, you can install the .whl and copy the shared libraries from onnxruntime/capi/ that match the libonnxruntime*.so* pattern. To get the header file, you can download include/onnxruntime/core/session/onnxruntime_c_api.h from an official ONNX Runtime release branch. Official release branches are named rel-{ORT_VERSION}.
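In other words, the goal of the steps below is an ort/ folder laid out roughly like this (matching what the commands in each step create):

ort/
├── include/
│   └── onnxruntime_c_api.h
└── lib/
    └── libonnxruntime*.so*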

For example:

1) Download and install the ONNX Runtime .whl file

For example, wheels for Jetson appear to be published here.

$ wget https://nvidia.box.com/shared/static/qnm7xtdemybuyog3yzz4qio3ly8fvi6r.whl -O onnxruntime_gpu-1.18.0-cp39-cp39-linux_aarch64.whl
$ pip install onnxruntime_gpu-1.18.0-cp39-cp39-linux_aarch64.whl
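After installing, a quick version check (a hedged sketch; the onnxruntime-gpu wheel installs its module under the name onnxruntime):

import onnxruntime

print(onnxruntime.__version__)   # expect 1.18.0 for this wheel
print(onnxruntime.get_device())  # expect "GPU" for a CUDA-enabled build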

2) Clone ONNX Runtime GenAI and prepare folders

$ git clone https://github.com/microsoft/onnxruntime-genai
$ cd onnxruntime-genai
$ mkdir -p ort/include/
$ mkdir -p ort/lib/

3) Find where the .whl is installed

This example uses onnxruntime-gpu as the package name to search. Please change this to the package name you installed.

$ pip show onnxruntime-gpu
Name: onnxruntime-gpu
Version: 1.18.0
Summary: ONNX Runtime is a runtime accelerator for Machine Learning models
Home-page: https://onnxruntime.ai
Author: Microsoft Corporation
Author-email: onnxruntime@microsoft.com
License: MIT License
Location: /path/to/.local/lib/python3.9/site-packages
Requires: coloredlogs, flatbuffers, numpy, packaging, protobuf, sympy
Required-by:
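If you prefer, the same location can be found from Python itself (a small sketch; it works whether the installed package is onnxruntime or onnxruntime-gpu, since both provide the onnxruntime module):

import os
import onnxruntime

# The parent of this directory is the site-packages path reported by pip show.
print(os.path.dirname(onnxruntime.__file__))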

4) Copy shared libraries to ort/lib/

This uses /path/to/.local/lib/python3.9/site-packages as the example location. Please change this to the location shown in the previous step.

$ cp /path/to/.local/lib/python3.9/site-packages/onnxruntime/capi/libonnxruntime*.so* ort/lib/
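Alternatively, a short Python sketch (an editor's convenience, not from the thread) that resolves the site-packages path automatically and copies the matching libraries; run it from the onnxruntime-genai root so ort/lib/ exists:

import glob
import os
import shutil
import onnxruntime

# Locate onnxruntime/capi/ inside the installed package and copy the shared libraries.
capi_dir = os.path.join(os.path.dirname(onnxruntime.__file__), "capi")
for lib in glob.glob(os.path.join(capi_dir, "libonnxruntime*.so*")):
    shutil.copy2(lib, "ort/lib/")
    print("copied", lib)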

5) Download C API header file to ort/include/

This uses rel-1.18.0 as the example branch since the example pip package version is 1.18.0. Please replace 1.18.0 with the version you want to use.

$ cd ort/include/
$ wget https://raw.githubusercontent.com/microsoft/onnxruntime/rel-1.18.0/include/onnxruntime/core/session/onnxruntime_c_api.h

Note the raw.githubusercontent.com host: the github.com/.../blob/... URL returns the HTML page for the file, not the header itself.
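A quick sanity check on the download (a hedged sketch; it only guards against accidentally saving GitHub's HTML page instead of the header):

# Run from ort/include/ right after the wget above.
with open("onnxruntime_c_api.h") as f:
    head = f.read(200)
assert "<html" not in head.lower(), "got an HTML page; use the raw.githubusercontent.com URL"
print(head.splitlines()[0])  # should look like C header content, e.g. a copyright comment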

6) Build ONNX Runtime GenAI from source

Please modify the python build.py command as needed for your build; for the Jetson setup above, the flags from the original repro steps (--use_cuda --cuda_home /usr/local/cuda-11.4 --skip_tests --skip_csharp --parallel) would likely apply. For more details, please visit here.

$ cd ../../
$ python build.py

JehanJaye commented 1 week ago

Thanks for the very detailed response. Will try this out and update here.

@kunal-vaishnavi Any estimated release date for the Phi-3.5-vision ONNX models?

kunal-vaishnavi commented 1 week ago

@kunal-vaishnavi Any estimated release date for the Phi-3.5-vision ONNX models?

The work is in progress and we are working to complete it soon, but there's no estimated release date because the Phi-3.5 vision ONNX models will need to undergo Microsoft's Responsible AI evaluations before they can be published officially. If the evaluations take a while, I can publish a tutorial once all of the work is merged into ONNX Runtime GenAI so that you can generate your own ONNX models locally and run them.