microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime
MIT License

onnxruntime_genai.onnxruntime_genai.OrtException when running Phi-3-Vision ONNX model #849

Open JehanJaye opened 2 weeks ago

JehanJaye commented 2 weeks ago

Describe the bug

Running python3 -m cuda-int4-rtn-block-32 fails with the following error:

Loading model...
Traceback (most recent call last):
  File "", line 66, in <module>
    run(args)
  File "", line 16, in run
    model = og.Model(args.model_path)
onnxruntime_genai.onnxruntime_genai.OrtException: Load model from cuda-int4-rtn-block-32/phi-3-v-128k-instruct-text.onnx failed:This is an invalid model. In Node, ("/model/layers.0/attn/GroupQueryAttention", GroupQueryAttention, "", -1) : ("/model/layers.0/attn/qkv_proj/MatMul/output_0": tensor(float16),"","","past_key_values.0.key": tensor(float16),"past_key_values.0.value": tensor(float16),"/model/attn_mask_reformat/attn_mask_subgraph/Sub/Cast/output_0": tensor(int32),"/model/attn_mask_reformat/attn_mask_subgraph/Gather/Cast/output_0": tensor(int32),"cos_cache": tensor(float16),"sin_cache": tensor(float16),) -> ("/model/layers.0/attn/GroupQueryAttention/output_0": tensor(float16),"present.0.key": tensor(float16),"present.0.value": tensor(float16),) , Error Node (/model/layers.0/attn/GroupQueryAttention) has input size 9 not in range [min=7, max=7].

To Reproduce

Quantized Phi-3-vision model in ONNX format on the Jetson ORIN

  1. Compile ONNX Runtime for JetPack 5.1.1 with CUDA 11.4:
     $ wget
     $ mkdir ort
     $ tar -xvf onnxruntime-gpu-1.16.3.tar.gz -C ort
     $ mv ort/include/onnxruntime/onnxruntime_c_api.h ort/include/
     $ rm -rf ort/include/onnxruntime/
  2. Compile the onnxruntime-genai repository: switch to 940bc
  3. Build:
     $ python3 --use_cuda --cuda_home /usr/local/cuda-11.4 --skip_tests --skip_csharp --parallel
  4. Install the generated wheel:
     $ pip3 install *.whl
  5. Install the Hugging Face CLI:
     $ pip3 install huggingface-hub[cli]
  6. Download the Phi-3-vision ONNX model:
     $ huggingface-cli download microsoft/Phi-3-vision-128k-instruct-onnx-cuda --include cuda-int4-rtn-block-32/* --local-dir .
  7. Download the example script:
     $ wget
  8. Run inference:
     $ python3 -m cuda-int4-rtn-block-32


Additional context

onnxruntime-genai built from source without any CUDA-related problems. However, loading the model fails with the error shown above, which appears to be related to the model itself. I would appreciate any help diagnosing and fixing this problem.

kunal-vaishnavi commented 2 weeks ago

Can you upgrade your version of ONNX Runtime? The GroupQueryAttention op was updated to support more inputs and ONNX Runtime v1.16.3 does not have that change.
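To make the mismatch concrete, the tail of the error message can be parsed to show how many inputs the model's node carries versus how many the installed runtime accepts. The following is a minimal sketch that only handles the message format quoted in this issue; the function name `parse_input_size_error` and the regular expression are illustrative assumptions, not part of ONNX Runtime.

```python
import re

# Matches the tail of the message quoted in this issue. The pattern is an
# assumption based on that single message, not a documented format.
NODE_INPUT_RE = re.compile(
    r"Error Node \((?P<node>[^)]+)\) has input size (?P<got>\d+) "
    r"not in range \[min=(?P<min>\d+), max=(?P<max>\d+)\]"
)


def parse_input_size_error(message):
    """Return node name, actual input count, and accepted range, or None."""
    match = NODE_INPUT_RE.search(message)
    if match is None:
        return None
    return {
        "node": match.group("node"),
        "got": int(match.group("got")),
        "min": int(match.group("min")),
        "max": int(match.group("max")),
    }


# Tail of the OrtException text from the report above.
message = (
    "Error Node (/model/layers.0/attn/GroupQueryAttention) "
    "has input size 9 not in range [min=7, max=7]."
)
info = parse_input_size_error(message)
print(info)
```

Here the model passes 9 inputs to GroupQueryAttention while the installed runtime accepts exactly 7, which is consistent with the runtime predating the operator update.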

JehanJaye commented 2 weeks ago

Thanks! Since I am using JetPack 5.1.1 with CUDA 11.4, I couldn't find a pre-compiled onnxruntime-gpu tarball newer than that version that supports gpu-linux-aarch64.

Is there any alternative to building a supported onnxruntime-gpu tarball from source? I use the extracted tarball as ort_home to build onnxruntime-genai.

Besides this approach, I was able to find a newer pre-compiled GPU aarch64 version of onnxruntime as a .whl. When building onnxruntime-genai from source, is there an option other than pointing ort_home at the extracted onnxruntime-gpu directory?

kunal-vaishnavi commented 2 weeks ago

ONNX Runtime GenAI requires the shared libraries and the C API header file from ONNX Runtime. To get the shared libraries, you can install the .whl and copy the shared libraries from onnxruntime/capi/ that match the libonnxruntime*.so* pattern. To get the header file, you can download the header file from include/onnxruntime/core/session/onnxruntime_c_api.h using an official ONNX Runtime release branch. Official release branches are named as rel-{ORT_VERSION}.

For example:

1) Download and install the ONNX Runtime .whl file

For example, wheels for Jetson appear to be published here.

$ wget -O onnxruntime_gpu-1.18.0-cp39-cp39-linux_aarch64.whl
$ pip install onnxruntime_gpu-1.18.0-cp39-cp39-linux_aarch64.whl

2) Clone ONNX Runtime GenAI and prepare folders

$ git clone
$ cd onnxruntime-genai
$ mkdir -p ort/include/
$ mkdir -p ort/lib/

3) Find where the .whl is installed

This example is using onnxruntime-gpu as the package name to search. Please change this to the package name you installed.

$ pip show onnxruntime-gpu
Name: onnxruntime-gpu
Version: 1.18.0
Summary: ONNX Runtime is a runtime accelerator for Machine Learning models
Author: Microsoft Corporation
License: MIT License
Location: /path/to/.local/lib/python3.9/site-packages
Requires: coloredlogs, flatbuffers, numpy, packaging, protobuf, sympy
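The same lookup can be done from Python with the standard library, which may be handy when scripting the remaining steps. This is a minimal sketch: the helper name `find_distribution` is hypothetical, and trying `onnxruntime` as a fallback package name is an assumption (only `onnxruntime-gpu` appears in this thread).

```python
from importlib import metadata


def find_distribution(names=("onnxruntime-gpu", "onnxruntime")):
    """Return (name, version, location) for the first installed candidate.

    Returns None when none of the candidate packages is installed.
    """
    for name in names:
        try:
            dist = metadata.distribution(name)
        except metadata.PackageNotFoundError:
            continue
        # locate_file("") resolves to the directory that holds the package,
        # i.e. the "Location:" field printed by `pip show`.
        return name, dist.version, str(dist.locate_file(""))
    return None


result = find_distribution()
if result is None:
    print("No ONNX Runtime package found in this environment")
else:
    print("name=%s version=%s location=%s" % result)
```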

4) Copy shared libraries to ort/lib/

This is using /path/to/.local/lib/python3.9/site-packages as the example location. Please change this to the location you see in the previous step.

$ cp /path/to/.local/lib/python3.9/site-packages/onnxruntime/capi/libonnxruntime*.so* ort/lib/
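For a scripted setup, the copy can also be done in Python. This is a minimal sketch under the same layout described in this step (`onnxruntime/capi/` inside the install location, `libonnxruntime*.so*` as the pattern); the function name `copy_ort_shared_libs` is hypothetical.

```python
import glob
import os
import shutil


def copy_ort_shared_libs(site_packages, dest_dir):
    """Copy libonnxruntime*.so* from onnxruntime/capi/ into dest_dir.

    Returns the sorted list of file names that were copied.
    """
    pattern = os.path.join(
        site_packages, "onnxruntime", "capi", "libonnxruntime*.so*"
    )
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for src in sorted(glob.glob(pattern)):
        shutil.copy2(src, dest_dir)
        copied.append(os.path.basename(src))
    return copied


# Example call (replace the first argument with the location from step 3):
# copy_ort_shared_libs("/path/to/.local/lib/python3.9/site-packages", "ort/lib")
```

An empty return value means the pattern matched nothing, which usually indicates that the install location or package name differs from the example.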

5) Download C API header file to ort/include/

This is using rel-1.18.0 as the example since the pip package example version is 1.18.0. Please replace 1.18.0 with the version you want to use.

$ cd ort/include/
$ wget
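Before starting the build, it may help to confirm that the prepared folder contains what ONNX Runtime GenAI needs: the C API header under `include/` and at least one shared library under `lib/`. The check below is a minimal sketch based only on the layout used in these steps; the function name `missing_ort_home_items` is hypothetical.

```python
import glob
import os


def missing_ort_home_items(ort_home):
    """Return a list of expected items that are missing from ort_home."""
    missing = []
    header = os.path.join(ort_home, "include", "onnxruntime_c_api.h")
    if not os.path.isfile(header):
        missing.append("include/onnxruntime_c_api.h")
    libs = glob.glob(os.path.join(ort_home, "lib", "libonnxruntime*.so*"))
    if not libs:
        missing.append("lib/libonnxruntime*.so*")
    return missing


# Example call from the repository root, where the ort/ folder was created:
# print(missing_ort_home_items("ort"))
```

An empty list means both pieces are in place. It does not verify that the header and the libraries come from the same ONNX Runtime version.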

6) Build ONNX Runtime GenAI from source

Please modify the python command as needed for your build. For more details, please visit here.

$ cd ../../
$ python

JehanJaye commented 1 week ago

Thanks for the very detailed response. Will try this out and update here.

@kunal-vaishnavi Is there an estimated release date for the Phi-3.5-vision ONNX model?

kunal-vaishnavi commented 1 week ago

@kunal-vaishnavi Is there an estimated release date for the Phi-3.5-vision ONNX model?

The work is in progress and we are working to complete it soon, but there's no estimated release date because the Phi-3.5 vision ONNX models will need to undergo Microsoft's Responsible AI evaluations before they can be published officially. If the evaluations take a while, I can publish a tutorial once all of the work is merged into ONNX Runtime GenAI so that you can generate your own ONNX models locally and run them.