JehanJaye opened this issue 2 weeks ago
Can you upgrade your version of ONNX Runtime? The GroupQueryAttention
op was updated to support more inputs and ONNX Runtime v1.16.3 does not have that change.
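As a quick sanity check before building, the installed ONNX Runtime version can be compared against a release known (from later in this thread) to include the updated GroupQueryAttention schema. This is an illustrative sketch, not part of either library; `supports_updated_gqa` and the `1.18.0` threshold are assumptions based on this thread (on a real system you would pass `onnxruntime.__version__` as the installed version):

```python
# Illustrative sketch: compare an ONNX Runtime version string against the
# 1.18.0 build used later in this thread. `supports_updated_gqa` is a
# hypothetical helper, not an ONNX Runtime API.
def version_tuple(version):
    # "1.16.3" -> (1, 16, 3) for numeric comparison
    return tuple(int(part) for part in version.split("."))

def supports_updated_gqa(installed, known_good="1.18.0"):
    # Assumption: 1.16.3 predates the GroupQueryAttention update, while
    # 1.18.0 is known from this thread to include it.
    return version_tuple(installed) >= version_tuple(known_good)

print(supports_updated_gqa("1.16.3"))  # False
print(supports_updated_gqa("1.18.0"))  # True
```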
Thanks! Since I am using JetPack 5.1.1 with CUDA 11.4, I couldn't find a newer pre-compiled onnxruntime-gpu tarball that supports gpu-linux-aarch64. Is there any alternative besides compiling a supported onnxruntime-gpu tarball from source? I use the extracted tarball as ort_home when building onnxruntime-genai.
However, I was able to find a newer pre-compiled aarch64 GPU build of onnxruntime as a .whl. When building onnxruntime-genai from source (build.py), instead of pointing ort_home at the onnxruntime-gpu directory, is there any other option here?
ONNX Runtime GenAI requires the shared libraries and the C API header file from ONNX Runtime. To get the shared libraries, you can install the .whl and copy the libraries matching the libonnxruntime*.so* pattern from onnxruntime/capi/. To get the header file, you can download include/onnxruntime/core/session/onnxruntime_c_api.h from an official ONNX Runtime release branch. Official release branches are named rel-{ORT_VERSION}.
For example:

Download the .whl file. Wheels for Jetson appear to be published here.
$ wget https://nvidia.box.com/shared/static/qnm7xtdemybuyog3yzz4qio3ly8fvi6r.whl -O onnxruntime_gpu-1.18.0-cp39-cp39-linux_aarch64.whl
$ pip install onnxruntime_gpu-1.18.0-cp39-cp39-linux_aarch64.whl
$ git clone https://github.com/microsoft/onnxruntime-genai
$ cd onnxruntime-genai
$ mkdir -p ort/include/
$ mkdir -p ort/lib/
Verify that the .whl is installed. This example uses onnxruntime-gpu as the package name to search; please change this to the package name you installed.
$ pip show onnxruntime-gpu
Name: onnxruntime-gpu
Version: 1.18.0
Summary: ONNX Runtime is a runtime accelerator for Machine Learning models
Home-page: https://onnxruntime.ai
Author: Microsoft Corporation
Author-email: onnxruntime@microsoft.com
License: MIT License
Location: /path/to/.local/lib/python3.9/site-packages
Requires: coloredlogs, flatbuffers, numpy, packaging, protobuf, sympy
Required-by:
Copy the shared libraries into ort/lib/. This example uses /path/to/.local/lib/python3.9/site-packages as the location; please change this to the location shown in the previous step.
$ cp /path/to/.local/lib/python3.9/site-packages/onnxruntime/capi/libonnxruntime*.so* ort/lib/
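The same copy step can also be scripted. This is a minimal Python sketch of the `cp` command above; the `site_packages` value is the placeholder path from `pip show` and must be replaced with your real location:

```python
# Minimal sketch of the copy step above. The site_packages value is the
# placeholder from `pip show`; replace it with your actual Location.
import glob
import pathlib
import shutil

site_packages = "/path/to/.local/lib/python3.9/site-packages"  # placeholder
dest = pathlib.Path("ort/lib")
dest.mkdir(parents=True, exist_ok=True)

# Copy every library matching libonnxruntime*.so* from the capi/ directory.
pattern = f"{site_packages}/onnxruntime/capi/libonnxruntime*.so*"
for lib in glob.glob(pattern):
    shutil.copy2(lib, dest / pathlib.Path(lib).name)
print(dest.is_dir())  # True
```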
Download the header file into ort/include/. This example uses rel-1.18.0 because the pip package version above is 1.18.0; please replace 1.18.0 with the version you want to use.
$ cd ort/include/
$ wget https://raw.githubusercontent.com/microsoft/onnxruntime/rel-1.18.0/include/onnxruntime/core/session/onnxruntime_c_api.h
Please modify the python build.py
command as needed for your build. For more details, please visit here.
$ cd ../../
$ python build.py
Thanks for the very detailed response. Will try this out and update here.
@kunal-vaishnavi Any estimated release date for release Phi-3.5-vision ONNX?
The work is in progress and we are working to complete it soon, but there's no estimated release date because the Phi-3.5 vision ONNX models will need to undergo Microsoft's Responsible AI evaluations before they can be published officially. If the evaluations take a while, I can publish a tutorial once all of the work is merged into ONNX Runtime GenAI so that you can generate your own ONNX models locally and run them.
Describe the bug python3 phi3v.py -m cuda-int4-rtn-block-32 gives the following issue:
Loading model...
Traceback (most recent call last):
  File "phi3v.py", line 66, in <module>
    run(args)
  File "phi3v.py", line 16, in run
    model = og.Model(args.model_path)
onnxruntime_genai.onnxruntime_genai.OrtException: Load model from cuda-int4-rtn-block-32/phi-3-v-128k-instruct-text.onnx failed: This is an invalid model. In Node, ("/model/layers.0/attn/GroupQueryAttention", GroupQueryAttention, "com.microsoft", -1) : ("/model/layers.0/attn/qkv_proj/MatMul/output_0": tensor(float16), "", "", "past_key_values.0.key": tensor(float16), "past_key_values.0.value": tensor(float16), "/model/attn_mask_reformat/attn_mask_subgraph/Sub/Cast/output_0": tensor(int32), "/model/attn_mask_reformat/attn_mask_subgraph/Gather/Cast/output_0": tensor(int32), "cos_cache": tensor(float16), "sin_cache": tensor(float16)) -> ("/model/layers.0/attn/GroupQueryAttention/output_0": tensor(float16), "present.0.key": tensor(float16), "present.0.value": tensor(float16)), Error Node (/model/layers.0/attn/GroupQueryAttention) has input size 9 not in range [min=7, max=7].
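As a reading aid for the error above: the exported model passes 9 inputs to GroupQueryAttention, while the GQA schema registered in ONNX Runtime 1.16.3 accepts exactly 7. The sketch below restates that arithmetic with the input list copied from the error message; the per-input comments (packed QKV, cos/sin cache) are interpretations, not taken verbatim from the error:

```python
# The nine inputs listed in the error message (empty strings are optional
# inputs the exporter left unbound).
gqa_inputs = [
    "/model/layers.0/attn/qkv_proj/MatMul/output_0",  # packed QKV projection
    "",  # key slot (unused when QKV is packed)
    "",  # value slot (unused when QKV is packed)
    "past_key_values.0.key",
    "past_key_values.0.value",
    "/model/attn_mask_reformat/attn_mask_subgraph/Sub/Cast/output_0",
    "/model/attn_mask_reformat/attn_mask_subgraph/Gather/Cast/output_0",
    "cos_cache",  # rotary-embedding cache; not in the 1.16.3 GQA schema
    "sin_cache",  # rotary-embedding cache; not in the 1.16.3 GQA schema
]

# Input-count range that ONNX Runtime 1.16.3 reports in the error.
min_inputs, max_inputs = 7, 7

print(len(gqa_inputs))                              # 9
print(min_inputs <= len(gqa_inputs) <= max_inputs)  # False
```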
To Reproduce
Quantized Phi-3-vision model in ONNX format on the Jetson ORIN
wget http://jetson.webredirect.org:8000/jp5/cu114/onnxruntime-gpu-1.16.3.tar.gz
mkdir ort
tar -xvf onnxruntime-gpu-1.16.3.tar.gz -C ort
mv ort/include/onnxruntime/onnxruntime_c_api.h ort/include/
rm -rf ort/include/onnxruntime/
python3 build.py --use_cuda --cuda_home /usr/local/cuda-11.4 --skip_tests --skip_csharp --parallel
pip3 install *.whl
pip3 install huggingface-hub[cli]
huggingface-cli download microsoft/Phi-3-vision-128k-instruct-onnx-cuda --include cuda-int4-rtn-block-32/* --local-dir .
wget https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3v.py
python3 phi3v.py -m cuda-int4-rtn-block-32
JETSON-ORIN
Additional context: onnxruntime-genai built from source without any CUDA-related problems. However, when loading the model I get this error about the model itself. I would appreciate any assistance in diagnosing and correcting it.