microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime
MIT License

`onnxruntime-genai` generation speed very slow on int4 #1098

Open tarekziade opened 2 days ago

tarekziade commented 2 days ago

I have built a small example using the Python binding here https://github.com/tarekziade/onnxruntime-test/blob/main/run.py to measure the inference speed on my Apple M1 and on a Windows 11 box, using Qwen 2.5 0.5B Instruct.

To prepare the model I used the CPU provider and the int4/fp16/fp32 precisions:

```
python3 -m onnxruntime_genai.models.builder -m "Qwen/Qwen2.5-0.5B-Instruct" -o qwen -p int4 -e cpu
python3 -m onnxruntime_genai.models.builder -m "Qwen/Qwen2.5-0.5B-Instruct" -o qwen -p fp32 -e cpu
python3 -m onnxruntime_genai.models.builder -m "Qwen/Qwen2.5-0.5B-Instruct" -o qwen -p fp16 -e cpu
```

I then compared the execution times with llama-cli, using a GGUF of the same model quantized to q4_0.
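For reference, the measurement is roughly a plain greedy decode timed end to end. Below is a minimal sketch of that kind of loop, assuming the 0.4/0.5-era onnxruntime-genai Python API; the prompt, model path, and max_length are placeholders, and newer releases replace `compute_logits()`/`input_ids` with `append_tokens()`:

```python
import time
import onnxruntime_genai as og

model = og.Model("qwen")  # folder produced by the model builder (placeholder path)
tokenizer = og.Tokenizer(model)

prompt = "Write a short story about a fox."  # placeholder prompt
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = input_tokens  # newer releases use generator.append_tokens() instead

generator = og.Generator(model, params)

start = time.perf_counter()
generated = 0
while not generator.is_done():
    generator.compute_logits()       # dropped in newer API versions
    generator.generate_next_token()
    generated += 1
elapsed = time.perf_counter() - start

# Throughput is simply generated tokens divided by wall-clock time.
print(f"{generated / elapsed:.2f} tokens/s")
```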

[Charts: tokens per second for each precision, on Apple M1 and Windows 11]

On Apple, the int4 precision is extremely slow, and fp16 failed on both platforms with:

```
onnxruntime_genai.onnxruntime_genai.OrtException:
Non-zero status code returned while running Cast node.
Name:'InsertedPrecisionFreeCast_/model/layers.1/attn/v_proj/repeat_kv/Reshape_4/output_0'
Status Message: /Users/runner/work/1/s/onnxruntime/core/framework/op_kernel.cc:83 virtual OrtValue *onnxruntime::OpKernelContext::OutputMLValue(int, const onnxruntime::TensorShape &) status.IsOK() was false.
Shape mismatch attempting to re-use buffer. {1,1,896} != {1,248,896}.
Validate usage of dim_value (values should be > 0) and dim_param (all values with the same string should equate to the same size) in shapes in the model.
```

I was wondering if I did something wrong? I was also wondering if int8 precision is an option. It looks like onnxruntime_genai.models.builder can apply some int8 quantization as part of the int4 mode, but I am not entirely clear about this.
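One way to see what the builder actually emitted for a given precision is to count the op types in the exported graph. Here is a minimal sketch using the onnx Python package; the path is a placeholder for the builder's output folder:

```python
from collections import Counter
import onnx

# Load the decoder graph produced by the model builder (placeholder path).
model = onnx.load("qwen/model.onnx", load_external_data=False)

# Tally op types; int4 weight quantization typically shows up as MatMulNBits
# nodes, while any remaining float MatMul/Gemm nodes are left unquantized.
ops = Counter(node.op_type for node in model.graph.node)
for op_type, count in sorted(ops.items()):
    print(f"{op_type}: {count}")
```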

elephantpanda commented 2 days ago

Your graph says "tokens per second" not "execution time".

Your graph says int4 does the most tokens per second.

So your graph seems to be saying the opposite of what you are saying, unless you labelled the axis wrong? 😕

tarekziade commented 1 day ago

Your graph says "tokens per second" not "execution time".

Yes. That's a way to measure execution time -- or at least "speed" :)

> Your graph says int4 does the most tokens per second.

Correct. For llama-cli it's the highest, 140 tokens/s. For onnx, it's the lowest, 4.85 tokens/s.

Results in JSON: https://github.com/tarekziade/onnxruntime-test/blob/main/results.json

> So your graph seems to be saying the opposite of what you are saying, unless you labelled the axis wrong? 😕

I don't think it does; maybe what is confusing is that the graph includes both onnx and llama.cpp results?

elephantpanda commented 1 day ago

I see, so the ones labelled "onnx" are the ones you are running in genai and the ones labelled "llama" are the ones running in llama.cpp. Yes, sorry, I got confused because Llama is also the name of an LLM. Yeah, that looks very bad on the Mac. Guessing they haven't optimised it for Mac yet, then. 😔 I have tried int4 on Windows DML, and when it was partially working (version 0.4.0) it was very fast.

ambroser53 commented 8 hours ago

+1 to two of the issues raised. I am getting the exact same error on the fp16 version of my model:

```
Shape mismatch attempting to re-use buffer. {1,1,3072} != {1,808,3072}. Validate usage of dim_value (values should be > 0) and dim_param (all values with the same string should equate to the same size) in shapes in the model.
```

Plus I am also confused about the int8 support in the model builder. It seems it is supported to a certain extent:

```python
io_dtype = TensorProto.FLOAT if precision in {"int8", "fp32"}
```

But similarly I get an error if I actually attempt to use it:

```
NotImplementedError: The int8 precision is not currently supported.
```

Clarification would be helpful (especially as int8 can be supported through other means, such as exporting with GPTQ or using TensorRT's Model Optimizer).
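As an illustration of one of those "other means", a hedged sketch is to run onnxruntime's dynamic quantizer over the fp32 export to get int8 weights. This shows only the general technique, not a verified onnxruntime-genai workflow (whether the quantized graph still loads alongside genai_config.json is untested here), and the paths are placeholders:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the fp32 decoder exported by the model builder to int8 weights.
# Paths are placeholders; external-data files may need separate handling
# for larger models.
quantize_dynamic(
    model_input="qwen-fp32/model.onnx",
    model_output="qwen-int8/model.onnx",
    weight_type=QuantType.QInt8,
)
```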