tarekziade opened 2 days ago
Your graph says "tokens per second" not "execution time".
Your graph says int4 does the most tokens per second.
So your graph seems to be saying the opposite of what you are saying, unless you labelled the axis wrong? 😕
Your graph says "tokens per second" not "execution time".
Yes. That's a way to measure execution time -- or at least "speed" :)
> Your graph says int4 does the most tokens per second.
Correct. For llama-cli it's the highest, 140 tokens/s. For onnx it's the lowest, 4.85 tokens/s.
Results in JSON: https://github.com/tarekziade/onnxruntime-test/blob/main/results.json
> So your graph seems to be saying the opposite of what you are saying, unless you labelled the axis wrong? 😕
I don't think it does; maybe what is confusing is that the graph includes both onnx and llama.cpp results?
I see, so the ones labelled "onnx" are the ones you are running in genai and the ones labelled "llama" are the ones running in llama.cpp. Sorry, I got confused because Llama is also the name of an LLM. That looks very bad on the Mac; I'm guessing they haven't optimised it for the Mac yet. 😔 I have tried int4 on Windows with DML and, when it was partially working (version 0.4.0), it was very fast.
+1 to two of the issues raised. I am getting the exact same error on the fp16 version of my model:
Shape mismatch attempting to re-use buffer. {1,1,3072} != {1,808,3072}. Validate usage of dim_value (values should be > 0) and dim_param (all values with the same string should equate to the same size) in shapes in the model.
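In case it helps with debugging, here is a rough sketch of how to check which axes the exported graph actually declares as dynamic (dim_param) versus fixed (dim_value), using the onnx package; the model path is a placeholder:

```python
# Rough sketch: list the input shapes of the exported model to see which axes
# are declared dynamic (dim_param, a name) vs fixed (dim_value, a number).
# The path is a placeholder for wherever the builder wrote model.onnx.
import onnx

model = onnx.load("qwen2.5-0.5b-fp16/model.onnx")
for inp in model.graph.input:
    dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)
```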
Plus I am also confused about the int8 support in the model builder. It seems to be supported to a certain extent:
io_dtype = TensorProto.FLOAT if precision in {"int8", "fp32"}
But similarly I get an error if I actually attempt to use it:
NotImplementedError: The int8 precision is not currently supported.
Clarification would be helpful, especially as int8 can be supported through other means, such as exporting with GPTQ or using TensorRT's Model Optimizer.
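On that note, a rough sketch of one alternative int8 route that bypasses the builder entirely, using onnxruntime's dynamic quantization tooling (paths are placeholders, and whether the genai pipeline consumes such a model cleanly is a separate question):

```python
# Rough sketch: plain dynamic int8 quantization with onnxruntime's
# quantization tools, as an alternative to the builder's precision flag.
# Input/output paths are placeholders.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="qwen2.5-0.5b-fp32/model.onnx",
    model_output="qwen2.5-0.5b-int8/model.onnx",
    weight_type=QuantType.QInt8,
)
```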
I have built a small example using the Python binding, here: https://github.com/tarekziade/onnxruntime-test/blob/main/run.py, to measure the inference speed of Qwen 2.5 0.5B Instruct on my Apple M1 and on a Windows 11 box.
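The measurement itself is essentially a timed generation loop, dividing generated tokens by elapsed wall-clock time. A rough sketch of that idea, written against the 0.4-era genai Python API (the linked run.py is the authoritative version; the model folder and prompt below are placeholders, and the exact calls differ between releases):

```python
# Rough sketch of a tokens-per-second measurement with onnxruntime-genai's
# Python API (0.4-era style; names may differ in other releases).
import time
import onnxruntime_genai as og

model = og.Model("qwen2.5-0.5b-int4")   # folder produced by the builder (placeholder)
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode("Tell me a short story.")

generator = og.Generator(model, params)
start = time.perf_counter()
generated = 0
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    generated += 1
elapsed = time.perf_counter() - start

print(tokenizer.decode(generator.get_sequence(0)))
print(f"{generated / elapsed:.2f} tokens/s")
```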
To prepare the model I used the cpu provider and the int4/fp16/fp32 precisions:
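Roughly this invocation, repeated with -p int4 / fp16 / fp32 (the output folder name is a placeholder):

```bash
# model builder invocation for the int4 CPU variant
python -m onnxruntime_genai.models.builder \
    -m Qwen/Qwen2.5-0.5B-Instruct \
    -o qwen2.5-0.5b-int4 \
    -p int4 \
    -e cpu
```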
And I compared the execution times with llama-cli, using a q4_0 GGUF of the same model:
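Something along these lines on the llama.cpp side (the GGUF filename, prompt and token count are placeholders):

```bash
# llama.cpp comparison run with the q4_0 quantization of the same model
llama-cli -m qwen2.5-0.5b-instruct-q4_0.gguf -p "Tell me a short story." -n 128
```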
On the Apple M1, the int4 precision is extremely slow, and fp16 failed on both platforms with a shape mismatch error (the same one quoted in the comment above).
I was wondering if I did something wrong? I was also wondering if int8 precision is an option: it looks like onnxruntime_genai.models.builder can apply some int8 quantization via the int4 mode, but I am not entirely clear on this.