gidzr opened this issue 1 year ago
Hi @gidzr, can you try using ORTModelForSeq2SeqLM.from_pretrained(..., use_io_binding=True) to initialize the ORT model?
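For reference, a minimal sketch of that initialization; the local path and the GPU provider below are placeholders, not taken from this thread:

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Placeholder path to an exported seq2seq ONNX model (including the with-past decoder)
model = ORTModelForSeq2SeqLM.from_pretrained(
    "/path/to/onnx_model",
    provider="CUDAExecutionProvider",  # IO binding is aimed at GPU inference
    use_io_binding=True,               # bind inputs/outputs directly on the device
    use_cache=True,                    # cached (with-past) decoder; see the constraint discussed below
)
```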
In Xenova/bart-large-cnn there is a quantized and non-quantized model in the same repo (which is probably not loadable in ORTModel actually @xenova). Which one are you using?
@fxmarty I manually changed the model names in the local folder to ensure the additional models aren't detected: I changed the extensions of all the optional models by appending XXX so they aren't recognised as ONNX file mime types. When there are multiple ONNX files, the CLI error response points out that there are multiple models.
I was previously using "use_cache": False, "use_io_binding": False, because the -with-past model is not being used.
I just tried "use_cache": False, "use_io_binding": True, but got the following error:
When using CUDAExecutionProvider, the parameters combination use_cache=False, use_io_binding=True is not supported.
Please either pass use_cache=True, use_io_binding=True (default), or use_cache=False, use_io_binding=False.
It doesn't appear to be possible to set use_io_binding=True unless you are using the with-past (cache) version of the model.
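For completeness, a minimal sketch of the non-IO-binding combination that does load in this situation; the path is a placeholder, and the CPU provider reflects a setup without the with-past decoder:

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Placeholder path; mirrors the use_cache=False / use_io_binding=False setup described above,
# i.e. the combination that works when the -with-past decoder is not used.
model = ORTModelForSeq2SeqLM.from_pretrained(
    "/path/to/onnx_model",
    provider="CPUExecutionProvider",
    use_cache=False,
    use_io_binding=False,
)
```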
Hi @gidzr, can you try using ORTModelForSeq2SeqLM.from_pretrained(..., use_io_binding=True) to initialize the ORT model? In Xenova/bart-large-cnn there is a quantized and non-quantized model in the same repo (which is probably not loadable in ORTModel actually @xenova). Which one are you using?
Here are the quantization settings I was using (along with which ops were quantized and their weight types): https://huggingface.co/Xenova/bart-large-cnn/blob/main/quantize_config.json.
Hi @xenova
1) Apologies for the shocking grammar in my previous explanation. That was a 2am message. 2) Just to clarify, I've ensured that only one type of model is in the folder when inference is run; I disable the non-quantized files by changing their file extensions/mimes. 3) I'm going to try the ONNX optimize/export and quantization again on my local server to see whether the issue arises with ONNX generally, or specifically with the Xenova version, and will put my results here.
Chat soon.
Hi @xenova and @fxmarty
I wanted to follow up with some testing, as I've asked a few questions that can be rolled up and wanted to get something of substance back to you.
Based on the results, my only remaining question is:
What could cause the home-baked ONNX to go crazy, even after adjusting all the optimization and quantization settings available to the CLI command?
Inference results produced by the home-baked ONNX quantization are incoherent babbling (comparison results at the bottom of the page): [{'summary_text': 'The black hole is a space space, a space of space, in the universe, and the space of the universe. In the space, the black holes of the galaxies of the space space. The black holes is a gravitational gravitational energy. The first black hole in the Universe, a black hole, is known to have the energy of a black space. A black hole of the stars, the energy energy of the black space, is a energy.'}]
Couple of extra weird bugs for Xenova:
Incoherent results produced by the home-baked ONNX quantization: [{'summary_text': 'The black hole is a space space, a space of space, in the universe, and the space of the universe. In the space, the black holes of the galaxies of the space space. The black holes is a gravitational gravitational energy. The first black hole in the Universe, a black hole, is known to have the energy of a black space. A black hole of the stars, the energy energy of the black space, is a energy.'}]
vs. a good summary produced by the Xenova ONNX: [{'summary_text': 'A black hole is a region of spacetime where gravity is so strong that nothing can escape. The first black hole known was Cygnus X-1, identified by several researchers independently in 1971. The presence of a black hole can be inferred through its interaction with other matter and with electromagnetic radiation..'}]
@gidzr Thanks for the extensive tests! I also had issues with incoherent results with the default quantization settings, and I fixed it by setting per_channel=True and reduce_range=True. I'm not sure if your tests included this, but these should (basically) be the only differences between your model and mine.
See my conversion script for more details on the process (which, of course, uses optimum in the background).
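For anyone following along, a minimal sketch of dynamic quantization with those two flags, using onnxruntime's quantize_dynamic directly; the file names are placeholders, and the exact wiring inside the conversion script may differ:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Placeholder file names; in practice this is applied to each exported ONNX file
# (encoder_model.onnx, decoder_model.onnx, ...).
quantize_dynamic(
    model_input="decoder_model.onnx",
    model_output="decoder_model_quantized.onnx",
    per_channel=True,    # quantize weights per channel
    reduce_range=True,   # 7-bit weight range, helps avoid saturation on non-VNNI CPUs
    weight_type=QuantType.QUInt8,
)
```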
@xenova Hey, no worries re the testing :) ... thanks heaps for the steer on your script and the channel/range settings.
After running various combos, the ultimate winner by knockout is the Xenova convert.py with per_channel=true and reduce_range=true.
It's not just per_channel that was important: when I ran with reduce_range=false, or used the CLI method, which only has the --per_channel option (no reduce_range flag available), all I got was garbage out.
I also discovered that onnxruntime 1.14 doesn't handle the opsets of the quantized models from the Xenova script; only 1.16 does. In an unrelated issue I was told to drop my version of onnxruntime to try to quantize Llama 2 (https://github.com/huggingface/optimum/issues/1409#issuecomment-1735042118), hence the older version when testing. However, the current results are based on the latest 1.16 version. I'll re-test using 1.15 for Llama.
To wrap things up: what is an opset? I've checked via Google and Bing, but I can't find much on opsets aside from the fact that 19 is experimental and anything below 9 might be deprecated. Is it an umbrella concept for chipset support (avx512/arm/etc.) and optimization strategy (O1-O4), or something completely different? What are the definitions of opsets 1-19, so I can select the most appropriate one for my situation?
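For context: an opset is simply the ONNX operator-set version, i.e. which operators and operator signatures the exported graph is allowed to use; it is independent of chipset flags like avx512 and of the O1-O4 optimization levels, and it is fixed at export time. A minimal sketch of where it is set with optimum's exporter; the model id, output directory, and opset value here are illustrative only:

```python
from optimum.exporters.onnx import main_export

# Illustrative export call; opset selects the ONNX operator-set version
# that the exported graph may use.
main_export(
    "facebook/bart-large-cnn",   # example model id
    output="bart-large-cnn-onnx",
    opset=17,
)
```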
Many thanks!
CLI with --per_channel "ThereIII I�amam�am�II�I�vamamamII can�amI’namam-I”I�ia of the I can IIamam Iam’I�ima of the time of theIamn’amamThereI� I ThereI canII usually of the space of time of I ready of theamam canI�"
Convert.py with per_channel=true, reduce_range=false "The new year is a good time for a new year to be a good one. The new year can be a great time for the new year. The New Year is a great year for the first year. I'm a little bit of a new Yorker. I was a little of a little girl. I had a little boy. I think that's what the new world is like. I don't know what the world is. is like, but I don’t know what"
Convert.py with per_channel=true, reduce_range=true "A black hole is a region of spacetime where gravity is so strong that nothing can escape. The first black hole known was Cygnus X-1, identified by several researchers independently in 1971. Black holes of stellar mass form when massive stars collapse at the end of their life cycle."
Testing performed with the CLI method: with and without --per_channel, changing O1 to O4, and changing avx to avx512_vnni.
Testing performed with the convert.py method: --opset 11, 18, and unset; --per_channel true --reduce_range true; --per_channel false --reduce_range true; --per_channel true --reduce_range false.
Very interesting 👍 Good to know it aligns with the default settings for transformers.js models (which I chose based on my own small-scale tests).
cc @pcuenca (re: benchmarking efforts)
Very interesting indeed! Very surprised to see such huge differences with similar-looking configurations.
@xenova might be time for the current CLI / inference quantization methods to be retired in favour of the Joshy special!
@xenova, @fxmarty, @pcuenca: Are there any notes, docs, or URLs regarding opsets? All I can find is: https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html
Hey @xenova
I came across an issue that might be related to onnxruntime, or to some other restriction related to optimization/quantization.
I downgraded to onnxruntime 1.15, using opset 16 and 17, to quantize Llama based models.
The problem: optimization occurs, but quantization fails / gets killed. I thought this was due to memory limitations, but it was also occurring for optimized models of 3 GB, which weren't an issue for other architectures. E.g. PY007/TinyLlama-1.1B-Chat-v0.2 will optimize to 3 GB and then fail to quantize. The other Llama models will optimize to 26 GB or 12 GB files, for the various different Llama flavours/repos.
Are there any limitations when quantizing Llama models with the Xenova convert.py script?
Cheers
Are there any limitations when quantizing Llama models with the Xenova convert.py script?
It should act in the same way as performing the two steps with optimum separately: (1) conversion, followed by (2) quantization. So, nothing extra in terms of memory usage.
The problem: optimization occurs, but quantization fails / gets killed. I thought this was due to memory limitations, but it was also occurring for optimized models of 3 GB
Could you share the id of the model you are trying to convert? Are the weights stored as 4, 8, 16, or 32-bit? I do know that the quantization process is quite memory-hungry, and also depends on some other quantization settings.
@xenova
Sure thing, these are the conditions for the testing:
In all cases, I'm running a CPU server: 8 cores, 32 GB RAM, 580 GB SSD, 60 GB swap, Debian 11, latest Apache. I also originally tested with onnxruntime 1.15 and opset 17, but the latest onnxruntime 1.16.1 with opset 18 now works with Llama; I re-tested and got the same results as with the 1.15/opset 17 condition.
Models tested, using both the "causal-lm" and "default" task settings, with per_channel and reduce_range = true (per the above discussion):
'codellama/CodeLlama-7b-Instruct-hf', //CLI quanted to a 6G file
'togethercomputer/Llama-2-7B-32K-Instruct', //CLI quanted to a 6G file
'mosaicml/mpt-7b-instruct', //CLI quanted to a 6G file
'togethercomputer/RedPajama-INCITE-7B-Instruct', //CLI quanted to a 6G file
'VMware/open-llama-7b-v2-open-instruct', //CLI quanted to a 6G file
'mlfoundations/open_lm_1B', //CLI quanted to a 6G file
'PY007/TinyLlama-1.1B-Chat-v0.2', //CLI quanted to a 1G file
'microsoft/DialoGPT-large', //CLI quanted to a <1G file
Cheers
System Info
Who can help?
@philschmid, @michaelbenayoun, @JingyaHuang, @echarlaix
Reproduction (minimal, reproducible, runnable)
Hey there, I've been running inference a few different ways to see what's faster, and been getting weird results.
I've been using:
-> ORTModelForSeq2SeqLM
-> task: "summarization" with pipeline
-> model: Xenova/bart-large-cnn, which I've pulled to the local drive and am loading with a direct Path (/home/admin/models/...)
-> decoder_model_quantized and encoder_model_quantized
I've limited max new tokens and max length to 100.
I've been using an excerpt from a science paper on black holes (coz they are cool)... which is 2437 chars, or 502 tokens in length.
Running optimum runtime pipeline using the recommended setup: https://huggingface.co/docs/optimum/v1.3.0/en/onnxruntime/modeling_ort#optimum.onnxruntime.ORTModelForSeq2SeqLM.forward.example-2
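A minimal sketch of roughly that setup; the directory, file names, and generation settings below are assumptions based on the description above, not copied from the actual script:

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Assumed local copy of Xenova/bart-large-cnn, selecting the quantized graphs
model_dir = "/home/admin/models/bart-large-cnn"
model = ORTModelForSeq2SeqLM.from_pretrained(
    model_dir,
    encoder_file_name="encoder_model_quantized.onnx",
    decoder_file_name="decoder_model_quantized.onnx",
    use_cache=False,       # no with-past decoder in this setup
    use_io_binding=False,  # CPU inference
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)

text = "..."  # the ~500-token black-hole excerpt
print(summarizer(text, max_new_tokens=100))
```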
Expected behavior
When running the optimum approach per the link above (in the reproduction steps), it takes 2.5 to 3 minutes to summarize the text.
But using the very basic pipeline from facebook/bart-large-cnn, https://huggingface.co/facebook/bart-large-cnn#how-to-use, it only takes 50 seconds to summarize the same text.
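For comparison, the baseline from the facebook/bart-large-cnn model card is essentially the following (max_length capped at 100 here to match the test above):

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = "..."  # same black-hole excerpt as above
print(summarizer(text, max_length=100, min_length=30, do_sample=False))
```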
I'm not sure what's going on... I was expecting the ONNX method to blitz the original.
Is it a mismatch between the task/model/class method? Or, because the Xenova ONNX quantization was done on a different CPU/GPU setup, is the quantized version inappropriate for my specs?
What's holding ONNX back vs the original?