huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

CPU inference strangely slower using ONNX than with the original PyTorch repo #1385

Open gidzr opened 1 year ago

gidzr commented 1 year ago

System Info

Debian 11 on CPU, Python3.10
optimum : 1.13.1
onnx : 1.14.1
onnxruntime : 1.15.1

Who can help?

@philschmid, @michaelbenayoun, @JingyaHuang, @echarlaix

Information

Tasks

Reproduction (minimal, reproducible, runnable)

Hey there, I've been running inference a few different ways to see what's faster, and have been getting weird results.

I've been using ORTModelForSeq2SeqLM with the "summarization" task via pipeline, and the model Xenova/bart-large-cnn, which I've pulled to the local drive and run via a direct Path(/home/admin/models/...), using the decoder_model_quantized and encoder_model_quantized files.

I've limited max new tokens and max length to 100.

I've been using an excerpt from a science paper on black holes (coz they are cool)... which is 2437 chars, or 502 tokens in length.

I'm running the optimum ONNX Runtime pipeline using the recommended setup: https://huggingface.co/docs/optimum/v1.3.0/en/onnxruntime/modeling_ort#optimum.onnxruntime.ORTModelForSeq2SeqLM.forward.example-2
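
For reference, a minimal sketch of this setup (the local path is a placeholder, and the encoder_file_name/decoder_file_name arguments and the generation limit are assumptions based on the description above):

    from pathlib import Path

    from optimum.onnxruntime import ORTModelForSeq2SeqLM
    from transformers import AutoTokenizer, pipeline

    model_dir = Path("/home/admin/models/xenova_bart-large-cnn")  # placeholder local path

    # Point the ORT model at the quantized encoder/decoder files; no with-past decoder is used.
    model = ORTModelForSeq2SeqLM.from_pretrained(
        model_dir,
        encoder_file_name="encoder_model_quantized.onnx",
        decoder_file_name="decoder_model_quantized.onnx",
        use_cache=False,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)

    text = "..."  # the ~2437-character / ~502-token excerpt about black holes
    print(summarizer(text, max_new_tokens=100))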

Expected behavior

When running the optimum approach per the link above (in the reproduction steps), it takes 2.5 to 3 minutes to summarize the text.

But using the very basic pipeline from facebook/bart-large-cnn (https://huggingface.co/facebook/bart-large-cnn#how-to-use), it only takes 50 seconds to summarize the same text.
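
For reference, that baseline is roughly the following (a sketch; the generation limit is assumed to match the ONNX run):

    from transformers import pipeline

    # Plain PyTorch pipeline from the facebook/bart-large-cnn model card, run on CPU.
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=-1)

    text = "..."  # the same ~502-token excerpt
    print(summarizer(text, max_new_tokens=100))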

I'm not sure what's going on; I was expecting the ONNX method to blitz the original.

Is it a mismatch between the task/model/class method? Or, since the Xenova ONNX quantization was done on a different CPU/GPU setup, is the quantized version inappropriate for my specs?

What's holding ONNX back vs. the original?

fxmarty commented 1 year ago

Hi @gidzr, can you try using ORTModelForSeq2SeqLM.from_pretrained(..., use_io_binding=True) to initialize the ORT model?

In Xenova/bart-large-cnn there is a quantized and non-quantized model in the same repo (which is probably not loadable in ORTModel actually @xenova). Which one are you using?
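
For example, a minimal sketch of that initialization (the local path is a placeholder):

    from optimum.onnxruntime import ORTModelForSeq2SeqLM

    model = ORTModelForSeq2SeqLM.from_pretrained(
        "/home/admin/models/xenova_bart-large-cnn",  # placeholder local path
        use_io_binding=True,  # use_cache defaults to True, which IO binding relies on
    )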

gidzr commented 1 year ago

@fxmarty I manually renamed the model files in the local folder to ensure the additional models aren't detected: I appended XXX to the extensions of all the optional models so they aren't recognised as ONNX file types. When there are multiple ONNX files, the error response in the CLI points out that there are multiple models.

I was previously using "use_cache": False, "use_io_binding": False, because the -with-past model is not being used.

I just tried with "use_cache":False, "use_io_binding":True, but got the following error:

   When using CUDAExecutionProvider, the parameters combination use_cache=False, use_io_binding=True is not supported. 
   Please either pass use_cache=True, use_io_binding=True (default), or use_cache=False, use_io_binding=False.

It doesn't appear to be possible to set use_io_binding=True unless using the with-past (cache) version of the model.

xenova commented 1 year ago

Hi @gidzr, can you try using ORTModelForSeq2SeqLM.from_pretrained(..., use_io_binding=True) to initialize the ORT model?

In Xenova/bart-large-cnn there is a quantized and non-quantized model in the same repo (which is probably not loadable in ORTModel actually @xenova). Which one are you using?

Here are the quantization settings I was using (along with which ops were quantized and their weight types): https://huggingface.co/Xenova/bart-large-cnn/blob/main/quantize_config.json.

gidzr commented 1 year ago

Hi @xenova

  1. Apologies for the shocking grammar in my previous explanation; that was a 2am message.
  2. Just to clarify, I've ensured that only one type of model is in the folder when inference is run; I disable the non-quantized files by changing their file extensions/MIME types.
  3. I'm going to try optimising/exporting and quantizing ONNX again on my local server to see whether the issue arises with ONNX generally or specifically with the Xenova version, and will put my results here.

Chat soon.

gidzr commented 1 year ago

Hi @xenova and @fxmarty

I wanted to follow up with some testing, since I've asked a few questions that can be rolled up, and I wanted to get something of substance back to you.

Based on the results, my only remaining question is:
What could cause the Homebaked ONNX to go crazy, even after adjusting all the optimization and quantization settings available to the CLI command?

The inference results produced by the Homebaked ONNX quantization are incoherent babbling (see the comparison in the observations below).

Couple of extra weird bugs for Xenova:

Observations:

Incoherent results produced by Homebaked ONNX quantization: [{'summary_text': 'The black hole is a space space, a space of space, in the universe, and the space of the universe. In the space, the black holes of the galaxies of the space space. The black holes is a gravitational gravitational energy. The first black hole in the Universe, a black hole, is known to have the energy of a black space. A black hole of the stars, the energy energy of the black space, is a energy.'}]

VS a good summary, produced by Xenova ONNX: [{'summary_text': 'A black hole is a region of spacetime where gravity is so strong that nothing can escape. The first black hole known was Cygnus X-1, identified by several researchers independently in 1971. The presence of a black hole can be inferred through its interaction with other matter and with electromagnetic radiation..'}]

TESTING Results

A. Model-Tokenize (not pipeline) Inference

  1. Baseline - Local pytorch: 28 seconds with an Accelerate model, to(device=cpu); summary was coherent
  2. Repo: "Xenova/bart-large-cnn" using ORTModelFor..: LOTS of junk warnings, or issues with quantization? ... lots of these: 2023-09-26 05:56:04.176734927 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.6/self_attn_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model. ...
  3. LOCAL: "/home/xxx/xenova_bart-large-cnn" using ORT pipeline: 20 seconds, summary was coherent
  4. LOCAL: Homebaked onnx quant facebook-bart-large-cnn: 24 seconds, summary was a RANDOM mess.

B. Pipeline Inference

  1. Baseline - Local pytorch: 25 seconds with Accelerate; summary was coherent
  2. Repo: "Xenova/bart-large-cnn" using ORT pipeline: ERRORS Repo id must use alphanumeric chars or '-', '_', '.', '--' and...
  3. LOCAL: "/home/xxx/xenova_bart-large-cnn" using ORT pipeline: 20 seconds, summary was coherent
  4. LOCAL: Homebaked onnx quant facebook-bart-large-cnn: 24 seconds, summary was a RANDOM mess.

xenova commented 1 year ago

@gidzr Thanks for the extensive tests! I also had issues with incoherent results with the default quantization settings, and I fixed it by setting per_channel=True and reduce_range=True. I'm not sure if your tests included this, but these should (basically) be the only differences between your model and mine.

See my conversion script for more details on the process (which, of course, uses optimum in the background).
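
For comparison, a rough sketch of applying those settings directly with optimum's quantizer (directory and file names are placeholders; the linked conversion script may differ in details, and the AutoQuantizationConfig.avx2 helper is assumed here because it exposes both per_channel and reduce_range):

    from optimum.onnxruntime import ORTQuantizer
    from optimum.onnxruntime.configuration import AutoQuantizationConfig

    # Dynamic int8 quantization with per-channel weights and reduced range,
    # mirroring the per_channel=True / reduce_range=True settings mentioned above.
    qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=True, reduce_range=True)

    # Quantize one exported ONNX file at a time (encoder and decoder separately).
    quantizer = ORTQuantizer.from_pretrained("bart-large-cnn-onnx", file_name="decoder_model.onnx")
    quantizer.quantize(save_dir="bart-large-cnn-onnx-quantized", quantization_config=qconfig)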

gidzr commented 1 year ago

@xenova Hey, no worries re testing :) .. thanks heaps for the steer on your script with channel/ranges.

After running various combos, the ultimate winner by knockout is the Xenova convert.py with per_channel=true and reduce_range=true.

It's not just per_channel that was important: when I ran with reduce_range=false, or used the CLI method, which only has the --per_channel option (the reduce_range flag isn't available), all I got was garbage out.

I also discovered that onnxruntime 1.14 doesn't handle the quantized opsets produced by the Xenova script; only 1.16 does. In an unrelated issue, I was told to drop my version of onnxruntime to try quantizing Llama 2 (https://github.com/huggingface/optimum/issues/1409#issuecomment-1735042118), hence the older version when testing. However, the current results are based on the latest 1.16 version. I'll re-test using 1.15 for Llama.

To wrap things up: what is an opset? I've checked via Google and Bing, and I can't find much on opsets aside from 19 being experimental and below 9 possibly being deprecated. Is it an umbrella concept for chipset support (avx512/arm/etc.) and optimization strategy (O1-O4), or something completely different? What are the definitions of opsets 1-19, so I can select the most appropriate one for my situation?

Many thanks!

RAW RESULTS

CLI with --per_channel "ThereIII I�amam�am�II�I�vamamamII can�amI’namam-I”I�ia of the I can IIamam Iam’I�ima of the time of theIamn’amamThereI� I ThereI canII usually of the space of time of I ready of theamam canI�"

Convert.py with per_channel=true, reduce_range=false "The new year is a good time for a new year to be a good one. The new year can be a great time for the new year. The New Year is a great year for the first year. I'm a little bit of a new Yorker. I was a little of a little girl. I had a little boy. I think that's what the new world is like. I don't know what the world is. is like, but I don’t know what"

Convert.py with per_channel=true, reduce_range=true "A black hole is a region of spacetime where gravity is so strong that nothing can escape. The first black hole known was Cygnus X-1, identified by several researchers independently in 1971. Black holes of stellar mass form when massive stars collapse at the end of their life cycle."

TESTING SCENARIOS

  1. Running the CLI method: with --per_channel and without; changing O1 to O4; changing avx to avx512_vnni

  2. Running the convert.py method: --opset 11, 18, and unset; --per_channel true --reduce_range true; --per_channel false --reduce_range true; --per_channel true --reduce_range false
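
For context, a rough sketch of where that opset flag ends up when exporting through optimum's Python API (paths are placeholders; main_export is assumed to be the equivalent of the CLI export):

    from optimum.exporters.onnx import main_export

    # The opset is the ONNX operator-set version the exporter targets (it decides which
    # ONNX operators are available); it is not a chipset (avx512/arm) or O1-O4 setting.
    main_export(
        "facebook/bart-large-cnn",
        output="bart-large-cnn-onnx",
        task="text2text-generation",
        opset=17,
    )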

xenova commented 1 year ago

Very interesting 👍 Good to know it aligns with the default settings for transformers.js models (which I chose based on my own small-scale tests).

cc @pcuenca (re: benchmarking efforts)

pcuenca commented 1 year ago

Very interesting indeed! Very surprised to see such huge differences with similar-looking configurations.

gidzr commented 1 year ago

@xenova might be time for the current CLI / inference quantization methods to be retired in favour of the Joshy special!

@xenova, @fxmarty, @pcuenca: Are there any notes, docs, or urls regarding opsets? All I can find is: https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html

gidzr commented 1 year ago

Hey @xenova

Came across an issue that might be related to onnxruntime, or to some other restriction related to optimization/quantization.

I downgraded to onnxruntime 1.15, using opsets 16 and 17, to quantize Llama-based models.

The problem: optimization occurs, but quantization fails / is killed. I thought this was due to memory limitations, but it was occurring for optimized models of 3 GB size, which weren't an issue for other architectures. E.g. PY007/TinyLlama-1.1B-Chat-v0.2 will optimize to 3 GB and then fail to quantize. The other Llama models will optimize to 26 GB or 12 GB files, for the various different Llama flavours/repos.

Are there any limitations when quantizing Llama models with the Xenova convert.py script?

Cheers

xenova commented 1 year ago

Are there any limitations when quantizing Llama models with the Xenova convert.py script?

It should act in the same way as performing the two steps with optimum separately: (1) conversion, followed by (2) quantization. So, nothing extra in terms of memory usage.
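
Roughly, that two-step flow looks like the following (a sketch using the TinyLlama checkpoint mentioned above; output paths, ONNX file names, and the use_external_data_format flag for >2 GB graphs are assumptions):

    from optimum.exporters.onnx import main_export
    from optimum.onnxruntime import ORTQuantizer
    from optimum.onnxruntime.configuration import AutoQuantizationConfig

    # (1) Conversion: export the PyTorch checkpoint to ONNX.
    main_export(
        "PY007/TinyLlama-1.1B-Chat-v0.2",
        output="tinyllama-onnx",
        task="text-generation-with-past",
    )

    # (2) Quantization: this step loads the exported graph and is typically the memory-heavy part.
    qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=True, reduce_range=True)
    quantizer = ORTQuantizer.from_pretrained("tinyllama-onnx", file_name="decoder_model.onnx")
    quantizer.quantize(
        save_dir="tinyllama-onnx-quantized",
        quantization_config=qconfig,
        use_external_data_format=True,  # for ONNX graphs above the 2 GB protobuf limit
    )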

Problems, optimization occurs, but quantization fails / killed. I thought this was due to memory limitations, but this was occurring for optimized models of 3G size

Could you share the id of the model you are trying to convert? Are the weights stored as 4, 8, 16, or 32-bit? I do know that the quantization process is quite memory-hungry, and also depends on some other quantization settings.

gidzr commented 1 year ago

@xenova

Sure thing, these are the conditions for the testing:

Environment

In all cases, I'm running a server with a CPU, 8 cores, 32 GB RAM, 580 GB SSD, 60 GB swap, Debian 11, latest Apache. I also originally tested with onnxruntime 1.15 and opset 17, but the latest onnxruntime 1.16.1 and opset 18 now works with Llama; I re-tested it with the same results as the 1.15/opset 17 condition.

Outline of Steps

  1. The xenova / joshy special script successfully optimized-exported these Llama-based models to ONNX files (going from a 12 GB PyTorch size to a 26 GB optimized ONNX size).
  2. The second half of the xenova script, the quantization, failed.
  3. I then continued the process using the CLI onnxruntime script to convert the optimized ONNX models, and this was successful, in most cases reducing a 26 GB optimized ONNX file to a 6 GB quantized ONNX file.
  4. Mostly these were Llama-based, but not solely, and either size, architecture, or both may have played a role.

Script settings and results

Models tested, using both the "causal-lm" and "default" task settings, with per_channel and reduce_range = true (per the discussion above):

'codellama/CodeLlama-7b-Instruct-hf', // CLI quanted to a 6G file
'togethercomputer/Llama-2-7B-32K-Instruct', // CLI quanted to a 6G file
'mosaicml/mpt-7b-instruct', // CLI quanted to a 6G file
'togethercomputer/RedPajama-INCITE-7B-Instruct', // CLI quanted to a 6G file
'VMware/open-llama-7b-v2-open-instruct', // CLI quanted to a 6G file
'mlfoundations/open_lm_1B', // CLI quanted to a 6G file
'PY007/TinyLlama-1.1B-Chat-v0.2', // CLI quanted to a 1G file
'microsoft/DialoGPT-large', // CLI quanted to a <1G file

Cheers