huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

CPU inference using ONNX is strangely slower than the original PyTorch repo #1385

Open · gidzr opened this issue 1 year ago

gidzr commented 1 year ago

System Info

Debian 11 on CPU, Python3.10
optimum : 1.13.1
onnx : 1.14.1
onnxruntime : 1.15.1

Who can help?

@philschmid, @michaelbenayoun, @JingyaHuang, @echarlaix

Reproduction (minimal, reproducible, runnable)

Hey there, I've been running inference a few different ways to see what's fastest, and I've been getting weird results.

I've been using:
-> ORTModelForSeq2SeqLM, task "summarization", with pipeline
-> model: Xenova/bart-large-cnn, which I've pulled to the local drive and load via a direct Path(/home/admin/models/...)
-> files: decoder_model_quantized and encoder_model_quantized

I've limited max new tokens and max length to 100.

I've been using an excerpt from a science paper on black holes (because they're cool), which is 2437 chars, or 502 tokens, long.

I'm running the optimum ONNX Runtime pipeline using the recommended setup: https://huggingface.co/docs/optimum/v1.3.0/en/onnxruntime/modeling_ort#optimum.onnxruntime.ORTModelForSeq2SeqLM.forward.example-2
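
Roughly, the setup looks like this (a minimal sketch for illustration; the local path is a placeholder and the quantized file names are assumptions based on the repo layout):

    from pathlib import Path

    from transformers import AutoTokenizer, pipeline
    from optimum.onnxruntime import ORTModelForSeq2SeqLM

    # Placeholder path; the real folder holds the files pulled from Xenova/bart-large-cnn.
    model_dir = Path("/home/admin/models/xenova_bart-large-cnn")

    model = ORTModelForSeq2SeqLM.from_pretrained(
        model_dir,
        encoder_file_name="encoder_model_quantized.onnx",  # assumed file name
        decoder_file_name="decoder_model_quantized.onnx",  # assumed file name
        use_cache=False,
        use_io_binding=False,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_dir)

    summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)

    text = "..."  # 2437-char / 502-token excerpt about black holes
    print(summarizer(text, max_length=100))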

Expected behavior

Running the optimum approach per the link above (in the reproduction steps) takes 2.5 to 3 minutes to summarize the text.

But using the very basic pipeline from facebook/bart-large-cnn, https://huggingface.co/facebook/bart-large-cnn#how-to-use, takes only 50 seconds to summarize the same text.
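
For reference, the baseline is just the stock transformers pipeline from the model card (a sketch; max_length is set to 100 to match the test above):

    from transformers import pipeline

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

    text = "..."  # same 2437-char black-hole excerpt
    print(summarizer(text, max_length=100, do_sample=False))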

I'm not sure what's going on; I was expecting the ONNX method to blitz the original.

Is it a mismatch between the task/model/class method? Or, since the Xenova ONNX quantization was done on a different CPU/GPU setup, is the quantized version inappropriate for my specs?

What's holding ONNX back vs the original?

fxmarty commented 1 year ago

Hi @gidzr, can you try using ORTModelForSeq2SeqLM.from_pretrained(..., use_io_binding=True) to initialize the ORT model?

In Xenova/bart-large-cnn there is a quantized and non-quantized model in the same repo (which is probably not loadable in ORTModel actually @xenova). Which one are you using?

gidzr commented 1 year ago

@fxmarty I manually changed the model file names in the local folder to ensure the additional models aren't detected: I appended XXX to the extensions of all the optional models so they aren't recognised as ONNX files. When there are multiple ONNX files in the folder, the error response in the CLI points out that there are multiple models.

I was previously using "use_cache": False, "use_io_binding": False, because the -with-past model is not being used.

I just tried "use_cache": False, "use_io_binding": True, but got the following error:

   When using CUDAExecutionProvider, the parameters combination use_cache=False, use_io_binding=True is not supported. 
   Please either pass use_cache=True, use_io_binding=True (default), or use_cache=False, use_io_binding=False.

It doesn't appear to be possible to set use_io_binding=True unless using the with-past (cache) version of the model.
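
In code, the only two combinations the error message allows look roughly like this (a sketch; model_dir is a placeholder):

    from optimum.onnxruntime import ORTModelForSeq2SeqLM

    model_dir = "/home/admin/models/xenova_bart-large-cnn"  # placeholder

    # Only valid if the decoder_with_past ONNX file is present:
    # model = ORTModelForSeq2SeqLM.from_pretrained(model_dir, use_cache=True, use_io_binding=True)

    # Without the with-past decoder (my current setup), io binding has to stay off:
    model = ORTModelForSeq2SeqLM.from_pretrained(model_dir, use_cache=False, use_io_binding=False)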

xenova commented 1 year ago

> Hi @gidzr, can you try using ORTModelForSeq2SeqLM.from_pretrained(..., use_io_binding=True) to initialize the ORT model?
>
> In Xenova/bart-large-cnn there is a quantized and non-quantized model in the same repo (which is probably not loadable in ORTModel actually @xenova). Which one are you using?

Here are the quantization settings I was using (along with which ops were quantized and their weight types): https://huggingface.co/Xenova/bart-large-cnn/blob/main/quantize_config.json.

gidzr commented 1 year ago

Hi @xenova

  1. Apologies for the shocking grammar in my previous explanation; that was a 2am message.
  2. Just to clarify, I've ensured that only one type of model is in the folder when inference is run; I disable the non-quantized files by changing their file extensions/MIME types.
  3. I'm going to try to optimise/export and quantize to ONNX again on my local server, to see whether the issue arises with ONNX generally or specifically with the Xenova version, and I'll put my results here.

Chat soon.

gidzr commented 1 year ago

Hi @xenova and @fxmarty

I wanted to follow up with some testing, since I'd asked a few questions that can be rolled up, and I wanted to get something of substance back to you.

Based on the results, my only remaining question is:
What could cause the home-baked ONNX to go crazy, even after adjusting all the optimization and quantization settings available to the CLI command?

The inference output produced by my home-baked ONNX quantization is incoherent babbling (comparison results at the bottom of the page). A couple of extra weird bugs for Xenova came up as well; see the testing results below.

Observations:

Incoherent results produced by the home-baked ONNX quantization: [{'summary_text': 'The black hole is a space space, a space of space, in the universe, and the space of the universe. In the space, the black holes of the galaxies of the space space. The black holes is a gravitational gravitational energy. The first black hole in the Universe, a black hole, is known to have the energy of a black space. A black hole of the stars, the energy energy of the black space, is a energy.'}]

VS a good summary, produced by the Xenova ONNX: [{'summary_text': 'A black hole is a region of spacetime where gravity is so strong that nothing can escape. The first black hole known was Cygnus X-1, identified by several researchers independently in 1971. The presence of a black hole can be inferred through its interaction with other matter and with electromagnetic radiation..'}]

TESTING RESULTS

A. Model + Tokenizer (not pipeline) Inference

  1. Baseline - local PyTorch: 28 seconds with the Accelerate model, to(device=cpu); summary was coherent.
  2. Repo: "Xenova/bart-large-cnn" using ORTModelFor..: LOTS of junk warnings, or issues with quantization? Lots of these: 2023-09-26 05:56:04.176734927 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.6/self_attn_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model. ...
  3. LOCAL: "/home/xxx/xenova_bart-large-cnn" using ORT pipeline: 20 seconds, summary was coherent.
  4. LOCAL: home-baked ONNX quant of facebook/bart-large-cnn: 24 seconds, summary was a RANDOM mess.

B. Pipeline Inference

  1. Baseline - local PyTorch: 25 seconds with Accelerate; summary was coherent.
  2. Repo: "Xenova/bart-large-cnn" using ORT pipeline: ERRORS Repo id must use alphanumeric chars or '-', '_', '.', '--' and...
  3. LOCAL: "/home/xxx/xenova_bart-large-cnn" using ORT pipeline: 20 seconds, summary was coherent.
  4. LOCAL: home-baked ONNX quant of facebook/bart-large-cnn: 24 seconds, summary was a RANDOM mess.

xenova commented 1 year ago

@gidzr Thanks for the extensive tests! I also had issues with incoherent results with the default quantization settings, and I fixed it by setting per_channel=True and reduce_range=True. I'm not sure if your tests included this, but these should (basically) be the only differences between your model and mine.

See my conversion script for more details on the process (which, of course, uses optimum in the background).
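
Roughly equivalent settings can be applied with optimum's ORTQuantizer directly (a sketch, not the exact convert.py code; the directory, file names, and the avx2 preset are assumptions):

    from optimum.onnxruntime import ORTQuantizer
    from optimum.onnxruntime.configuration import AutoQuantizationConfig

    # Dynamic int8 quantization with per-channel weights and reduced range.
    qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=True, reduce_range=True)

    # Quantize each exported ONNX file of the seq2seq model separately.
    for onnx_file in ["encoder_model.onnx", "decoder_model.onnx"]:
        quantizer = ORTQuantizer.from_pretrained("bart-large-cnn-onnx", file_name=onnx_file)
        quantizer.quantize(save_dir="bart-large-cnn-onnx", quantization_config=qconfig)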

gidzr commented 1 year ago

Hey @xenova, no worries re the testing :) Thanks heaps for the steer on your script with the channel/range settings.

After running various combos, the ultimate winner by knockout is the Xenova convert.py with per_channel=true and reduce_range=true.

It's not just per_channel that was important: when I ran with reduce_range=false, or used the CLI method, which only has the --per_channel option (no reduce_range flag available), all I got was garbage out.

I also discovered that onnxruntime 1.14 doesn't handle the opsets of the quantized models from the Xenova script; only 1.16 does. In an unrelated issue, I was told to drop my version of onnxruntime to try quantizing Llama 2 (https://github.com/huggingface/optimum/issues/1409#issuecomment-1735042118), hence the older version when testing. However, the current results are based on the latest 1.16 version. I'll re-test using 1.15 for Llama.

To wrap things up: what is an opset? I've checked via Google and Bing, and I can't find much on opsets aside from 19 being experimental and below 9 possibly being deprecated. Is it an umbrella concept for chipset support (avx512/arm/etc.) and optimization strategy (O1-O4), or something completely different? What are the definitions of opsets 1-19, so I can select the most appropriate one for my situation?
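
(For context: in ONNX, the opset is the operator-set version of the ONNX spec that the exported graph targets; it is chosen at export time and is separate from CPU-feature presets like avx512 or optimization levels O1-O4. A hedged sketch, with an illustrative opset value:)

    from optimum.exporters.onnx import main_export

    # The opset is the ONNX operator-set version the exported graph targets (value is illustrative).
    main_export(
        model_name_or_path="facebook/bart-large-cnn",
        output="bart-large-cnn-onnx",
        opset=17,
    )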

Many thanks!

RAW RESULTS

CLI with --per_channel "ThereIII I�amam�am�II�I�vamamamII can�amI’namam-I”I�ia of the I can IIamam Iam’I�ima of the time of theIamn’amamThereI� I ThereI canII usually of the space of time of I ready of theamam canI�"

Convert.py with per_channel=true, reduce_range=false "The new year is a good time for a new year to be a good one. The new year can be a great time for the new year. The New Year is a great year for the first year. I'm a little bit of a new Yorker. I was a little of a little girl. I had a little boy. I think that's what the new world is like. I don't know what the world is. is like, but I don’t know what"

Convert.py with per_channel=true, reduce_range=true "A black hole is a region of spacetime where gravity is so strong that nothing can escape. The first black hole known was Cygnus X-1, identified by several researchers independently in 1971. Black holes of stellar mass form when massive stars collapse at the end of their life cycle."

TESTING SCENARIOS

  1. Running the CLI method -> with --per_channel and without -> changing O1 to O4 -> changing avx to avx512_vnni

  2. Running the convert.py method -> --opset 11, 18, and unset -> --per_channel true --reduce_range true -> --per_channel false --reduce_range true -> --per_channel true --reduce_range false

xenova commented 1 year ago

Very interesting 👍 Good to know it aligns with the default settings for transformers.js models (which I chose based on my own small-scale tests).

cc @pcuenca (re: benchmarking efforts)

pcuenca commented 1 year ago

Very interesting indeed! Very surprised to see such huge differences with similar looking configurations.

gidzr commented 1 year ago

@xenova it might be time for the current CLI / inference quantization methods to be retired in favour of the Joshy special!

@xenova, @fxmarty, @pcuenca: Are there any notes, docs, or urls regarding opsets? All I can find is: https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html

gidzr commented 1 year ago

Hey @xenova

I came across an issue that might be related to onnxruntime, or to some other restriction around optimization/quantization.

I downgraded to onnxruntime 1.15, using opsets 16 and 17, to quantize Llama-based models.

The problem: optimization works, but quantization fails (the process gets killed). I thought this was due to memory limitations, but it was occurring for optimized models of 3 GB, a size that wasn't an issue for other architectures. E.g. PY007/TinyLlama-1.1B-Chat-v0.2 will optimize to 3 GB and then fail to quantize. The other Llama models will optimize to 26 GB or 12 GB files, depending on the Llama flavour/repo.

Are there any limitations when quantizing Llama models with the Xenova convert.py script?

Cheers

xenova commented 1 year ago

> Are there any limitations when quantizing Llama models with the Xenova convert.py script?

It should act in the same way as performing the two steps with optimum separately: (1) conversion, followed by (2) quantization. So, nothing extra in terms of memory usage.
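
A minimal sketch of those two steps done separately with optimum (model id, output paths, and the exported ONNX file name are illustrative and depend on the optimum version):

    from optimum.onnxruntime import ORTModelForCausalLM, ORTQuantizer
    from optimum.onnxruntime.configuration import AutoQuantizationConfig

    # Step 1: export the PyTorch checkpoint to ONNX.
    model = ORTModelForCausalLM.from_pretrained(
        "PY007/TinyLlama-1.1B-Chat-v0.2", export=True, use_cache=False
    )
    model.save_pretrained("tinyllama-onnx")

    # Step 2: dynamic int8 quantization of the exported decoder
    # (the file may be decoder_model.onnx or model.onnx depending on the optimum version).
    qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=True, reduce_range=True)
    quantizer = ORTQuantizer.from_pretrained("tinyllama-onnx", file_name="decoder_model.onnx")
    quantizer.quantize(save_dir="tinyllama-onnx-quantized", quantization_config=qconfig)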

> The problem: optimization works, but quantization fails (the process gets killed). I thought this was due to memory limitations, but it was occurring for optimized models of 3 GB

Could you share the id of the model you are trying to convert? Are the weights stored as 4, 8, 16, or 32-bit? I do know that the quantization process is quite memory-hungry, and also depends on some other quantization settings.

gidzr commented 1 year ago

@xenova

Sure thing, these are the conditions for the testing:

Environment

In all cases, I'm running a CPU-only server: 8 cores, 32 GB RAM, 580 GB SSD, 60 GB swap, Debian 11, latest Apache. I also originally tested with onnxruntime 1.15 and opset 17, but the latest onnxruntime 1.16.1 with opset 18 now works with Llama; I re-tested it with the same results as the 1.15/opset-17 condition.

Outline of Steps

  1. The Xenova / Joshy special script successfully optimized and exported these Llama-based models to ONNX files (going from a 12 GB PyTorch size to a 26 GB optimized ONNX size).
  2. The second half of the Xenova script, which quantizes, failed.
  3. I then continued the process using the CLI ONNX Runtime script to quantize the optimized ONNX models, and this was successful, in most cases reducing a 26 GB optimized ONNX file to a 6 GB quantized ONNX file.
  4. Mostly these were Llama-based, but not solely; either size, architecture, or both may have played a role.

Script settings and results

Models tested, using both the "causal-lm" and "default" settings, with per_channel and reduce_range = true (per the above discussion):

'codellama/CodeLlama-7b-Instruct-hf', // CLI quanted to a 6G file
'togethercomputer/Llama-2-7B-32K-Instruct', // CLI quanted to a 6G file
'mosaicml/mpt-7b-instruct', // CLI quanted to a 6G file
'togethercomputer/RedPajama-INCITE-7B-Instruct', // CLI quanted to a 6G file
'VMware/open-llama-7b-v2-open-instruct', // CLI quanted to a 6G file
'mlfoundations/open_lm_1B', // CLI quanted to a 6G file
'PY007/TinyLlama-1.1B-Chat-v0.2', // CLI quanted to a 1G file
'microsoft/DialoGPT-large', // CLI quanted to a <1G file

Cheers