huggingface / optimum-benchmark

🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of Optimum's hardware optimizations & quantization schemes.
Apache License 2.0

Getting negative throughput value for large batch sizes #128

Closed · mgiessing closed this issue 8 months ago

mgiessing commented 8 months ago

Hi, I'm using optimum-benchmark with the onnxruntime backend (CPU), but for larger batch sizes I get negative values, which doesn't seem correct. Is this a known bug?

This is the param setup:

[...]
  sweeper:
    params:
      benchmark.input_shapes.batch_size: 1,2,4,8,16,32,64
      model: /data/LLMs/onnx/llama-2-7b_onnx,/data/LLMs/onnx/llama-2-7b_onnx_quant

E.g., for the llama-2 ONNX models with onnxruntime (there are always two runs for each batch size: one for fp32, the other for int8):

for i in {1..13}; do
    cat "experiments/llama-2-7b_ort/$i/benchmark_report.json" | jq '.batch_size, .decode.throughput."mean[tokens/s]"'
done

1
10.64650340356247
2
12.258327597687689
2
22.51221913908261
4
26.243146016015586
4
46.47543946736317
8
56.99298854476616
8
102.96338019358438
16
156.63038758907555
16
284.71610900884076
32
668.620966745494
32
2320.323920074327
64
-744.356127863481
64
-779.9233473188091
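
For reference, here is a small Python sketch that pairs each run's batch size with its mean decode throughput, assuming the same directory layout and report keys used in the jq command above:

import json
from pathlib import Path

# Pair each run's batch size with its mean decode throughput, mirroring the jq query above.
rows = []
for i in range(1, 14):  # same run indices as the shell loop
    report = json.loads(Path(f"experiments/llama-2-7b_ort/{i}/benchmark_report.json").read_text())
    rows.append((report["batch_size"], report["decode"]["throughput"]["mean[tokens/s]"]))

print(f"{'batch_size':>10}  {'decode tokens/s':>16}")
for batch_size, throughput in sorted(rows):
    print(f"{batch_size:>10}  {throughput:>16.2f}")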

As you can see, for bs=64 the numbers are negative. Is this something you can confirm/reproduce with onnxruntime?

Thanks!

IlyasMoutawwakil commented 8 months ago

Hi! I can't reproduce it. Can you please share more of the experiment config, the benchmark report, and the CLI log? Is this using the main branch? Can you try running with 0.0.1, to see if it's caused by one of my new PRs?

mgiessing commented 8 months ago

I was using the main branch. My problem with version 0.0.1 is that it doesn't seem to know any of the keys in the backend section (such as export: false, which I use because I'm running "pre-converted" models, since I don't have a lot of space in my root partition).

I'll try to update the benchmark config so it runs with 0.0.1.

If you want to replicate my setup, I did the following (let's use TinyLlama as it is much smaller but has the same "negative throughput" issue):

Use Hugging Face Optimum to export the model to ONNX:

optimum-cli export onnx --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 /data/LLMs/onnx/tinyllama_onnx/
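
Equivalently, the export can be done from Python with Optimum's ONNX Runtime classes (just a sketch; the CLI call above is what was actually used):

from optimum.onnxruntime import ORTModelForCausalLM

# Export the PyTorch checkpoint to ONNX and save it to the pre-converted model directory.
model = ORTModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", export=True)
model.save_pretrained("/data/LLMs/onnx/tinyllama_onnx/")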

Then use this benchmark config (I changed the batch size to only high numbers and the model to TinyLlama):

defaults:
  - backend: onnxruntime # default backend
  - launcher: process # default launcher
  - benchmark: inference # default benchmark
  - experiment # inheriting experiment schema
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

experiment_name: tinyllama_ort
model: tinyllama-model

backend:
  device: cpu
  export: false
  task: text-generation
  library: transformers

benchmark:
  memory: true
  warmup_runs: 10
  input_shapes:
    batch_size: 1
    sequence_length: 256
  new_tokens: 512

# hydra/cli specific settings
hydra:
  run:
    # where to store run results
    dir: experiments/${experiment_name}
  sweep:
    # where to store sweep results
    dir: experiments/${experiment_name}
  job:
    # change working directory to the run directory
    chdir: true
    env_set:
      # set environment variable OVERRIDE_BENCHMARKS to 1
      # to not skip benchmarks that have been run before
      OVERRIDE_BENCHMARKS: 1
  sweeper:
    params:
      benchmark.input_shapes.batch_size: 64,128
      model: /data/LLMs/onnx/tinyllama_onnx

Run with

optimum-benchmark --config-dir benchmark_configs/ --config-name tinyllama_ort --multirun
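
Once the multirun finishes, a quick way to spot the problem is to scan every run's report for a negative decode throughput (a sketch, assuming the per-run layout and report keys shown in the first comment):

import json
from pathlib import Path

# Flag any sweep run whose mean decode throughput came out negative.
for report_path in sorted(Path("experiments/tinyllama_ort").glob("*/benchmark_report.json")):
    report = json.loads(report_path.read_text())
    throughput = report["decode"]["throughput"]["mean[tokens/s]"]
    if throughput < 0:
        print(f"run {report_path.parent.name}: negative decode throughput ({throughput:.2f} tokens/s)")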

IlyasMoutawwakil commented 8 months ago

@mgiessing thanks, I was able to reproduce it with smaller values for the sequence length and new tokens. It seems that for some reason generate takes less time than forward, which doesn't make sense:

[PROC-0][2024-02-19 17:25:34,014][latency][INFO] -              + forward latency: 6.503192 ± 2 x 0.068859 (s)
[PROC-0][2024-02-19 17:25:44,944][latency][INFO] -              + generate latency: 1.214331 ± 2 x 0.003735 (s)

The decode latency is computed from these two measurements (prefill = forward, decode = generate - forward), so when generate is faster than forward the decode latency, and therefore the decode throughput, turns negative. I'll investigate what's happening here.
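
Plugging in the logged numbers makes the sign flip concrete (a sketch; the exact token count optimum-benchmark divides by is not shown in this thread, so new_tokens here is only illustrative):

# Latencies from the log above, in seconds
forward_latency = 6.503192    # prefill measurement
generate_latency = 1.214331   # full generate measurement

# decode = generate - forward, as described above
decode_latency = generate_latency - forward_latency  # -5.288861 s
new_tokens = 512                                      # illustrative value only
decode_throughput = new_tokens / decode_latency       # about -96.8 tokens/s
print(decode_latency, decode_throughput)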

IlyasMoutawwakil commented 8 months ago

Okay, found it. I see why it's only happening with onnxruntime: I was filtering some inputs to avoid passing position_ids to models that don't require them (causal LMs are a mess), but other inputs got filtered out accidentally, and since generate can be run with empty inputs it didn't raise any errors (it just decodes starting from a BOS token). Thanks a lot for spotting this! I'll fix it ASAP and add some sanity checks to avoid this ever occurring.
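
For illustration, a minimal sketch of this kind of name-based input filtering against an ONNX Runtime session, with the sanity check that was missing (not optimum-benchmark's actual code):

import onnxruntime as ort

def filter_inputs(inputs: dict, session: ort.InferenceSession) -> dict:
    # Keep only the inputs the ONNX model actually declares.
    expected = {node.name for node in session.get_inputs()}
    filtered = {name: value for name, value in inputs.items() if name in expected}
    dropped = set(inputs) - set(filtered)
    if dropped:
        print(f"Removed inputs not expected by the model: {sorted(dropped)}")
    if not filtered:
        # generate() runs happily on empty inputs (it just decodes from a BOS token),
        # so an empty dict must be treated as an error instead of being passed through.
        raise ValueError("All inputs were filtered out; the filtering logic is too aggressive.")
    return filtered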

mgiessing commented 8 months ago

Ah amazing - thanks for your quick investigation and finding! That is really appreciated :)

IlyasMoutawwakil commented 8 months ago

@mgiessing fixed in #129 🤗 I also added some logging to notify users when an input is generated but then removed for a specific backend:

[08:35:31][test-cli][INFO] - [PROC-0][2024-02-20 08:35:31,920][onnxruntime][WARNING] - Input position_ids is not in expected inputs names. Removing it.

mgiessing commented 8 months ago

Thanks a lot @IlyasMoutawwakil - much appreciated & everything works fine now :)