Hi! I can't reproduce it, can you please share more of the experiment config, the benchmark report and the CLI log?
Is this using the main branch? Can you try running with 0.0.1 to see if it's caused by one of my new PRs?
I was using the main branch. My problem with version 0.0.1 is that it doesn't seem to know any of the keys in the backend section (such as export: false, which I use because I'm running "pre-converted" models, since I don't have a lot of space in my root partition).
I'll try to update the benchmark config so it runs with 0.0.1.
If you want to try to replicate my setup, I did the following (let's use TinyLlama, as it is much smaller but has the same "negative throughput" issue):
optimum-cli export onnx --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 /data/LLMs/onnx/tinyllama_onnx/
using this benchmark config (changed the batch sizes to only high numbers and the model to TinyLlama):
defaults:
  - backend: onnxruntime # default backend
  - launcher: process # default launcher
  - benchmark: inference # default benchmark
  - experiment # inheriting experiment schema
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

experiment_name: tinyllama_ort
model: tinyllama-model

backend:
  device: cpu
  export: false
  task: text-generation
  library: transformers

benchmark:
  memory: true
  warmup_runs: 10
  input_shapes:
    batch_size: 1
    sequence_length: 256
  new_tokens: 512

# hydra/cli specific settings
hydra:
  run:
    # where to store run results
    dir: experiments/${experiment_name}
  sweep:
    # where to store sweep results
    dir: experiments/${experiment_name}
  job:
    # change working directory to the run directory
    chdir: true
    env_set:
      # set environment variable OVERRIDE_BENCHMARKS to 1
      # to not skip benchmarks that have been run before
      OVERRIDE_BENCHMARKS: 1
  sweeper:
    params:
      benchmark.input_shapes.batch_size: 64,128
      model: /data/LLMs/onnx/tinyllama_onnx
Run with
optimum-benchmark --config-dir benchmark_configs/ --config-name tinyllama_ort --multirun
@mgiessing thanks, I was able to reproduce it with smaller values for sequence length and new tokens.
It seems that for some reason generate takes less time than forward, which doesn't make sense:
[PROC-0][2024-02-19 17:25:34,014][latency][INFO] - + forward latency: 6.503192 ± 2 x 0.068859 (s)
[PROC-0][2024-02-19 17:25:44,944][latency][INFO] - + generate latency: 1.214331 ± 2 x 0.003735 (s)
Decoding latency is computed from these two measurements (prefill = forward, decode = generate - forward), so I'll investigate what's happening here.
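For illustration, here is a minimal sketch (not optimum-benchmark's actual code; the function and argument names are made up) of how decoding numbers follow from those two measurements, and why a forward latency larger than the generate latency yields negative results:

```python
# Minimal sketch of the prefill/decode split described above; purely illustrative.
def decode_metrics(forward_latency_s: float, generate_latency_s: float, new_tokens: int):
    prefill_latency = forward_latency_s                       # prefill ~= one forward pass
    decode_latency = generate_latency_s - forward_latency_s   # decode = generate - forward
    decode_throughput = new_tokens / decode_latency           # tokens/s; negative if generate < forward
    return prefill_latency, decode_latency, decode_throughput

# With the latencies from the log above (6.50 s forward, 1.21 s generate, 512 new tokens),
# the decode latency comes out around -5.29 s, which is where the negative throughput comes from.
print(decode_metrics(6.503192, 1.214331, 512))
```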
Okay, found it. I see why it's only happening with onnxruntime: I was filtering some inputs to avoid passing position_ids to models that don't require them (causal LMs are a mess), but other inputs got filtered accidentally, and since generate can be run with empty inputs it didn't raise any errors (it just decodes starting from a bos token).
Thanks a lot for spotting this! I'll fix it asap and add some sanity checks to avoid this ever occurring again.
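For illustration, the input filtering plus the kind of sanity-check logging described here could look roughly like the sketch below; it is not the actual fix, and the function name and expected_input_names are assumptions.

```python
# Illustrative sketch of filtering generated inputs against a backend's expected
# input names, logging each removal so an over-aggressive filter is visible
# instead of silently letting generate() fall back to decoding from the bos token.
import logging

LOGGER = logging.getLogger("onnxruntime")

def filter_inputs(inputs: dict, expected_input_names: set) -> dict:
    filtered = {}
    for name, value in inputs.items():
        if name in expected_input_names:
            filtered[name] = value
        else:
            LOGGER.warning(f"Input {name} is not in expected inputs names. Removing it.")
    return filtered
```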
Ah amazing - thanks for your quick investigation and finding! That is really appreciated :)
@mgiessing fixed in #129 🤗 I also added some logging to notify users when an input is generated but then removed for a specific backend:
[08:35:31][test-cli][INFO] - [PROC-0][2024-02-20 08:35:31,920][onnxruntime][WARNING] - Input position_ids is not in expected inputs names. Removing it.
Thanks a lot @IlyasMoutawwakil - much appreciated & everything works fine now :)
Hi, I'm using optimum-benchmark with the onnxruntime backend (CPU), but for larger batch sizes I get negative values, which doesn't seem correct - is this a known bug?
This is the param setup:
E.g. for llama-2 ONNX models with onnxruntime (there are always 2 runs for each batch size, one for fp32, the other for int8).
As you can see, for bs=64 the numbers are negative - is this something you can confirm/reproduce with onnxruntime?
Thanks!