neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

Add `--num-kv-cache-tokens` to benchmarking scripts #1513

Closed mgoin closed 9 months ago

mgoin commented 9 months ago

It will default to 1, or to `NM_BENCHMARK_KV_TOKENS` if that environment variable is set.

> deepsparse.benchmark hf:mgoin/TinyStories-1M-ds --num-kv-cache-tokens 1 -ncores 1 -q
Fetching 11 files: 100%|██████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 15666.33it/s]
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.0.20231210 COMMUNITY | (99472380) (release) (optimized) (system=neon, binary=neon)
Original Model Path: hf:mgoin/TinyStories-1M-ds
Batch Size: 1
Scenario: sync
Throughput (items/sec): 1158.0933
Latency Mean (ms/batch): 0.8611
Latency Median (ms/batch): 0.8375
Latency Std (ms/batch): 0.0655
Iterations: 11581

> deepsparse.benchmark hf:mgoin/TinyStories-1M-ds --num-kv-cache-tokens 1000 -ncores 1 -q
Fetching 11 files: 100%|██████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 42134.56it/s]
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.0.20231210 COMMUNITY | (99472380) (release) (optimized) (system=neon, binary=neon)
Original Model Path: hf:mgoin/TinyStories-1M-ds
Batch Size: 1
Scenario: sync
Throughput (items/sec): 922.7386
Latency Mean (ms/batch): 1.0813
Latency Median (ms/batch): 1.0664
Latency Std (ms/batch): 0.0502
Iterations: 9228
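
For illustration, here is a minimal sketch of how a CLI flag with an environment-variable fallback like this can be wired up with argparse; the helper name is hypothetical and not taken from the DeepSparse source:

```python
import argparse
import os


def add_kv_cache_arg(parser: argparse.ArgumentParser) -> None:
    # Fall back to the NM_BENCHMARK_KV_TOKENS environment variable, then to 1,
    # when --num-kv-cache-tokens is not passed on the command line.
    default_tokens = int(os.environ.get("NM_BENCHMARK_KV_TOKENS", 1))
    parser.add_argument(
        "--num-kv-cache-tokens",
        type=int,
        default=default_tokens,
        help="Number of previous tokens to pre-fill in the KV cache",
    )


parser = argparse.ArgumentParser()
add_kv_cache_arg(parser)
args = parser.parse_args(["--num-kv-cache-tokens", "1000"])
print(args.num_kv_cache_tokens)  # 1000
```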
tlrmchlsmth commented 9 months ago

Looks good to me. I think it would be nice to clarify for the user what the valid values of `num-kv-cache-tokens` are. It's the number of previous tokens in the cache, so it must be between 0 and `context_length - prompt_processing_length`.
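
A minimal sketch of that bound expressed as a validation check; the function and parameter names here are hypothetical, not part of the DeepSparse API:

```python
def validate_num_kv_cache_tokens(
    num_kv_cache_tokens: int,
    context_length: int,
    prompt_processing_length: int,
) -> None:
    # The cache holds previously seen tokens, so the valid range is
    # [0, context_length - prompt_processing_length].
    max_tokens = context_length - prompt_processing_length
    if not 0 <= num_kv_cache_tokens <= max_tokens:
        raise ValueError(
            f"--num-kv-cache-tokens must be between 0 and {max_tokens}, "
            f"got {num_kv_cache_tokens}"
        )
```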