neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

[deepsparse.benchmark] enable internal kv cache by default #1335

Closed bfineran closed 1 year ago

bfineran commented 1 year ago

If a model is detected to have a KV cache, the default behavior on the DeepSparse engine is now to run with the internal KV cache enabled.

Also adds a `--disable-kv-cache-overrides` argument to skip any KV cache model updates (this removes the previous requirement to set `sequence_length`).
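For clarity, here is a minimal Python sketch of the decision this change introduces in the benchmark path; the function name, argument names, and returned keys below are illustrative stand-ins rather than the actual deepsparse internals.

```python
# A minimal sketch (not the actual deepsparse source): given whether a KV cache
# was detected in the model and which CLI flags were passed, decide whether to
# apply KV cache overrides and whether to run with the internal KV cache.

def resolve_kv_cache_settings(
    has_kv_cache: bool,                # result of the benchmark's model inspection
    no_internal_kv_cache: bool,        # --no-internal-kv-cache
    disable_kv_cache_overrides: bool,  # --disable-kv-cache-overrides
) -> dict:
    if disable_kv_cache_overrides:
        # Skip every KV cache model update (no sequence_length requirement);
        # the model is benchmarked exactly as exported.
        return {"apply_overrides": False, "internal_kv_cache": False}
    return {
        "apply_overrides": has_kv_cache,
        # Internal KV cache is now the default whenever a cache is detected,
        # unless the user explicitly opts out.
        "internal_kv_cache": has_kv_cache and not no_internal_kv_cache,
    }


# Example: a KV cache model with default flags runs with the internal cache enabled.
assert resolve_kv_cache_settings(True, False, False)["internal_kv_cache"]
```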

Example (same MPT-7B chat model stub in each case):

With internal KV cache (new default):
deepsparse.benchmark zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/mpt_chat/base-none
Throughput (items/sec): 2.6370

With external KV cache:
deepsparse.benchmark zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/mpt_chat/base-none --no-internal-kv-cache
Throughput (items/sec): 2.2022

With no model edits:
deepsparse.benchmark zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/mpt_chat/base-none --disable-kv-cache-overrides
Throughput (items/sec): 2.1987
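As a usage example, a small script like the one below could reproduce the comparison above by invoking the CLI in each mode and scraping the reported throughput line; it assumes only the `deepsparse.benchmark` command and the flags shown in this PR, and everything else is plain Python.

```python
# Run deepsparse.benchmark in the three KV cache modes shown above and
# extract the "Throughput (items/sec)" line from each run's output.
import re
import subprocess

STUB = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/mpt_chat/base-none"

MODES = {
    "internal kv cache (default)": [],
    "external kv cache": ["--no-internal-kv-cache"],
    "no model edits": ["--disable-kv-cache-overrides"],
}

for name, extra_args in MODES.items():
    result = subprocess.run(
        ["deepsparse.benchmark", STUB, *extra_args],
        capture_output=True,
        text=True,
        check=True,
    )
    match = re.search(r"Throughput \(items/sec\):\s*([\d.]+)", result.stdout)
    throughput = match.group(1) if match else "not found"
    print(f"{name}: {throughput} items/sec")
```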