If a model is detected to have a KV cache, the default behavior on the DeepSparse engine is to run with the internal KV cache enabled.
Also adds the argument --disable-kv-cache-overrides to skip any KV cache updates to the model (this addresses the previous requirement that sequence_length be set).
Example:
With internal cache
deepsparse.benchmark zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/mpt_chat/base-none
Throughput (items/sec): 2.6370
With external cache
deepsparse.benchmark zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/mpt_chat/base-none --no-internal-kv-cache
Throughput (items/sec): 2.2022
With no model edits
deepsparse.benchmark zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/mpt_chat/base-none --disable-kv-cache-overrides
Throughput (items/sec): 2.1987
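To put the three throughput figures on a common scale, here is a minimal sketch that parses the reported lines and expresses each run relative to the external-cache run. The helper names (`parse_throughput`, `relative_speedups`) and the choice of baseline are illustrative assumptions, not part of deepsparse; only the throughput values come from the runs above.

```python
import re

# Throughput lines copied from the three benchmark runs in this description.
RESULTS = {
    "internal cache": "Throughput (items/sec): 2.6370",
    "external cache": "Throughput (items/sec): 2.2022",
    "no model edits": "Throughput (items/sec): 2.1987",
}


def parse_throughput(line: str) -> float:
    """Extract the items/sec value from a benchmark output line (hypothetical helper)."""
    match = re.search(r"Throughput \(items/sec\):\s*([0-9.]+)", line)
    if match is None:
        raise ValueError(f"no throughput found in: {line!r}")
    return float(match.group(1))


def relative_speedups(results: dict, baseline: str) -> dict:
    """Return each run's throughput divided by the baseline run's throughput."""
    base = parse_throughput(results[baseline])
    return {name: parse_throughput(line) / base for name, line in results.items()}


# Assumption: the external-cache run is used as the reference point.
speedups = relative_speedups(RESULTS, baseline="external cache")
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.3f}x")
```

On these numbers the internal cache comes out roughly 20% faster than the external cache, while running with no model edits is within a fraction of a percent of the external-cache run.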