neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

[Text Generation] Terminate the inference when kv cache is full #1446

Closed · dbogunowicz closed 9 months ago

dbogunowicz commented 9 months ago

Feature Description

Once the KV cache is full, instead of continuing inference by evicting old cache entries to make room for new ones, we now terminate the inference with the finish reason "capacity".
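For callers, the truncation is detectable through the finish reason. A minimal sketch of handling it, using the same pipeline as in the manual test below (retrying with a larger sequence_length is an illustrative recovery strategy and an assumption, not part of this change):

from deepsparse import Pipeline

prompt = "James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?"
model_path = "zoo:llama2-7b-gsm8k_llama2_pretrain-pruned80_quantized"

pipeline = Pipeline.create(task="text-generation", model_path=model_path, sequence_length=64)
out = pipeline(prompt=prompt)
generation = out.generations[0]

if generation.finished_reason == "capacity":
    # The KV cache filled before a natural stop; generation.text holds
    # everything produced up to that point. One option (an assumption,
    # not prescribed by this change) is to retry with a larger cache.
    pipeline = Pipeline.create(task="text-generation", model_path=model_path, sequence_length=256)
    out = pipeline(prompt=prompt)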

Manual Testing

from deepsparse import Pipeline

prompt = "James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?"
model_path = "zoo:llama2-7b-gsm8k_llama2_pretrain-pruned80_quantized"

# A short sequence_length keeps the KV cache small, so it fills up
# before the model finishes answering.
pipeline = Pipeline.create(task="text-generation", model_path=model_path, sequence_length=64)
out = pipeline(prompt=prompt)

Before:

Displaying out.generations[0].text and out.generations[0].finished_reason:

text='He runs 60*3=<<60*3=180>>180 meters in total per sprint,  Comays 5  \nen3 was 2= a en6ound'
finished=True, finished_reason='max_new_tokens'

Now:

text='He runs 60*3=<<60*3=180>>180 meters in total per s'
finished=True, finished_reason='capacity'
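Continuing from the snippet above, a quick sanity check on the new behavior (a sketch: the pipeline.tokenizer attribute is assumed from the underlying transformers pipeline, and exact counts vary with special tokens):

generation = out.generations[0]
assert generation.finished
assert generation.finished_reason == "capacity"

# With sequence_length=64, prompt tokens plus generated tokens should
# land at roughly the KV cache capacity, since generation now stops
# the moment the cache is full rather than evicting old entries.
n_prompt = len(pipeline.tokenizer(prompt)["input_ids"])
n_generated = len(pipeline.tokenizer(generation.text)["input_ids"])
print(n_prompt + n_generated)  # expected to be close to 64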