Feature Description
Once the KV cache is full, instead of continuing the inference by evicting the oldest cache entries to make room for new ones, we now terminate the inference with the finish reason "capacity".
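As a rough illustration of the new behavior, a minimal sketch of the stopping decision might look like the following. This is not the actual DeepSparse internals; the FinishReason enum and the capacity check shown here are assumptions modeled on the behavior described above.

from enum import Enum

class FinishReason(Enum):
    MAX_NEW_TOKENS = "max_new_tokens"
    CAPACITY = "capacity"

def check_finish(cache_len, cache_capacity, tokens_generated, max_new_tokens):
    """Return a FinishReason if generation should stop, else None."""
    if cache_len >= cache_capacity:
        # Previously the oldest cache entries were evicted here and
        # generation continued, degrading output quality (see "Before"
        # below). Now generation terminates cleanly instead.
        return FinishReason.CAPACITY
    if tokens_generated >= max_new_tokens:
        return FinishReason.MAX_NEW_TOKENS
    return None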
Manual Testing
from deepsparse import Pipeline

prompt = "James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?"
model_path = "zoo:llama2-7b-gsm8k_llama2_pretrain-pruned80_quantized"

# Deliberately small sequence length so the KV cache fills up mid-generation
pipeline = Pipeline.create(task="text-generation", model_path=model_path, sequence_length=64)
out = pipeline(prompt=prompt)
Before:
Displaying out.generations[0].text and out.generations[0].finished_reason:
text='He runs 60*3=<<60*3=180>>180 meters in total per sprint, Comays 5 \nen3 was 2= a en6ound'
finished=True, finished_reason='max_new_tokens'
Now:
text='He runs 60*3=<<60*3=180>>180 meters in total per s'
finished=True, finished_reason='capacity'
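Callers can branch on the new finish reason, for example to retry with a longer context window. The sketch below is a suggested usage pattern, not part of this change; it reuses prompt, model_path, and the Pipeline.create call from the snippet above, and the sequence_length of 256 is an arbitrary choice.

generation = out.generations[0]
if generation.finished_reason == "capacity":
    # The KV cache filled before generation finished; rerun with a
    # larger context window to get the complete answer.
    pipeline = Pipeline.create(
        task="text-generation",
        model_path=model_path,
        sequence_length=256,  # assumption: large enough for this prompt
    )
    out = pipeline(prompt=prompt)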