neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

[Text Generation] Terminate the inference when kv cache is full #1446

Closed · dbogunowicz closed 9 months ago

dbogunowicz commented 9 months ago

Feature Description

Once the KV cache is full, instead of continuing inference by evicting old cache entries to make room for new ones, we now terminate the inference with the finish reason "capacity".
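For callers, the truncation is detectable through the finish reason. A minimal sketch of handling it, using the same pipeline as in the manual test below (retrying with a larger sequence_length is an illustrative recovery strategy and an assumption, not part of this change):

from deepsparse import Pipeline

prompt = "James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?"
model_path = "zoo:llama2-7b-gsm8k_llama2_pretrain-pruned80_quantized"

pipeline = Pipeline.create(task="text-generation", model_path=model_path, sequence_length=64)
out = pipeline(prompt=prompt)
generation = out.generations[0]

if generation.finished_reason == "capacity":
    # The KV cache filled before a natural stop; generation.text holds
    # everything produced up to that point. One option (an assumption,
    # not prescribed by this change) is to retry with a larger cache.
    pipeline = Pipeline.create(task="text-generation", model_path=model_path, sequence_length=256)
    out = pipeline(prompt=prompt)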

Manual Testing

from deepsparse import Pipeline

prompt = "James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?"
model_path = "zoo:llama2-7b-gsm8k_llama2_pretrain-pruned80_quantized"

# A short sequence_length keeps the KV cache small, so it fills up
# before the model finishes answering.
pipeline = Pipeline.create(task="text-generation", model_path=model_path, sequence_length=64)
out = pipeline(prompt=prompt)

Before:

Displaying out.generations[0].text and out.generations[0].finished_reason:

text='He runs 60*3=<<60*3=180>>180 meters in total per sprint,  Comays 5  \nen3 was 2= a en6ound'
finished=True, finished_reason='max_new_tokens'

Now:

text='He runs 60*3=<<60*3=180>>180 meters in total per s'
finished=True, finished_reason='capacity'
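Continuing from the snippet above, a quick sanity check on the new behavior (a sketch: the pipeline.tokenizer attribute is assumed from the underlying transformers pipeline, and exact counts vary with special tokens):

generation = out.generations[0]
assert generation.finished
assert generation.finished_reason == "capacity"

# With sequence_length=64, prompt tokens plus generated tokens should
# land at roughly the KV cache capacity, since generation now stops
# the moment the cache is full rather than evicting old entries.
n_prompt = len(pipeline.tokenizer(prompt)["input_ids"])
n_generated = len(pipeline.tokenizer(generation.text)["input_ids"])
print(n_prompt + n_generated)  # expected to be close to 64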