Feature Description

In v2, the `TextGenerationPipeline` can no longer process prompts of arbitrary length. Previously, once the cache buffer was full, we would discard the oldest cache entries to make room for new ones (the "sliding window" principle). After the refactor, the buffer is capped at a fixed length. Once we go past that length, we raise a runtime error telling the user that the limits of the context size have been reached (similar to what HF pipelines do).

The new logic is much simpler, so it also made sense to simplify the code.
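For illustration, here is a minimal sketch of the new capped-buffer behavior. The function and argument names are hypothetical; the actual check lives in `_update_kv_cache` in `nl_engine_operator.py` (see the traceback below):

```python
import numpy as np


def update_kv_cache(cache: np.ndarray, new_entries: np.ndarray, sequence_length: int) -> np.ndarray:
    """Hypothetical sketch: grow the cache along the sequence axis and
    raise once the fixed-size buffer would overflow (no sliding window)."""
    if cache.shape[-2] + new_entries.shape[-2] > sequence_length:
        # Mirrors the error raised by `_update_kv_cache` in nl_engine_operator.py.
        raise RuntimeError(
            "The kv_cache buffer is full. To increase the buffer size, "
            "increase the value of the `sequence_length` argument of the operator."
        )
    return np.concatenate([cache, new_entries], axis=-2)
```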
Manual Testing
```python
from deepsparse.v2.text_generation.pipeline import TextGenerationPipeline

# Default sequence length: the prompt and generated tokens fit in the cache buffer.
pipeline = TextGenerationPipeline(model_path="hf:mgoin/TinyStories-1M-deepsparse", internal_kv_cache=False)
print(pipeline(prompt=["Who is the president of the USA?"]))

# Small sequence length: generation overflows the cache buffer and raises a RuntimeError.
pipeline = TextGenerationPipeline(model_path="hf:mgoin/TinyStories-1M-deepsparse", sequence_length=32, internal_kv_cache=False)
print(pipeline(prompt=["Who is the president of the USA?"]))
```
Output:
```
/home/ubuntu/damian/deepsparse_venv/bin/python /home/ubuntu/damian/deepsparse/hehe.py
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 63990.77it/s]
2023-11-07 12:05:16 deepsparse.utils.onnx INFO Overwriting in-place the input shapes of the transformer model at /home/ubuntu/.cache/huggingface/hub/models--mgoin--TinyStories-1M-deepsparse/snapshots/ca4ce12f6093b31f6c3f1e398f4b04b113e26bb7/model.onnx
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20231102 COMMUNITY | (7714bc73) (release) (optimized) (system=avx2, binary=avx2)
2023-11-07 12:05:17 deepsparse.utils.onnx INFO Overwriting in-place the input shapes of the transformer model at /home/ubuntu/.cache/huggingface/hub/models--mgoin--TinyStories-1M-deepsparse/snapshots/ca4ce12f6093b31f6c3f1e398f4b04b113e26bb7/model.onnx
2023-11-07 12:05:18 deepsparse.v2.text_generation.prep_for_prefill WARNING This operator requires the PipelineState to be set-up with the cache_shape, output_names, kv_cache_data_type attributes to be set from the NLEngineOperator
2023-11-07 12:05:18 deepsparse.v2.text_generation.multi_engine_prefill_operator WARNING This operator requires the PipelineState to be set-up with the onnx_input_names_no_cache attribute set from the NLEngineOperator.
2023-11-07 12:05:18 deepsparse.v2.text_generation.autoregressive_preprocess_operator WARNING This operator requires the PipelineState to be set-up with the onnx_input_names_no_cache attribute set from the NLEngineOperator.
created=datetime.datetime(2023, 11, 7, 12, 5, 19, 74616) prompts=['Who is the president of the USA?'] generations=[GeneratedText(text='\n\nThe man was very surprised and he said, "I\'m sorry, I didn\'t know. I just wanted to be your friend."\n\nThe man smiled and said, "I\'m sorry, but I\'m glad you are safe. I\'m glad you\'re safe."\n\nThe man smiled and said, "I\'m glad you\'re safe. I\'m glad you\'re safe."\n', score=None, finished=True, finished_reason='stop')]
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 17260.51it/s]
2023-11-07 12:05:19 deepsparse.utils.onnx INFO Overwriting in-place the input shapes of the transformer model at /home/ubuntu/.cache/huggingface/hub/models--mgoin--TinyStories-1M-deepsparse/snapshots/ca4ce12f6093b31f6c3f1e398f4b04b113e26bb7/model.onnx
2023-11-07 12:05:20 deepsparse.utils.onnx INFO Overwriting in-place the input shapes of the transformer model at /home/ubuntu/.cache/huggingface/hub/models--mgoin--TinyStories-1M-deepsparse/snapshots/ca4ce12f6093b31f6c3f1e398f4b04b113e26bb7/model.onnx
2023-11-07 12:05:21 deepsparse.v2.text_generation.prep_for_prefill WARNING This operator requires the PipelineState to be set-up with the cache_shape, output_names, kv_cache_data_type attributes to be set from the NLEngineOperator
2023-11-07 12:05:21 deepsparse.v2.text_generation.multi_engine_prefill_operator WARNING This operator requires the PipelineState to be set-up with the onnx_input_names_no_cache attribute set from the NLEngineOperator.
2023-11-07 12:05:21 deepsparse.v2.text_generation.autoregressive_preprocess_operator WARNING This operator requires the PipelineState to be set-up with the onnx_input_names_no_cache attribute set from the NLEngineOperator.
Traceback (most recent call last):
File "/home/ubuntu/damian/deepsparse/hehe.py", line 8, in <module>
print(pipeline(prompt=["Who is the president of the USA?"]))
File "/home/ubuntu/damian/deepsparse/src/deepsparse/v2/pipeline.py", line 137, in __call__
return self.run(*args, **kwargs)
File "/home/ubuntu/damian/deepsparse/src/deepsparse/v2/pipeline.py", line 107, in run
operator_output = output_future.result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/ubuntu/damian/deepsparse/src/deepsparse/v2/operators/operator.py", line 90, in __call__
run_output = self.run(
File "/home/ubuntu/damian/deepsparse/src/deepsparse/v2/text_generation/nl_engine_operator.py", line 117, in run
self._update_kv_cache(
File "/home/ubuntu/damian/deepsparse/src/deepsparse/v2/text_generation/nl_engine_operator.py", line 155, in _update_kv_cache
raise RuntimeError(
RuntimeError: The kv_cache buffer is full. To increase the buffer size, increase the value of the `sequence_length` argument of the operator.
```
This functionality should additionally be covered by unit tests.
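As a starting point, here is a hedged pytest sketch that reproduces the manual test above (the test name and structure are assumptions, not an existing test):

```python
import pytest

from deepsparse.v2.text_generation.pipeline import TextGenerationPipeline


def test_kv_cache_overflow_raises_runtime_error():
    # Assumption: sequence_length=32 is small enough that generation
    # overflows the kv_cache buffer for this prompt (as in the manual test).
    pipeline = TextGenerationPipeline(
        model_path="hf:mgoin/TinyStories-1M-deepsparse",
        sequence_length=32,
        internal_kv_cache=False,
    )
    with pytest.raises(RuntimeError, match="kv_cache buffer is full"):
        pipeline(prompt=["Who is the president of the USA?"])
```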
@bfineran I was not aware that there are models that might use the rolling-buffer cache. Maybe we should not remove this functionality after all? Ultimately, it should be a product decision.
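For context, the rolling-buffer ("sliding window") behavior that was removed amounts to evicting the oldest entries instead of raising. A minimal sketch, under the same hypothetical naming as above:

```python
import numpy as np


def sliding_window_update(cache: np.ndarray, new_entries: np.ndarray, sequence_length: int) -> np.ndarray:
    """Hypothetical sketch of the removed behavior: keep only the most
    recent `sequence_length` positions, silently discarding the oldest."""
    combined = np.concatenate([cache, new_entries], axis=-2)
    return combined[..., -sequence_length:, :]
```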