neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

[Pipeline Refactor][Text-Generation] Simplify `DecoderKVCache` #1370

Closed dbogunowicz closed 9 months ago

dbogunowicz commented 10 months ago

Feature Description

In v2, the TextGenerationPipeline can no longer process prompts of arbitrary length. Previously, once the cache buffer had been filled, we would start discarding the oldest cache entries to make room for new ones (the "sliding window" principle). After the refactor, the buffer is capped at a fixed length. Once we go past that length, we raise a runtime error informing the user that the context-size limit has been reached (similar to what is done in HF pipelines).

The new logic is much simpler, so it also made sense to simplify the code.
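For illustration, a minimal sketch of the capped-buffer behavior described above. The class name mirrors DecoderKVCache, but `capacity`, `total_num_processed_tokens`, and `update` are hypothetical names chosen for the sketch, not the exact attributes touched by this PR:

class DecoderKVCache:
    # Simplified sketch: a kv_cache buffer capped at a fixed sequence length.

    def __init__(self, capacity: int):
        self.capacity = capacity  # maximum number of tokens the buffer can hold
        self.total_num_processed_tokens = 0

    def update(self, num_new_tokens: int) -> None:
        # Old behavior (removed): evict the oldest entries to make room
        # ("sliding window"). New behavior: refuse to grow past capacity.
        if self.total_num_processed_tokens + num_new_tokens > self.capacity:
            raise RuntimeError(
                "The kv_cache buffer is full. To increase the buffer size, "
                "increase the value of the `sequence_length` argument of the operator."
            )
        self.total_num_processed_tokens += num_new_tokens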

Manual Testing

from deepsparse.v2.text_generation.pipeline import TextGenerationPipeline

# Default sequence_length: the prompt and generation fit within the cache buffer
pipeline = TextGenerationPipeline(model_path="hf:mgoin/TinyStories-1M-deepsparse", internal_kv_cache=False)
print(pipeline(prompt=["Who is the president of the USA?"]))

# sequence_length=32 caps the kv_cache buffer; generation overflows it and raises
pipeline = TextGenerationPipeline(model_path="hf:mgoin/TinyStories-1M-deepsparse", sequence_length=32, internal_kv_cache=False)
print(pipeline(prompt=["Who is the president of the USA?"]))

Output:

/home/ubuntu/damian/deepsparse_venv/bin/python /home/ubuntu/damian/deepsparse/hehe.py 
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 63990.77it/s]
2023-11-07 12:05:16 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/.cache/huggingface/hub/models--mgoin--TinyStories-1M-deepsparse/snapshots/ca4ce12f6093b31f6c3f1e398f4b04b113e26bb7/model.onnx
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20231102 COMMUNITY | (7714bc73) (release) (optimized) (system=avx2, binary=avx2)
2023-11-07 12:05:17 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/.cache/huggingface/hub/models--mgoin--TinyStories-1M-deepsparse/snapshots/ca4ce12f6093b31f6c3f1e398f4b04b113e26bb7/model.onnx
2023-11-07 12:05:18 deepsparse.v2.text_generation.prep_for_prefill WARNING  This operator requires the PipelineState to be set-up with the cache_shape, output_names, kv_cache_data_type attributes to be set from the NLEngineOperator
2023-11-07 12:05:18 deepsparse.v2.text_generation.multi_engine_prefill_operator WARNING  This operator requires the PipelineState to be set-up with the onnx_input_names_no_cache attribute set from the NLEngineOperator.
2023-11-07 12:05:18 deepsparse.v2.text_generation.autoregressive_preprocess_operator WARNING  This operator requires the PipelineState to be set-up with the onnx_input_names_no_cache attribute set from the NLEngineOperator.
created=datetime.datetime(2023, 11, 7, 12, 5, 19, 74616) prompts=['Who is the president of the USA?'] generations=[GeneratedText(text='\n\nThe man was very surprised and he said, "I\'m sorry, I didn\'t know. I just wanted to be your friend."\n\nThe man smiled and said, "I\'m sorry, but I\'m glad you are safe. I\'m glad you\'re safe."\n\nThe man smiled and said, "I\'m glad you\'re safe. I\'m glad you\'re safe."\n', score=None, finished=True, finished_reason='stop')]
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 17260.51it/s]
2023-11-07 12:05:19 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/.cache/huggingface/hub/models--mgoin--TinyStories-1M-deepsparse/snapshots/ca4ce12f6093b31f6c3f1e398f4b04b113e26bb7/model.onnx
2023-11-07 12:05:20 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/ubuntu/.cache/huggingface/hub/models--mgoin--TinyStories-1M-deepsparse/snapshots/ca4ce12f6093b31f6c3f1e398f4b04b113e26bb7/model.onnx
2023-11-07 12:05:21 deepsparse.v2.text_generation.prep_for_prefill WARNING  This operator requires the PipelineState to be set-up with the cache_shape, output_names, kv_cache_data_type attributes to be set from the NLEngineOperator
2023-11-07 12:05:21 deepsparse.v2.text_generation.multi_engine_prefill_operator WARNING  This operator requires the PipelineState to be set-up with the onnx_input_names_no_cache attribute set from the NLEngineOperator.
2023-11-07 12:05:21 deepsparse.v2.text_generation.autoregressive_preprocess_operator WARNING  This operator requires the PipelineState to be set-up with the onnx_input_names_no_cache attribute set from the NLEngineOperator.
Traceback (most recent call last):
  File "/home/ubuntu/damian/deepsparse/hehe.py", line 8, in <module>
    print(pipeline(prompt=["Who is the president of the USA?"]))
  File "/home/ubuntu/damian/deepsparse/src/deepsparse/v2/pipeline.py", line 137, in __call__
    return self.run(*args, **kwargs)
  File "/home/ubuntu/damian/deepsparse/src/deepsparse/v2/pipeline.py", line 107, in run
    operator_output = output_future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ubuntu/damian/deepsparse/src/deepsparse/v2/operators/operator.py", line 90, in __call__
    run_output = self.run(
  File "/home/ubuntu/damian/deepsparse/src/deepsparse/v2/text_generation/nl_engine_operator.py", line 117, in run
    self._update_kv_cache(
  File "/home/ubuntu/damian/deepsparse/src/deepsparse/v2/text_generation/nl_engine_operator.py", line 155, in _update_kv_cache
    raise RuntimeError(
RuntimeError: The kv_cache buffer is full. To increase the buffer size, increase the value of the `sequence_length` argument of the operator.

This functionality should additionally be tested, preferably with unit tests.
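As a starting point, such a test could look roughly like the following. It reuses the pipeline arguments from the manual test above; the exact error-message match is taken from the traceback:

import pytest

from deepsparse.v2.text_generation.pipeline import TextGenerationPipeline


def test_kv_cache_buffer_overflow_raises():
    # A sequence_length this small should overflow on any non-trivial prompt
    pipeline = TextGenerationPipeline(
        model_path="hf:mgoin/TinyStories-1M-deepsparse",
        sequence_length=32,
        internal_kv_cache=False,
    )
    with pytest.raises(RuntimeError, match="kv_cache buffer is full"):
        pipeline(prompt=["Who is the president of the USA?"])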

dbogunowicz commented 10 months ago

@bfineran I was not aware that there are potentially models that use the rolling-buffer cache. Maybe let's not remove this functionality after all? Ultimately, it should be a product decision.
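For context, the rolling-buffer behavior under discussion could be sketched as follows. This is purely illustrative, not the actual DecoderKVCache implementation, and it assumes a (batch, heads, seq_len, head_dim) tensor layout:

import numpy as np


def sliding_window_update(
    cache: np.ndarray, new_entries: np.ndarray, capacity: int
) -> np.ndarray:
    # Append new entries along the sequence axis (axis 2 in the assumed
    # layout), then drop the oldest entries so the buffer never exceeds
    # `capacity` tokens.
    updated = np.concatenate([cache, new_entries], axis=2)
    return updated[:, :, -capacity:, :]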