neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

[TextGeneration] Split up `prep_for_generation` operator, handle edge cases, handle kv_cache full during prefill #1562

Closed: dsikka closed this 5 months ago

dsikka commented 5 months ago

Summary

A series of improvements:

Testing