Support `StaticCache` in assisted generation

huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

https://huggingface.co/transformers

Apache License 2.0


Support `StaticCache` in assisted generation #32946

Open gante opened 2 months ago

gante commented 2 months ago

Looking for contributions!

Assisted generation (or speculative decoding) is a strategy to speed up generation. Using `StaticCache` together with `torch.compile` is another. Currently, the two are not compatible. It would be nice to be able to use both at the same time, for maximum speed 😎
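
For context, here is a minimal sketch of one greedy assisted-generation step. It assumes toy deterministic "models" written as plain Python functions that map a token sequence to the next token. The names `assisted_step`, `toy_draft` and `toy_target` are hypothetical, not Transformers APIs. In a real implementation the main model scores all drafted positions in a single forward pass; the loop below only simulates that.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def assisted_step(prefix: List[int], draft_next: NextToken,
                  target_next: NextToken, num_draft: int = 4) -> List[int]:
    """Return the tokens appended to `prefix` after one greedy assisted step."""
    # 1. The (small) assistant drafts `num_draft` tokens autoregressively.
    draft: List[int] = []
    for _ in range(num_draft):
        draft.append(draft_next(prefix + draft))

    # 2. The main model checks each drafted position.
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if expected != tok:
            # First mismatch: keep the main model's token, reject the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched: the main model also yields one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def toy_target(seq: List[int]) -> int:
    # The "main model" always counts upward.
    return seq[-1] + 1


def toy_draft(seq: List[int]) -> int:
    # The "assistant" agrees with the main model except after token 2.
    return seq[-1] + (5 if seq[-1] == 2 else 1)


print(assisted_step([0], toy_draft, toy_target))  # [1, 2, 3]
```

Here the assistant drafts `[1, 2, 7, 8]`. Only `1, 2` are accepted and `3` comes from the main model. Both models have already processed the rejected drafts `7, 8`, so those entries have to be dropped from their caches before the next step.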

In a nutshell, assisted generation has to remove the entries for rejected candidate tokens from the models' caches. `StaticCache` doesn't implement the functions needed to do this.
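
One way to picture the missing piece is a toy static cache. Its buffers are preallocated to a fixed `max_cache_len` and never resized, so rejecting tokens means overwriting the tail in place and lowering a valid-length counter, not slicing the buffers shorter. The class `ToyStaticCache` and its `update`/`crop` methods are hypothetical names for illustration. They use plain Python lists in place of tensors and are not the `transformers.StaticCache` API.

```python
from typing import List


class ToyStaticCache:
    """Toy stand-in for a static KV cache: fixed-size buffers, never resized.

    Hypothetical illustration only; not the transformers.StaticCache API.
    """

    def __init__(self, num_layers: int, max_cache_len: int) -> None:
        self.max_cache_len = max_cache_len
        # One fixed-length buffer per layer; 0.0 marks an empty slot.
        self.keys = [[0.0] * max_cache_len for _ in range(num_layers)]
        self.values = [[0.0] * max_cache_len for _ in range(num_layers)]
        # Number of valid (written) positions in each layer.
        self.lengths = [0] * num_layers

    def update(self, layer: int, keys: List[float], values: List[float]) -> None:
        """Write new entries in place, right after the current valid positions."""
        start = self.lengths[layer]
        end = start + len(keys)
        if end > self.max_cache_len:
            raise ValueError("update would exceed max_cache_len")
        # Equal-length slice assignment: the buffer length does not change.
        self.keys[layer][start:end] = keys
        self.values[layer][start:end] = values
        self.lengths[layer] = end

    def crop(self, max_length: int) -> None:
        """Discard every position at index >= max_length, keeping buffer sizes."""
        for layer, n in enumerate(self.lengths):
            if n > max_length:
                # Reset the rejected slots in place instead of slicing them off,
                # so the buffer length (the "shape") stays constant.
                self.keys[layer][max_length:n] = [0.0] * (n - max_length)
                self.values[layer][max_length:n] = [0.0] * (n - max_length)
                self.lengths[layer] = max_length


cache = ToyStaticCache(num_layers=2, max_cache_len=8)
for layer in range(2):
    cache.update(layer, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # 3 prompt tokens
    cache.update(layer, [4.0, 5.0, 6.0, 7.0], [4.0, 5.0, 6.0, 7.0])  # 4 drafts
# Only 2 drafts were accepted: keep 3 prompt + 2 draft positions.
cache.crop(5)
print(cache.lengths, len(cache.keys[0]))  # [5, 5] 8
```

The design choice here is that cropping never changes a buffer's size, which is what keeps shapes static for `torch.compile`. A tensor-based version would presumably do the equivalent in-place reset on the preallocated key/value tensors and also keep the attention mask and cache positions consistent. That is an assumption about how a fix could look, not an approach settled on in this issue.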

csking101 commented 1 day ago

Hi, is this issue still open?