Assisted generation (or speculative decoding) is a strategy to speed up generation. Using StaticCache and torch.compile is another strategy to speed up generation. Currently, the two are not compatible. It would be nice to be able to use both at the same time, for maximum speed 😎
In a nutshell, assisted generation has to clear the models' caches of the entries for candidate tokens that were rejected. StaticCache does not yet implement the functions needed to do this.
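To make the missing piece concrete, here is a minimal pure-Python sketch of the rollback a fixed-size cache would need. `ToyStaticCache` and its `crop` method are hypothetical names for illustration, not the transformers API, and the string entries stand in for key/value tensors. The point is that rejected entries are invalidated in place, so the buffer keeps its preallocated size.

```python
class ToyStaticCache:
    """Hypothetical fixed-size buffer standing in for a preallocated KV cache."""

    def __init__(self, max_len):
        self.max_len = max_len
        self.slots = [None] * max_len  # allocated once, never resized
        self.length = 0                # number of valid entries

    def append(self, entries):
        """Write new entries into the next free slots."""
        if self.length + len(entries) > self.max_len:
            raise ValueError("cache is full")
        for entry in entries:
            self.slots[self.length] = entry
            self.length += 1

    def crop(self, new_length):
        """Discard entries past new_length without changing the buffer size."""
        if not 0 <= new_length <= self.length:
            raise ValueError("new_length out of range")
        for i in range(new_length, self.length):
            self.slots[i] = None
        self.length = new_length


cache = ToyStaticCache(max_len=8)
cache.append(["p0", "p1", "p2"])        # prompt tokens
cache.append(["d0", "d1", "d2", "d3"])  # candidate tokens from the assistant
num_accepted = 2                        # suppose the main model accepts d0 and d1
cache.crop(3 + num_accepted)            # roll back the rejected d2 and d3
```

A real implementation would presumably also have to keep tensor shapes constant so that torch.compile does not recompile, which this sketch only mimics by never resizing the list.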
Looking for contributions!