romsto / Speculative-Decoding

Implementation of the paper Fast Inference from Transformers via Speculative Decoding, Leviathan et al. 2023.
MIT License

Generation using cache gives weird sentences #1

Open romsto opened 2 months ago

romsto commented 2 months ago

While using the cache (`past_key_values`) during speculative decoding, or even plain autoregressive decoding, the generated tokens can come out garbled and nonsensical. Because of this behavior, speculative sampling is slowed down (sometimes even ending up slower than AR decoding).
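For reference, here is a minimal sketch of the usual cache-fed decoding loop, assuming a transformers-style causal LM interface. `decode_with_cache` is a hypothetical helper, not code from this repo; the key point is that once a cache is passed, only the *new* token ids should be fed on subsequent steps (and positions/attention mask must stay consistent), otherwise generations tend to degrade into exactly this kind of nonsense.

```python
import torch

# Hypothetical sketch (not this repo's code): greedy autoregressive decoding
# with a transformers-style model that accepts `past_key_values`.
@torch.no_grad()
def decode_with_cache(model, input_ids, max_new_tokens):
    past = None
    ids = input_ids          # full prompt on the first forward pass only
    generated = input_ids
    for _ in range(max_new_tokens):
        out = model(input_ids=ids, past_key_values=past, use_cache=True)
        past = out.past_key_values
        # greedy pick from the logits of the last position
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        ids = next_token     # feed ONLY the new token once a cache exists
    return generated
```

A common bug producing weird output is re-feeding the full `generated` sequence together with the cache, which double-counts positions; checking that only one token per step reaches the model after the first pass may be worth a look.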

`speculative_generate` edits the cache by pruning the last tokens when a rejection happens. I first thought the errors came from this pruning, but the generation is also weird in `autoregressive_generate`, even though there the cache is never edited or pruned.
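For what it's worth, pruning a legacy-format cache after a rejection can be sketched as below. This is an illustrative helper under the assumption that `past_key_values` is the legacy tuple format, one `(key, value)` pair per layer with shape `(batch, heads, seq_len, head_dim)`; it is not the repo's actual implementation.

```python
import torch

# Hypothetical helper: truncate every layer's (key, value) pair along the
# sequence axis so the cache keeps only the first `keep_len` positions.
def prune_cache(past_key_values, keep_len):
    return tuple(
        (k[:, :, :keep_len, :], v[:, :, :keep_len, :])
        for k, v in past_key_values
    )
```

If the pruning itself is done like this, the cache contents stay a valid prefix, which is consistent with the observation that the bug also shows up in the non-pruning autoregressive path.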

That leads me to think the bug lies in how the cache is fed back to the model, rather than in the pruning itself.

I would greatly appreciate any help or advice here! Thanks.

vdaita commented 1 month ago

This might be a relevant issue:

https://github.com/huggingface/transformers/issues/26344