romsto / Speculative-Decoding

Implementation of the paper Fast Inference from Transformers via Speculative Decoding, Leviathan et al. 2023.
MIT License

Generation using cache gives weird sentences #1

Open romsto opened 2 months ago

romsto commented 2 months ago

While using the cache (`past_key_values`) during speculative decoding, or even plain autoregressive decoding, the generated tokens can come out garbled and nonsensical. Because of this behavior, speculative sampling is slowed down (sometimes even ending up slower than AR decoding).
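For reference, here is a minimal sketch of the usual cache-fed decoding loop, assuming a transformers-style causal LM interface. `decode_with_cache` is a hypothetical helper, not code from this repo; the key point is that once a cache is passed, only the *new* token ids should be fed on subsequent steps (and positions/attention mask must stay consistent), otherwise generations tend to degrade into exactly this kind of nonsense.

```python
import torch

# Hypothetical sketch (not this repo's code): greedy autoregressive decoding
# with a transformers-style model that accepts `past_key_values`.
@torch.no_grad()
def decode_with_cache(model, input_ids, max_new_tokens):
    past = None
    ids = input_ids          # full prompt on the first forward pass only
    generated = input_ids
    for _ in range(max_new_tokens):
        out = model(input_ids=ids, past_key_values=past, use_cache=True)
        past = out.past_key_values
        # greedy pick from the logits of the last position
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        ids = next_token     # feed ONLY the new token once a cache exists
    return generated
```

A common bug producing weird output is re-feeding the full `generated` sequence together with the cache, which double-counts positions; checking that only one token per step reaches the model after the first pass may be worth a look.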

`speculative_generate` edits the cache by pruning the last tokens when a rejection happens. I first thought the errors came from this pruning, but the generation is also weird in `autoregressive_generate`, even though there the cache is never edited or pruned.
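For what it's worth, pruning a legacy-format cache after a rejection can be sketched as below. This is an illustrative helper under the assumption that `past_key_values` is the legacy tuple format, one `(key, value)` pair per layer with shape `(batch, heads, seq_len, head_dim)`; it is not the repo's actual implementation.

```python
import torch

# Hypothetical helper: truncate every layer's (key, value) pair along the
# sequence axis so the cache keeps only the first `keep_len` positions.
def prune_cache(past_key_values, keep_len):
    return tuple(
        (k[:, :, :keep_len, :], v[:, :, :keep_len, :])
        for k, v in past_key_values
    )
```

If the pruning itself is done like this, the cache contents stay a valid prefix, which is consistent with the observation that the bug also shows up in the non-pruning autoregressive path.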

That leads me to think the bug lies in how the cache is fed back to the model, rather than in the pruning itself.

I would greatly appreciate any help or advice here! Thanks.

vdaita commented 1 month ago

This might be a relevant issue:

https://github.com/huggingface/transformers/issues/26344