tomaarsen / attention_sinks

Extend existing LLMs way beyond the original training length with constant memory usage, without retraining
https://huggingface.co/blog/tomaarsen/attention-sinks
Apache License 2.0
650 stars 41 forks

Last generated token getting ignored in streaming.py? #45

Open ritik99 opened 4 months ago

ritik99 commented 4 months ago

Hello,

I was looking into the streaming.py code and noticed that in greedy_generate() we overwrite the previous input_ids on line 33. As I understand the code, this discards the last generated token that was assigned to input_ids on line 42, so the final generated token of every prompt is never used as a query token.
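To make the concern concrete, here is a minimal toy sketch of the loop structure being described, not the actual streaming.py code: the names `fake_model` and `greedy_generate` below are stand-ins, and the real code operates on tensors and a KV cache rather than plain lists.

```python
# Toy illustration of the described pattern (hypothetical, not the real
# streaming.py): greedy_generate() leaves the most recently generated
# token in input_ids, but the outer loop overwrites input_ids with the
# next prompt before that token is ever passed to the model as a query.

def fake_model(query_tokens, seen):
    """Stand-in for a forward pass: records every query token it
    receives and returns a fresh 'next token' id."""
    seen.extend(query_tokens)
    return max(seen) + 1

def greedy_generate(input_ids, seen, max_gen_len):
    """Mimics the generation loop: each step feeds the previously
    generated token back in as the only query token."""
    for _ in range(max_gen_len):
        next_token = fake_model(input_ids, seen)
        input_ids = [next_token]  # would be the query on the next step
    return input_ids  # holds the last generated token, never queried

seen = []
for prompt in [[0, 1, 2], [100, 101]]:
    input_ids = prompt  # overwrites whatever greedy_generate left behind
    last = greedy_generate(input_ids, seen, max_gen_len=3)
    # `last` (the final generated token) is discarded here, so it never
    # reaches fake_model as a query token:
    assert last[0] not in seen
```

In this toy version the assertion passes for every prompt, which is the behavior the question is about: the overwrite is harmless for the tokens already emitted, but the last one never contributes a query.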

I don't think this leads to any significant change in the model outputs, but I just wanted to confirm whether my understanding is correct.

Thanks for sharing this implementation btw!

Ritik