tomaarsen / attention_sinks

Extend existing LLMs way beyond the original training length with constant memory usage, without retraining
https://huggingface.co/blog/tomaarsen/attention-sinks
Apache License 2.0

The results of sink/transformer/windowed under outputs_*/ folders are all the same #18

Closed by ZiweiHe 9 months ago

ZiweiHe commented 9 months ago

Hi,

The results you provide here appear to be problematic: under the same outputs_*/ directory, the results for the three different attention types are identical.

tomaarsen commented 9 months ago

Hello!

They do indeed start out equivalent for the first 1024 tokens, but they differ after that. This is because the windowed and attention_sinks approaches in the benchmarks use a window size of 1024, and before that point all three approaches perform the same computation. See, for example, token 4000 for evidence that the results are not identical.
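A quick way to check this yourself is to compare the benchmark CSVs directly. This is only a minimal sketch: the outputs_llama_2_7b/ paths, the file names, and the "nll" column name are assumptions about the benchmark output format, so adjust them to whatever your outputs_*/ folder actually contains.

```python
import pandas as pd

# Hypothetical paths and file names: adjust to the actual outputs_*/ folder
# produced by the benchmark script.
files = {
    "transformers": "outputs_llama_2_7b/transformers.csv",
    "windowed": "outputs_llama_2_7b/windowed.csv",
    "attention_sinks": "outputs_llama_2_7b/attention_sinks.csv",
}

# Load the per-token negative log-likelihood column (column name "nll" is assumed).
curves = {name: pd.read_csv(path)["nll"] for name, path in files.items()}

# The three approaches share the same computation up to the 1024-token window,
# so the curves should coincide there and diverge afterwards, e.g. at token 4000.
for name, curve in curves.items():
    print(f"{name}: nll at token 4000 = {curve.iloc[4000]:.4f}")
```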

Also, the figures in the README are direct plots of these .csv files; as you can see, they're not identical.
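If you want to reproduce such a figure locally, a plot along these lines should do. Again, the paths and the "nll" column name are assumptions, not the exact benchmark format.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Same hypothetical paths and column name as in the sketch above.
files = {
    "transformers": "outputs_llama_2_7b/transformers.csv",
    "windowed": "outputs_llama_2_7b/windowed.csv",
    "attention_sinks": "outputs_llama_2_7b/attention_sinks.csv",
}

# Plot the per-token log perplexity of each approach on one figure,
# mirroring the README plots generated from these CSVs.
for name, path in files.items():
    nll = pd.read_csv(path)["nll"]
    plt.plot(nll.index, nll, label=name)

plt.axvline(1024, linestyle="--", color="gray", label="window size (1024)")
plt.xlabel("Input length (tokens)")
plt.ylabel("Log perplexity")
plt.legend()
plt.show()
```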

I hope that clears it up!

Edit: If there is indeed a model for which they are exactly identical, then please let me know and I'll resolve it! I may have made a mistake at some point.

ZiweiHe commented 9 months ago

Oh, my mistake, thank you for your reply. Please feel free to delete this issue!

tomaarsen commented 9 months ago

No worries!