Closed: ZiweiHe closed this issue 9 months ago.

Hi,
The results you gave here are problematic: under the same directory, the results of the 3 different attention types are identical.

Hello!
They do indeed start out equivalent for the first 1024 tokens, but they differ after that. This is because the windowed and attention_sinks approaches in the benchmarks use a window size of 1024, and up to that point the three approaches are identical. See, for example, token 4000 in the .csv files for evidence that the results are not identical.
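If you want to check this yourself, a quick comparison of the benchmark CSVs at a position past the window should show the divergence. This is only a minimal sketch: the output directory, file names, and row/column layout below are assumptions on my part, so adjust them to the actual benchmark output.

```python
import pandas as pd

# Assumed layout: one CSV per attention type in the same output directory,
# with one row per token position. Paths here are hypothetical.
paths = {
    "transformers": "outputs/transformers.csv",
    "windowed": "outputs/windowed.csv",
    "attention_sinks": "outputs/attention_sinks.csv",
}

for name, path in paths.items():
    df = pd.read_csv(path)
    # Token 4000 is well past the shared 1024-token prefix, so the three
    # approaches should differ here.
    print(f"{name}: {df.iloc[4000].to_dict()}")
```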
Also, the figures in the README are direct plots of these .csv files, and as you can see there, they're not identical.
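Those plots can also be regenerated straight from the files. Again just a sketch: the column names "input_length" and "perplexity" are assumptions, so substitute whatever columns the benchmark actually logs.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Plot each CSV's perplexity curve; the curves should overlap up to the
# 1024-token window and separate afterwards.
for name in ("transformers", "windowed", "attention_sinks"):
    df = pd.read_csv(f"outputs/{name}.csv")  # hypothetical path
    plt.plot(df["input_length"], df["perplexity"], label=name)

plt.axvline(1024, linestyle="--", color="gray", label="window size (1024)")
plt.xlabel("Input length (tokens)")
plt.ylabel("Perplexity")
plt.legend()
plt.show()
```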
I hope that clears it up!
Edit: If there is indeed a model for which they are exactly identical, then please let me know and I'll resolve it! I may have made a mistake at some point.
Oh, my mistake. Thank you for your reply. Please feel free to delete this issue!
No worries!