Open Harleen8118 opened 9 months ago
Sliding attention - in principle - just allows the model to run faster by only considering a portion of the input context. If anything, it technically degrades quality rather than improving it.
In practice, sliding attention hasn't really been used by Mistral: most Mistral models seem to effectively set the sliding window length to the full input context length...
Someone please correct me if I'm mistaken on this.
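For context, the window size is a per-checkpoint setting; in Hugging Face checkpoints it shows up in `config.json` roughly like this (illustrative fragment only - the exact value depends on the release):

```json
{
  "model_type": "mistral",
  "sliding_window": 4096
}
```

Setting it to `null` (or to a value at least as large as the context you actually use) makes the layer behave like ordinary causal attention, which is what "effectively full context" means above.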
I'm not really sure, but here's why I thought it would be better: 1) Extending full attention has proven to be ineffective, as the model tends to remember only the first and last parts of the tokens. 2) A sliding window can fix this by sliding over the tokens and attending to specific parts of the input while predicting the next output token. This should result in better RAG and prompt following in general.
Please correct me if I'm misinterpreting something. Thanks :)
Properly trained models are able to do passkey retrieval on all of the text. Test GPT4 or DeepSeek Coder or Yi models and they all succeed. They use standard attention (no sliding window).
By definition, sliding window attention attends to less of the text than full context, so it performs worse on perplexity but is faster at inference.
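To make the difference concrete, here is a minimal sketch (numpy, illustrative only) of the two attention masks - full causal vs. a sliding window of the last `window` positions:

```python
import numpy as np

def causal_mask(seq_len, window=None):
    """Boolean attention mask: True where query i may attend to key j.

    window=None gives standard full causal attention; an integer
    restricts each query to the previous `window` positions, which is
    the sliding-window scheme discussed above.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = j <= i                  # causal: never attend to the future
    if window is not None:
        mask &= (i - j) < window   # sliding window: only the most recent keys
    return mask

full = causal_mask(6)
slid = causal_mask(6, window=3)
print(full.sum(), slid.sum())  # 21 15 - the window strictly attends to fewer positions
```

Since the windowed mask is a subset of the full causal mask, sliding window attention can never see more context than full attention at the same sequence length; the trade-off is purely speed and memory.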
I agree with @RonanKMcGovern here on the effectiveness of sliding window attention (even though I have not done an apples-to-apples comparison). Would appreciate it if someone could submit a PR adding the sliding window, so I can then find some time to run some small-scale experiments.
Sliding window attention is one of the tricks that makes Mistral 7B so good at prompt following. Can you add this sliding window to TinyLlama's architecture?