jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0

Add support for sliding window context similar to Mistral? #124

Open Harleen8118 opened 9 months ago

Harleen8118 commented 9 months ago

Sliding window context is one of the tricks that makes Mistral 7B so good at prompt following. Can you add this sliding window to TinyLlama's architecture?

RonanKMcGovern commented 9 months ago

Sliding-window attention - in principle - just allows the model to run faster by only attending to a portion of the input context. If anything, it technically degrades quality rather than improving it.
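Mechanically, the window is just an extra constraint on the usual causal attention mask: each query position may only see the last `window` key positions. A minimal illustrative sketch (not TinyLlama or Mistral code; the function name is made up):

```python
def sliding_window_mask(seq_len, window):
    # mask[i][j] is True where query position i may attend to key position j:
    # causal (j <= i), and within the last `window` positions (j > i - window)
    return [[(j <= i) and (j > i - window)
             for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(6, 3)
# every row attends to at most `window` positions
assert max(sum(row) for row in mask) == 3
```

With `window >= seq_len` this reduces to the ordinary causal mask, which is why setting the window to the full context length effectively disables the trick.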

In practice, sliding attention hasn't really been used by Mistral, and most Mistral models seem to effectively set the sliding window length to the full input context length...

Please correct me someone if I'm mistaken on this.

Harleen8118 commented 9 months ago

I'm not really sure, but here's why I thought it would be better: 1) Extending full attention has proven to be ineffective, as the model tends to remember only the first and last parts of the input tokens. 2) A sliding window can fix this by sliding over the tokens and attending to specific parts of the input while predicting the next output token. This should result in better RAG and prompt following in general.

Please correct me if I'm misinterpreting something. Thanks :)

RonanKMcGovern commented 9 months ago

Properly trained models are able to do passkey retrieval across all of the text. Test GPT-4, DeepSeek Coder, or the Yi models and they all succeed. They use standard attention (no sliding window).

By definition, sliding window attention attends to less of the text than full context attention, so it performs worse on perplexity but is faster at inference.
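The speed side of that trade-off can be made concrete by counting query-key pairs: full causal attention touches O(n²) pairs in total, while a window of size w caps it at O(n·w). A quick illustrative sketch (the numbers are hypothetical, not from any model's config):

```python
def attended_positions(seq_len, window=None):
    # total key positions visible across all queries under a causal mask,
    # optionally capped by a sliding window of size `window`
    total = 0
    for i in range(seq_len):
        visible = i + 1                     # causal: positions 0..i
        if window is not None:
            visible = min(visible, window)  # sliding window cap
        total += visible
    return total

full = attended_positions(4096)            # grows as n^2 / 2
windowed = attended_positions(4096, 1024)  # grows as roughly n * w
```

Here the window cuts the attended pairs by more than half, at the cost that no single attention layer can relate tokens more than `window` positions apart (Mistral relies on stacking layers to propagate information further).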

jzhang38 commented 9 months ago

I agree with @RonanKMcGovern here on the effectiveness of sliding window attention (even though I have not done an apples-to-apples comparison). Would appreciate it if someone could submit a PR adding the sliding window, so I can then find some time to run some small-scale experiments.