mit-han-lab / Quest

[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Do the discontinuous positional encodings confuse the model? #6

Open ovowei opened 4 months ago

ovowei commented 4 months ago

Hi,

I was reading your paper and have a question about the positional encodings. In my understanding, Quest performs attention only on the selected pages, and since those pages can be discontinuous in the sequence, this results in discontinuous positional encodings. LM-Infinite and StreamingLLM handle this by assigning continuous positional encodings, or by assigning the same positional encoding to all tokens beyond the local window size. Does Quest need similar processing?
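For concreteness, here is a toy sketch of the two position assignments I have in mind (the page layout and numbers are made up, not taken from your code):

```python
# Toy example: 16 cached tokens, page size 4, and suppose the query-aware
# selection keeps pages 0 and 3 for the current decoding step.
page_size = 4
selected_pages = [0, 3]
selected_tokens = [t for p in selected_pages
                   for t in range(p * page_size, (p + 1) * page_size)]

# Original (discontinuous) positions, as I understand Quest uses them:
original_pos = selected_tokens                       # [0, 1, 2, 3, 12, 13, 14, 15]

# LM-Infinite / StreamingLLM-style continuous re-assignment:
continuous_pos = list(range(len(selected_tokens)))   # [0, 1, 2, 3, 4, 5, 6, 7]
```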

Thanks!

Sakits commented 3 months ago

Hi @ovowei ,

Thank you for your interest in our work! Quest does not need processing similar to LM-Infinite or StreamingLLM. Instead, Quest directly applies the original positional embeddings to the selected pages (a rough sketch follows the list below). Here are the reasons for this approach:

  1. The pages selected by Quest cover the vast majority of the attention scores (>99%), so skipping the unselected pages can be viewed as pruning unimportant KV cache entries, and this works well in our evaluations.

  2. We have experimented with applying continuous positional encodings to the selected pages, but it performed worse than keeping the original positions. Unlike StreamingLLM, where only a few attention sinks are discontinuous with the recent token window, Quest selects pages with much larger gaps between them, and assigning continuous positional encodings across these gaps did not yield better results.
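For reference, here is a minimal sketch of what "directly applying the original positional embeddings" means. This is not the actual Quest kernel; the rotary helper, shapes, and page layout are simplified toys, and in practice the keys are usually already rotated when they are written into the cache. The only point is that the selected keys keep their original absolute positions.

```python
import torch

def rope(x, pos, theta=10000.0):
    # Minimal rotary embedding: x is (seq, dim), pos holds ABSOLUTE token positions.
    dim = x.shape[-1]
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    ang = pos[:, None].float() * inv_freq[None, :]        # (seq, dim/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Hypothetical setup: 16 cached tokens, page size 4, and the query-aware
# criterion selected pages 0 and 3 for the current decoding step.
dim, page_size = 64, 4
keys = torch.randn(16, dim)                  # un-rotated key cache (toy values)
selected_pages = [0, 3]
idx = torch.cat([torch.arange(p * page_size, (p + 1) * page_size)
                 for p in selected_pages])   # tensor([0, 1, 2, 3, 12, 13, 14, 15])

q = torch.randn(1, dim)
q_rot = rope(q, torch.tensor([15]))          # query at absolute position 15
k_rot = rope(keys[idx], idx)                 # keys keep their ORIGINAL absolute positions

attn = torch.softmax(q_rot @ k_rot.T / dim ** 0.5, dim=-1)   # attention over selected pages only
```

Because `k_rot` is rotated with the original absolute positions (`idx`), the gap between page 0 and page 3 is preserved rather than being re-mapped to a contiguous range.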

ovowei commented 3 months ago

Hi @Sakits

Thanks for your answers. It makes sense to me.

I used a model pre-trained on shorter sequences to process longer-sequence tasks, and found that combining Quest with assigning the same positional encoding to all tokens beyond a certain distance yields better results in this setting. This suggests that Quest might help models process extremely long sequences. I will conduct more experiments to verify this; if you have run similar experiments, I would appreciate it if you could share your results.
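Concretely, the position assignment I tried looks roughly like the sketch below (the helper name and threshold are just for illustration, not code from either project):

```python
# Tokens farther than `max_distance` from the current query all share one
# position; recent tokens keep their true distance to the query.
def clamp_positions(selected_tokens, query_pos, max_distance=4096):
    floor = query_pos - max_distance
    return [max(t, floor) for t in selected_tokens]

# Made-up example: the two distant tokens collapse onto a single position.
print(clamp_positions([0, 1, 8190, 8191], query_pos=8191, max_distance=4))
# -> [8187, 8187, 8190, 8191]
```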

Thanks!

Sakits commented 3 weeks ago

Hi @ovowei ,

Sorry for the delayed reply! I’ve been busy working on a paper submission recently. Thank you for sharing your insights and interesting discussions! :)

Yes, we also found that assigning the same positional encoding to tokens beyond a certain distance can extend the model's effective context range to some extent. There are some interesting works that discuss similar ideas, such as InfLLM and LongHeads. However, with more and more models offering extended context windows (from 128K up to 10M tokens), modifying the positional encodings in this way may not be as necessary as before.

Thank you again for your interest in our work!