mistralai / mistral-inference

Official inference library for Mistral models
https://mistral.ai/
Apache License 2.0

Inquiry on Implementing Sliding Window Attention for Custom Sequence Lengths #86

Open yihong1120 opened 7 months ago

yihong1120 commented 7 months ago

Dear Mistral Transformer Team,

I hope this message finds you well. I have been exploring the capabilities of the Mistral 7B model and am particularly intrigued by the implementation of sliding window attention as a means to improve inference efficiency and reduce memory pressure. The concept of using a fixed-size window to manage the (key, value) cache is quite innovative, and I believe it has the potential to significantly improve performance on long-sequence tasks.
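
To make sure I understand the mechanism correctly, here is a minimal sketch of how I currently picture the attention pattern: causal attention restricted to the last `window` positions. The function and parameter names (`sliding_window_mask`, `window`) are my own illustrative choices, not anything taken from your codebase.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask where entry [i, j] is True iff
    position i may attend to position j, i.e. i - window < j <= i
    (causal attention restricted to the last `window` tokens)."""
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]              # j <= i
    in_window = idx[None, :] > idx[:, None] - window   # j > i - window
    return causal & in_window

# Example: with window=3, token 5 attends only to tokens 3, 4 and 5.
print(sliding_window_mask(seq_len=8, window=3).int())
```

If this picture is inaccurate, please correct me.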

However, I am currently working with sequences of varying lengths that do not fit neatly into the predefined window sizes mentioned in your documentation. My objective is to adapt the sliding window attention mechanism to accommodate custom sequence lengths that may vary dynamically at runtime.

Could you provide guidance or best practices on how to modify the sliding window attention mechanism to handle variable sequence lengths? Specifically, I am interested in understanding:

  1. How to determine the optimal window size for a given sequence length to balance the trade-off between computational efficiency and context availability.
  2. The impact of sequence-length variability on the rolling buffer cache, and whether there are any recommended strategies for managing the cache effectively in such scenarios (a sketch of how I currently picture the cache follows this list).
  3. Any potential limitations or considerations to be aware of when implementing sliding window attention for custom sequence lengths.
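
To make question 2 concrete, here is a minimal sketch of how I imagine a rolling (key, value) buffer that stays bounded regardless of how long the sequence grows. The class name `RollingKVCache` and its parameters are hypothetical, chosen for illustration only; I am not assuming this matches your internal cache implementation.

```python
import torch

class RollingKVCache:
    """Illustrative rolling buffer: keeps only the last `window` (key, value)
    entries for one attention layer, overwriting the oldest slot in place."""

    def __init__(self, window: int, n_heads: int, head_dim: int,
                 dtype: torch.dtype = torch.float16):
        self.window = window
        self.k = torch.zeros(window, n_heads, head_dim, dtype=dtype)
        self.v = torch.zeros(window, n_heads, head_dim, dtype=dtype)
        self.pos = 0  # total number of tokens written so far

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor) -> None:
        """Write the (key, value) of one decoded token, shape (n_heads, head_dim)."""
        slot = self.pos % self.window  # wrap around: overwrite the oldest entry
        self.k[slot] = k_new
        self.v[slot] = v_new
        self.pos += 1

    def view(self) -> tuple[torch.Tensor, torch.Tensor]:
        """Return the cached keys/values in temporal order (oldest first)."""
        if self.pos <= self.window:
            return self.k[: self.pos], self.v[: self.pos]
        slot = self.pos % self.window
        order = torch.cat([torch.arange(slot, self.window), torch.arange(0, slot)])
        return self.k[order], self.v[order]
```

The point I am most unsure about is how such a buffer should behave when batched requests have differing and changing sequence lengths, e.g. whether padding, per-sequence write positions, or separate buffers per request is the recommended approach.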

I appreciate the work you have put into developing the Mistral Transformer and am excited about the possibility of integrating this feature into my own projects. Your insights on this matter would be invaluable.

Thank you for your time and assistance.

Best regards, yihong1120