mlc-ai / relax

Apache License 2.0

Add Attention Sinks (TVM portion) #301

Closed kmn1024 closed 6 months ago

kmn1024 commented 7 months ago

The TVM component of implementing Attention Sinks (https://arxiv.org/abs/2309.17453). See https://github.com/mlc-ai/mlc-llm/issues/1357

This API allows the caller to choose 1. how many slots to use as sinks, and 2. how far to trim the cache.

  1. Callers can pick a low number as in the paper, or a number large enough to keep the entire system prompt.
  2. The typical sliding-window approach would call this function after every append and trim down to max_window_size. For better performance, callers can trim less frequently but more aggressively.
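To illustrate the two knobs described above, here is a minimal Python sketch of sink-aware trimming. The function name `trim_with_sinks` and its parameters are hypothetical, not the actual TVM API; the KV cache is modeled as a plain list of per-token entries.

```python
# Hypothetical sketch of attention-sink trimming (not the actual TVM API).
# The first `num_sink_slots` entries are always kept as sinks; the rest of
# the cache behaves as a sliding window of the most recent tokens.

def trim_with_sinks(cache, num_sink_slots, max_window_size):
    """Trim `cache` so it holds the sink tokens plus at most
    `max_window_size - num_sink_slots` of the most recent tokens."""
    if len(cache) <= max_window_size:
        return cache  # under budget, nothing to trim
    sinks = cache[:num_sink_slots]
    recent = cache[len(cache) - (max_window_size - num_sink_slots):]
    return sinks + recent

# Typical sliding-window usage: trim after every append.
cache = []
for token in range(10):
    cache.append(token)
    cache = trim_with_sinks(cache, num_sink_slots=2, max_window_size=6)

print(cache)  # sinks [0, 1] plus the 4 most recent tokens: [0, 1, 6, 7, 8, 9]
```

Trimming less frequently (e.g. only when the cache exceeds some slack above `max_window_size`) gives the same final contents at lower bookkeeping cost, which is the performance trade-off the description mentions.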