This API lets the caller choose (1) how many slots to use as sinks and (2) how much to trim the cache to.
Callers can pick a small number of sinks, as in the paper, or a number large enough to keep the entire system prompt.
The typical sliding window approach would call this function after every append and trim to max_window_size. For better performance, callers can instead trim less frequently but more aggressively, so that fewer trim calls are needed overall.
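To illustrate the two knobs and the two calling patterns, here is a minimal pure-Python sketch. It is not the TVM API: `trim_with_sinks`, its parameter names, and the list-based cache are hypothetical stand-ins, and it assumes that "how much to trim the cache to" means the total number of entries kept, sinks included.

```python
def trim_with_sinks(cache, num_sinks, trim_to):
    """Hypothetical helper: keep the first `num_sinks` entries (the sinks)
    plus the most recent entries, so at most `trim_to` entries remain."""
    if not 0 <= num_sinks <= trim_to:
        raise ValueError("need 0 <= num_sinks <= trim_to")
    if len(cache) <= trim_to:
        return list(cache)  # nothing to trim
    num_recent = trim_to - num_sinks
    # Sinks come from the front, the recent window from the back.
    return list(cache[:num_sinks]) + list(cache[len(cache) - num_recent:])


max_window_size = 8
num_sinks = 2

# Pattern 1: typical sliding window, trim after every append.
cache = []
eager_calls = 0
for token in range(12):
    cache.append(token)
    cache = trim_with_sinks(cache, num_sinks, max_window_size)
    eager_calls += 1

# Pattern 2: trim only when the window overflows, but cut deeper (to 5),
# so the trim function is called far less often.
lazy = []
lazy_calls = 0
for token in range(12):
    lazy.append(token)
    if len(lazy) > max_window_size:
        lazy = trim_with_sinks(lazy, num_sinks, trim_to=5)
        lazy_calls += 1

print(cache, eager_calls)
print(lazy, lazy_calls)
```

In both patterns the first two entries survive as sinks. The second pattern trades a temporarily shorter context right after each trim for fewer trim calls.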
This is the TVM component of implementing Attention Sinks (https://arxiv.org/abs/2309.17453). See https://github.com/mlc-ai/mlc-llm/issues/1357