Closed: davidpissarra closed this 8 months ago
Thanks @davidpissarra. My main comment: since the implementation is specialized to the window case, let's make sure the API naming highlights that fact.
Please also send the PR to the unity branch of https://github.com/apache/tvm
Part of the effort on Sliding Window Attention (SWA): https://github.com/mlc-ai/mlc-llm/issues/1003. Overwriting the cache is useful when computing SWA because the cache then only needs to hold the keys and values of the current window, which makes it more efficient. Once the cache is full, new entries overwrite the oldest ones.
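
To illustrate the overwrite behaviour described above, here is a minimal pure-Python sketch of a fixed-size cache that wraps around once it is full. It is not the code in this PR: the class name `WindowKVCache` and its methods `append` and `window` are hypothetical, and plain Python objects stand in for the key/value tensors.

```python
class WindowKVCache:
    """Illustrative fixed-size key/value cache (hypothetical, not the PR's API)."""

    def __init__(self, window_size):
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self.window_size = window_size
        # One slot per position in the window; None marks an unused slot.
        self.slots = [None] * window_size
        # Total number of entries ever appended, including overwritten ones.
        self.num_seen = 0

    def append(self, key, value):
        # Positions wrap around, so once the cache is full the oldest
        # entry is the one that gets overwritten.
        slot = self.num_seen % self.window_size
        self.slots[slot] = (key, value)
        self.num_seen += 1

    def window(self):
        # Return the live entries in order from oldest to newest.
        count = min(self.num_seen, self.window_size)
        start = self.num_seen - count
        return [self.slots[pos % self.window_size]
                for pos in range(start, self.num_seen)]


cache = WindowKVCache(window_size=3)
for i in range(5):
    cache.append(f"k{i}", f"v{i}")
print(cache.window())
```

With a window of 3 and five appended entries, only the last three remain; the first two have been overwritten in place, so memory use stays bounded by the window size.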