Closed · bugm closed 6 months ago
Hi,
Thanks for your query. In our implementation, we cache the KV states before positional encoding is applied, and re-apply the positional encoding during generation using positions relative to the cache. This avoids recomputing the cache at every step, which would raise the complexity to O(TL^2). Please see our method here: Modify LLaMA.
Best, Guangxuan
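To make the "cache before positional encoding" idea concrete, here is a minimal NumPy sketch. It assumes rotary position embeddings (RoPE) as used in LLaMA; the `rope` helper, the window sizes, and the eviction pattern are illustrative, not the repo's actual code. The point is that keys are stored *unrotated*, and at each step the rotation is applied with each key's index inside the current cache window rather than its original position in the text:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding (RoPE) to vectors x at integer
    positions pos. x: (seq, d) with d even; pos: (seq,)."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, d, 2) / d))
    angles = np.outer(pos, inv_freq)          # (seq, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]           # interleaved pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin        # 2D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
raw_keys = rng.standard_normal((8, 4))        # pre-RoPE key cache (hypothetical)

# Hypothetical streaming window: keep 2 attention-sink tokens plus the
# last 4 tokens after evicting the middle ones.
kept = np.concatenate([raw_keys[:2], raw_keys[4:]])

# Rotate with positions 0..5 *within the cache*, not the original text
# positions 0,1,4,5,6,7 — no recomputation of deeper layers is needed,
# because the cached keys were never rotated in the first place.
cache_positions = np.arange(len(kept))
k_rot = rope(kept, cache_positions)
```

Because the stored keys are position-free, evicting tokens and shifting positions only changes the cheap rotation step, not the per-layer projections, so each generation step stays O(L) in cache size.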
Hi, thanks for your nice work! As described in the paper,
streaming-llm needs to change the positional indices assigned to the tokens (except the attention-sink ones) to ensure that the model operates efficiently even beyond its pre-training attention window size.
Normally, once the positional encoding changes, the previous KV cache (except the KV cache in the first layer) cannot be reused, because the KV states from the 2nd layer onward were computed under the old positional encoding. Recomputing them, however, would raise the complexity to O(TL^2).
What do you think of this mismatch? Will it affect performance?
If so, perhaps some mechanism could be applied to mitigate it.