Closed laoda513 closed 1 month ago
It doesn't have to be, no. It can be greater to maintain the chunk size with a longer prefix. It's the maximum of q_len * k_len, and since k_len grows the longer the context is, it forces smaller and smaller chunks as the sequence grows, in order to maintain a constant size for the attention weights.
The value is ignored when using flash-attn, except it's still used to guesstimate the memory requirement when loading a model with manual split.
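To make the chunking behaviour concrete, here is a rough sketch (not the library's actual implementation): a hypothetical helper that picks the largest chunk whose `q_len * k_len` product stays within `max_attention_size` as the cached prefix (`past_len`) grows. The function name and the closed-form quadratic solution are illustrative only.

```python
import math

def next_chunk_len(remaining_q_len: int,
                   past_len: int,
                   max_input_len: int,
                   max_attention_size: int) -> int:
    # Largest chunk c satisfying c * (past_len + c) <= max_attention_size,
    # i.e. the positive root of c^2 + past_len * c - max_attention_size = 0.
    limit = int((-past_len + math.sqrt(past_len ** 2 + 4 * max_attention_size)) / 2)
    # Never exceed the configured chunk size or what's left of the input.
    return max(1, min(remaining_q_len, max_input_len, limit))

# With an attention budget of 2048 ** 2, chunks shrink as the prefix grows:
for past in (0, 4096, 16384):
    print(past, next_chunk_len(2048, past, 2048, 2048 ** 2))
# 0 2048, 4096 848, 16384 252
```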
I am quite confused about this part.
Does max_attention_size always have to be max_input_len ** 2? Would it be more robust to set
`self.max_attention_size = self.max_input_len ** 2` if I try to use a different max_input_len?
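As a minimal sketch of that suggestion, assuming a config object that exposes max_input_len and max_attention_size (the dataclass below is a stand-in, and the 2048 defaults are an assumption): deriving max_attention_size from max_input_len ** 2 keeps the two values in sync when the chunk size is changed, though per the answer above a larger value is also valid and simply allows bigger chunks with a longer prefix, at the cost of a larger attention-weight buffer.

```python
from dataclasses import dataclass

@dataclass
class ChunkingConfig:
    # Stand-in for the real config object; only the two fields under
    # discussion are modeled here (default values assumed).
    max_input_len: int = 2048
    max_attention_size: int = 2048 ** 2

def set_max_input_len(config: ChunkingConfig, new_len: int) -> None:
    # Keep the attention-weight budget in step with the chunk size,
    # mirroring the max_input_len / max_input_len ** 2 pairing.
    config.max_input_len = new_len
    config.max_attention_size = new_len ** 2

cfg = ChunkingConfig()
set_max_input_len(cfg, 4096)
print(cfg.max_input_len, cfg.max_attention_size)  # 4096 16777216
```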