turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

max_attention_size should be max_input_len**2? #458

Closed laoda513 closed 1 month ago

laoda513 commented 1 month ago

I am quite confused by this part.

```python
    def __init__(self,
                 model_dir: str | None = None):
        """
        :param model_dir:
            If specified, initialize ExLlamaV2Config with values read from model config.
        """

        self.max_batch_size = 1
        self.max_input_len = 2048
        self.max_attention_size = 2048**2
```

Does max_attention_size always have to be `max_input_len ** 2`? Would it be more robust to set `self.max_attention_size = self.max_input_len ** 2` if I try to use a different max_input_len?
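
For concreteness, the question amounts to something like the following sketch (an assumption about intended usage, not an official recommendation; the attribute names are the ones shown in the snippet above and the model path is a placeholder):

```python
from exllamav2 import ExLlamaV2Config

config = ExLlamaV2Config("/path/to/model")

# Raise the chunk length and keep the attention budget tied to it,
# instead of relying on the fixed 2048 ** 2 default.
config.max_input_len = 4096
config.max_attention_size = config.max_input_len ** 2
```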

turboderp commented 1 month ago

It doesn't have to be, no. It can be greater, to maintain the chunk size with a longer prefix. It's the maximum of q_len * k_len, and since k_len grows as the context gets longer, it forces smaller and smaller chunks as the sequence grows, in order to maintain a constant size for the attention weights.
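
A minimal sketch of that relationship (just the arithmetic described above, not exllamav2's actual chunking code): as the cached context grows, the largest chunk q_len that keeps q_len * k_len within max_attention_size shrinks.

```python
# Sketch only: how a fixed q_len * k_len budget shrinks the chunk size
# as the cached context grows. Not the library's implementation.

def max_chunk_len(past_len: int,
                  max_input_len: int = 2048,
                  max_attention_size: int = 2048 ** 2) -> int:
    # Processing q_len new tokens against past_len cached tokens produces an
    # attention matrix of roughly q_len * (past_len + q_len) weights.
    # Reduce q_len until that stays within the budget.
    q_len = max_input_len
    while q_len > 1 and q_len * (past_len + q_len) > max_attention_size:
        q_len -= 1
    return q_len

if __name__ == "__main__":
    for past_len in (0, 2048, 8192, 32768):
        print(f"past_len={past_len:6d} -> chunk of up to {max_chunk_len(past_len)} tokens")
```

The same budget that allows a full 2048-token chunk at the start of a sequence permits only much smaller chunks once tens of thousands of tokens are already cached.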

The value is ignored when using flash-attn, except it's still used to guesstimate the memory requirement when loading a model with manual split.