turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

understanding config.max_input_len #348

Closed bdambrosio closed 4 months ago

bdambrosio commented 4 months ago

I just discovered this exllamav2 parameter and am trying to understand its implications. I saw in the quantize documentation that the default is 2048, with a comment that if the input is longer, it will be processed in parts. This parameter seems to affect inference with a quantized model as well? (At least, it seems to affect the VRAM needed.)

I work with large contexts, often 6-8k tokens, usually with a Mixtral quant.

Should I set this parameter in my inference config? Does it matter that the quant was done with the default 2k? I can quantize locally with a larger parameter value if needed.

tnx!

turboderp commented 4 months ago

The value won't affect the output or the maximum sequence length you can process. What it changes is how many tokens will be processed in one forward pass when doing inference over a longer sequence, typically in prompt ingestion.

A lower value saves a bit more memory but also slows down prompt ingestion. Increasing it does the opposite. I find that around 2048 is usually the best tradeoff, but feel free to experiment with other values.

You'd probably want to use it in conjunction with the max_attention_size parameter, which limits how large the attention matrix can be, since memory usage otherwise scales quadratically with sequence length.
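For illustration, a minimal sketch of how these settings might look when loading a model through the Python API. The model path is a placeholder, and the attribute names assume the standard ExLlamaV2Config interface (max_seq_len, max_input_len, max_attention_size):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/mixtral-exl2"   # placeholder path to the quantized model
config.prepare()                             # read the model's own config.json

config.max_seq_len = 8192                    # the context length you actually need
config.max_input_len = 2048                  # tokens processed per forward pass during prompt ingestion
config.max_attention_size = 2048 ** 2        # cap on the attention matrix size

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)     # KV cache sized for max_seq_len
model.load_autosplit(cache)                  # split weights across available GPUs
tokenizer = ExLlamaV2Tokenizer(config)
```

Lowering max_input_len (and max_attention_size) trims VRAM at the cost of slower prompt ingestion; raising them does the reverse, without changing the output or the usable context length.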

bdambrosio commented 4 months ago

Ah. Thanks!