turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Cache size below max_seq_len? #259

Closed fahadh4ilyas closed 10 months ago

fahadh4ilyas commented 10 months ago

Is it possible to make ExLlamaCache shorter than the intended max_seq_len? When we set max_new_tokens for generation, the maximum length actually needed is often shorter than max_seq_len. If we allocate an ExLlamaCache longer than we need, especially with a high max_seq_len, that memory is wasted holding zeros.

turboderp commented 10 months ago

Yes, you can simply pass a smaller max_seq_len when allocating the cache, e.g.:

    config = ExLlamaConfig(...)
    config.max_seq_len = 4096                       # model's maximum context length
    model = ExLlama(config)
    cache = ExLlamaCache(model, max_seq_len = 512)  # allocate a cache smaller than max_seq_len

There was a bug preventing this from working with fused attention (which is enabled by default), but the latest commit should fix that.
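For context, a rough end-to-end sketch along the lines of the repo's example scripts (the paths are placeholders, and the module-level imports assume you're running from the repo root):

    from model import ExLlama, ExLlamaCache, ExLlamaConfig
    from tokenizer import ExLlamaTokenizer
    from generator import ExLlamaGenerator

    # Placeholder paths to a local GPTQ model directory
    config = ExLlamaConfig("/path/to/model/config.json")
    config.model_path = "/path/to/model/model.safetensors"
    config.max_seq_len = 4096

    model = ExLlama(config)
    tokenizer = ExLlamaTokenizer("/path/to/model/tokenizer.model")

    # The cache only needs to cover prompt length + max_new_tokens
    cache = ExLlamaCache(model, max_seq_len = 512)
    generator = ExLlamaGenerator(model, tokenizer, cache)

    # Prompt tokens + 200 new tokens stay well under the 512-token cache
    output = generator.generate_simple("Once upon a time,", max_new_tokens = 200)
    print(output)

Just make sure the prompt plus max_new_tokens never exceeds the cache's max_seq_len, since that is all the space the cache has.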

fahadh4ilyas commented 10 months ago

Already tested it and it works like a charm~