Closed: fahadh4ilyas closed this issue 10 months ago
Yes, you can simply set max_seq_len to a smaller value when allocating the cache, e.g.:
config = ExLlamaConfig(...)                      # config arguments elided in the original post
config.max_seq_len = 4096                        # model's maximum context length
model = ExLlama(config)
cache = ExLlamaCache(model, max_seq_len = 512)   # cache only holds up to 512 tokens
There was a bug preventing this from working with fused attention (which is enabled by default), but the latest commit should fix that.
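For completeness, a hedged usage sketch: the tokenizer and generator classes below follow exllama's example scripts (ExLlamaTokenizer, ExLlamaGenerator, generate_simple), so treat the exact imports and signatures as assumptions and verify them against the repo. The point is that generation with the smaller cache works as usual, provided the prompt plus the new tokens fit inside the cache's 512-token window.

from tokenizer import ExLlamaTokenizer   # assumed import path, per exllama's examples
from generator import ExLlamaGenerator   # assumed import path, per exllama's examples

tokenizer = ExLlamaTokenizer(tokenizer_path)           # tokenizer_path is a placeholder
generator = ExLlamaGenerator(model, tokenizer, cache)

# prompt tokens + max_new_tokens must stay within the cache's max_seq_len (512 here)
output = generator.generate_simple("Hello,", max_new_tokens = 128)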
Already tested it and it works like a charm~
Is it possible to make ExLlamaCache shorter than the intended max_seq_len? When we set max_new_tokens for generation, the maximum length actually needed is often shorter than max_seq_len. If ExLlamaCache is allocated longer than we need, especially when max_seq_len is set high, the extra memory is wasted holding zero values.
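To make the sizing concrete, a minimal sketch of the arithmetic (prompt_len and max_new_tokens are illustrative names, not part of any API; only the ExLlamaCache call is from this thread): the cache never needs more room than the prompt plus the tokens you intend to generate.

prompt_len = 384        # tokens in the prompt (example value)
max_new_tokens = 128    # tokens we plan to generate (example value)

needed = prompt_len + max_new_tokens               # 512 tokens in total
cache = ExLlamaCache(model, max_seq_len = needed)  # no zero-filled slack past 512

Anything allocated beyond that bound is key/value storage that generation can never reach, which is exactly the waste described above.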