turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Generation uses config.max_seq_len instead of default 2048 #236

Closed flotos closed 11 months ago

flotos commented 11 months ago

config.max_seq_len was not used when calling generator.generate_simple(), making it impossible to use prompts larger than 2048 tokens, for example with the new Llama 2 models that have a 4096-token context size.
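The gist of the fix can be sketched as follows: truncate the prompt against the model's configured context length rather than a hardcoded 2048. This is a minimal illustration, not exllama's actual code; the `Config` class and `truncate_ids` helper are hypothetical names.

```python
# Hypothetical sketch of the reported bug and its fix. The real exllama
# generate_simple() is more involved; this only shows the truncation logic.

class Config:
    """Stand-in for exllama's model config object."""
    def __init__(self, max_seq_len=4096):
        self.max_seq_len = max_seq_len

def truncate_ids(ids, config, max_new_tokens):
    # Buggy version (before the fix): limit = 2048 - max_new_tokens,
    # which ignored config.max_seq_len and capped Llama 2 at 2048.
    # Fixed version: respect the configured context size (e.g. 4096).
    limit = config.max_seq_len - max_new_tokens
    return ids[-limit:]  # keep the most recent tokens that fit

cfg = Config(max_seq_len=4096)
prompt_ids = list(range(5000))  # stand-in for a tokenized 5000-token prompt
kept = truncate_ids(prompt_ids, cfg, max_new_tokens=128)
print(len(kept))  # 4096 - 128 = 3968 tokens retained
```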

turboderp commented 11 months ago

Yep, this looks like an oversight.