No matter what I try, I can't set the context_length of a GPTQ model. It's overridden by ExLLAMA, which then sets the cache size and context_length to whatever it has as default (in this case 2048).
First problem is that it's actually using max_seq_len to set the context_length, and the Config dataclass doesn't include that field. I even tried monkey patching the Config dataclass and setting the field there.
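As a minimal sketch of that monkey-patch attempt (Config here is a stand-in for the library's dataclass, not the real one; the 4096 target and field names are assumptions):

```python
from dataclasses import dataclass, field, make_dataclass

# Stand-in for the library's Config dataclass, which lacks max_seq_len.
@dataclass
class Config:
    context_length: int = 2048

# Monkey patch: rebuild the dataclass with the missing max_seq_len field.
PatchedConfig = make_dataclass(
    "PatchedConfig",
    [("max_seq_len", int, field(default=4096))],
    bases=(Config,),
)

cfg = PatchedConfig(context_length=4096)
print(cfg.context_length, cfg.max_seq_len)  # both 4096 on the patched object
```

The patched object carries both fields, but that only changes the Python-side dataclass.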
None of these will change the context_length used by the GPTQ model because it uses the ExLLAMA config instead.
If I reach in and modify the ExLLAMA config after loading the model, it correctly sets the context_length, but the cache has already been allocated at 2048, so it promptly crashes whenever you ask for a long response.
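A toy model of that failure mode (illustrative names only, not the real ExLLAMA code): the cache is sized from the config once at load time, so mutating the config afterwards changes nothing about the already-allocated buffer.

```python
class Config:
    def __init__(self, max_seq_len=2048):
        self.max_seq_len = max_seq_len

class Model:
    def __init__(self, config):
        self.config = config
        # Cache is allocated once, at load time, from the config's value.
        self.cache = [None] * config.max_seq_len

    def generate(self, n_tokens):
        # Fails once generation exceeds the cache actually allocated,
        # regardless of what the config now claims.
        if n_tokens > len(self.cache):
            raise IndexError(
                f"requested {n_tokens} tokens but cache holds {len(self.cache)}"
            )
        return n_tokens

model = Model(Config())          # loaded with the 2048 default
model.config.max_seq_len = 4096  # "reach in" after loading: too late

try:
    model.generate(3000)
except IndexError as e:
    print("crash:", e)
```

This is why the override has to happen before model load, when the cache size is derived from the config.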