turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

test_inference.py: --low_mem is broken unless --max_output_len is also set #511

Closed IMbackK closed 1 week ago

IMbackK commented 1 week ago
```
% python test_inference.py -m CodeLlama-13B-GPTQ/ -p "int main(" -nfa -l 2048 -lm
Finding flash_attn
NO flash_attn module
 -- Model: CodeLlama-13B-GPTQ/
 -- Options: ['length: 2048', 'no_flash_attn', 'low_mem']
Traceback (most recent call last):
  File "/home/philipp/machine-lerning/exllamav2/test_inference.py", line 95, in <module>
    model, tokenizer = model_init.init(args,
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/machine-lerning/exllamav2/exllamav2/model_init.py", line 101, in init
    if args.low_mem: config.set_low_mem()
                     ^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/machine-lerning/exllamav2/exllamav2/config.py", line 143, in set_low_mem
    self.max_output_len = min(self.max_output_len, 1024)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
turboderp commented 1 week ago

Thanks. Fixed in dev branch now.
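
For anyone hitting this before the fix reaches a release: the crash comes from calling min() while max_output_len is still None (it is only set when --max_output_len is passed). Below is a minimal sketch of the kind of guard that avoids the TypeError; the class, constructor, and usage here are simplified stand-ins, only the max_output_len attribute and set_low_mem name come from the traceback, and the actual dev-branch change may look different.

```python
# Simplified stand-in for the config object; only max_output_len and
# set_low_mem are taken from the traceback, the rest is illustrative.
class Config:
    def __init__(self):
        # max_output_len stays None unless --max_output_len is passed
        self.max_output_len = None

    def set_low_mem(self):
        # min(None, 1024) raises the TypeError shown in the traceback,
        # so only clamp when a value has actually been set.
        if self.max_output_len is None:
            self.max_output_len = 1024
        else:
            self.max_output_len = min(self.max_output_len, 1024)


config = Config()
config.set_low_mem()          # no longer raises TypeError
print(config.max_output_len)  # 1024
```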