turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

MemoryError on TinyLlama and Llama-70B-chat #375

Closed MarcusGrattan closed 6 months ago

MarcusGrattan commented 6 months ago

When I try to run either the quantized 2.30bit Llama-70B-chat-exl2 model or the quantized TinyLlama model with the test_inference.py script, I get the following output.

python test_inference.py -m ../models/Llama2-70B-chat-exl2 -p "Once upon a time,"

 -- Model: ../models/Llama2-70B-chat-exl2
 -- Options: []
Traceback (most recent call last):
  File "/home/margr792/kandidat/llm-annotate/exllamav2/test_inference.py", line 86, in <module>
    model, tokenizer = model_init.init(args, allow_auto_split = True, skip_load = args.stream_layers, benchmark = True)
  File "/home/margr792/kandidat/llm-annotate/exllamav2/exllamav2/model_init.py", line 82, in init
    config.prepare()
  File "/home/margr792/kandidat/llm-annotate/exllamav2/exllamav2/config.py", line 156, in prepare
    f = STFile.open(st_file, fast = self.fasttensors)
  File "/home/margr792/kandidat/llm-annotate/exllamav2/exllamav2/fasttensors.py", line 65, in open
    return STFile(filename, fast)
  File "/home/margr792/kandidat/llm-annotate/exllamav2/exllamav2/fasttensors.py", line 47, in __init__
    self.read_dict()
  File "/home/margr792/kandidat/llm-annotate/exllamav2/exllamav2/fasttensors.py", line 76, in read_dict
    header_json = fp.read(header_size)
MemoryError

I followed the installation instructions in the README (using conda as my environment manager). I'm wondering if anybody knows what may be causing the issue or can point me in the right direction.

GPU: RTX 4090 24GB
RAM: 64GB
OS: Mint 21.3

Any help would greatly be appreciated.

MarcusGrattan commented 6 months ago

Fixed the issue. The problem was that git lfs had not downloaded the model files correctly, so the safetensors files on disk were not the real weights.
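This would explain the traceback: a safetensors file begins with an 8-byte little-endian integer giving the JSON header length, but a leftover Git LFS pointer stub is a tiny text file, so its first 8 bytes (the ASCII text "version ") decode to an astronomically large length, and the subsequent fp.read(header_size) attempts a huge allocation and raises MemoryError. A minimal sketch of the mechanism (the stub contents here are illustrative, not copied from a real repo):

```python
import struct

# Simulate a Git LFS pointer stub left in place of a real .safetensors file.
stub = b"version https://git-lfs.github.com/spec/v1\noid sha256:...\nsize 123\n"

# safetensors layout: first 8 bytes = little-endian uint64 header length.
header_size = struct.unpack("<Q", stub[:8])[0]

# The ASCII bytes of "version " decode to a number in the quintillions,
# so fp.read(header_size) tries to allocate an enormous buffer.
print(header_size)
```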

talwrii commented 3 months ago

In my case the problem was that I had not checked out the Git LFS (Large File Storage) files with git lfs checkout. Before doing that, the safetensors file was just a short YAML-like pointer file.
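A quick way to spot this situation is to check whether the file starts with the LFS pointer signature instead of binary data. A minimal sketch (the filename demo.safetensors is a stand-in, not a file from the repo):

```python
from pathlib import Path

def is_lfs_pointer(path):
    # Git LFS pointer stubs are tiny text files that start with this line;
    # real safetensors files start with an 8-byte binary header length.
    head = Path(path).read_bytes()[:64]
    return head.startswith(b"version https://git-lfs")

# Demo with a hypothetical stub file:
p = Path("demo.safetensors")
p.write_bytes(b"version https://git-lfs.github.com/spec/v1\n")
print(is_lfs_pointer(p))   # a pointer stub, so this reports True
```

If this reports True, re-running git lfs pull (or git lfs checkout) in the model repo should replace the stub with the actual weights.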