turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

[BUG] Out of memory from a 2.4bpw 70B parameter model #677

Open cmunna0052 opened 15 hours ago

cmunna0052 commented 15 hours ago

OS

Linux

GPU Library

CUDA 12.x

Python version

3.12

Pytorch version

2.4.1

Model

LoneStriker/Meta-Llama-3.1-70B-Instruct-2.4bpw-h6-exl2

Describe the bug

The test script fails with an out-of-memory error on a model that should fit comfortably within the ~46 GB of GPU memory on this EC2 instance:

 python exllamav2/test_inference.py -m Meta-Llama-3.1-70B-Instruct-2.4bpw-h6-exl2 -p "Once upon a time,"
 -- Model: Meta-Llama-3.1-70B-Instruct-2.4bpw-h6-exl2
 -- Options: []
Loading: Meta-Llama-3.1-70B-Instruct-2.4bpw-h6-exl2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:09 0:00:00
 -- Loaded model in 9.7356 seconds
 -- Loading tokenizer...
Traceback (most recent call last):
  File "/home/ubuntu/exllamav2/test_inference.py", line 192, in <module>
    cache = ExLlamaV2Cache(model) if not model.tp_context else ExLlamaV2Cache_TP(model)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/exllamav2/exllamav2/cache.py", line 256, in __init__
    self.create_state_tensors(copy_from, lazy)
  File "/home/ubuntu/exllamav2/exllamav2/cache.py", line 91, in create_state_tensors
    p_key_states = torch.zeros(self.shape_wk, dtype = self.dtype, device = device).contiguous()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU 0 has a total capacity of 44.53 GiB of which 136.44 MiB is free. Including non-PyTorch memory, this process has 44.38 GiB memory in use. Of the allocated memory 43.62 GiB is allocated by PyTorch, and 275.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Reproduction steps

Download the model:

git lfs install
git clone --depth 1 https://huggingface.co/LoneStriker/Meta-Llama-3.1-70B-Instruct-2.4bpw-h6-exl2 

Run the test script:

 python exllamav2/test_inference.py -m Meta-Llama-3.1-70B-Instruct-2.4bpw-h6-exl2 -p "Once upon a time,"

Expected behavior

The model would run and generate.

Logs

No response

Additional context

No response


DocShotgun commented 13 hours ago

Llama 3.1 has a default context size of 131072 tokens, which will consume a considerable amount of VRAM for the cache. Have you tried loading it with a smaller sequence length?
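For reference, the test script takes a maximum sequence length argument (-l, if I remember the flag correctly), so something along these lines should shrink the cache:

 python exllamav2/test_inference.py -m Meta-Llama-3.1-70B-Instruct-2.4bpw-h6-exl2 -p "Once upon a time," -l 4096

The equivalent through the Python API is roughly the sketch below, following the usual exllamav2 loading pattern; the 4096 is just an example value, not a recommendation:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "Meta-Llama-3.1-70B-Instruct-2.4bpw-h6-exl2"
config.prepare()
config.max_seq_len = 4096                   # cap the context; the model's config defaults to 131072

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)  # defer cache allocation until the model is loaded
model.load_autosplit(cache)                 # load weights, then allocate the (much smaller) cache
tokenizer = ExLlamaV2Tokenizer(config)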

cmunna0052 commented 3 hours ago

I don't understand. Shouldn't the context length be determined by the length of the prompt plus the number of newly generated tokens? I expected it to be very short, since the prompt is only "Once upon a time," and "tokens" is set to 128.
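For scale: the traceback shows the cache being allocated up front in ExLlamaV2Cache's constructor, sized for the model's configured max_seq_len rather than for the prompt. A rough back-of-the-envelope check, assuming Llama 3.1 70B's published shape (80 layers, 8 KV heads, head dim 128) and the default FP16 cache, shows why the full 131072-token context doesn't fit next to the weights:

# Approximate KV-cache footprint at the default context length (assumed architecture values).
layers, kv_heads, head_dim = 80, 8, 128     # Llama 3.1 70B uses GQA with 8 KV heads
max_seq_len = 131072                        # context length the cache is sized for by default
bytes_per_elem = 2                          # FP16 cache entries

cache_bytes = max_seq_len * layers * 2 * kv_heads * head_dim * bytes_per_elem  # 2 = keys + values
print(f"{cache_bytes / 2**30:.1f} GiB")     # prints 40.0 GiB, before counting any weights

With the 2.4bpw weights taking roughly 21 GB on their own, a full-length FP16 cache of ~40 GiB cannot fit on a 44.5 GiB card, which is why capping max_seq_len (or using a quantized cache such as ExLlamaV2Cache_Q4) matters here.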