turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

RuntimeError: Insufficient VRAM for model and cache using load_autosplit_gen #383

Closed xu-jenny closed 3 months ago

xu-jenny commented 3 months ago

Hi, I just started using exllamav2 on a free-tier Colab and I get this error when running the examples/inference.py script.

RuntimeError                              Traceback (most recent call last)
<ipython-input-28-595e4bd80d10> in <cell line: 12>()
     10 
     11 cache = ExLlamaV2Cache(model, lazy = True)
---> 12 model.load_autosplit(cache)
     13 
     14 tokenizer = ExLlamaV2Tokenizer(config)

/usr/local/lib/python3.10/dist-packages/exllamav2/model.py in load_autosplit_gen(self, cache, reserve_vram, last_id_only, callback, callback_gen)
    442                         current_device += 1
    443                         if current_device >= num_devices:
--> 444                             raise RuntimeError("Insufficient VRAM for model and cache")
    445 
    446                         continue

RuntimeError: Insufficient VRAM for model and cache

I was able to run the examples/chat.py script using the same model (LoneStriker/Mixtral_11Bx2_MoE_19B-3.0bpw-h6-exl2); the safetensors file is only 3.86 GB and the Colab has 12 GB of RAM. I got the same error on an A6000 on Paperspace as well, so I don't expect this to be a GPU RAM limitation; could it be a bug in the code when using a single GPU?

I tried both installing from source and from PyPI and had the same errors. Thanks in advance!
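For reference, the loading code in examples/inference.py that hits this error looks roughly like this (a sketch; the model path is a placeholder):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

# Point the config at the downloaded exl2 model directory (placeholder path)
config = ExLlamaV2Config()
config.model_dir = "/path/to/Mixtral_11Bx2_MoE_19B-3.0bpw-h6-exl2"
config.prepare()

model = ExLlamaV2(config)

# Lazy cache + autosplit: the loader fills each GPU in turn and raises
# "Insufficient VRAM for model and cache" if the last device runs out of room
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
```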

turboderp commented 3 months ago

It's a bit hard to say. Perhaps something went wrong with the download? The safetensors file for that model should be around 7 GB, not 3.86. Even so, when I run it locally it only uses a little over 9 GB of VRAM. Is it possible you have other stuff in VRAM somehow? I'm not sure exactly how it's all managed in Colab.

As for Paperspace, I've never tried that so I'm not sure what sort of interaction it might have with ExLlama.
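If it helps to rule either of those out, here's a quick sketch (the model path is a placeholder) that checks the on-disk size of the download and how much VRAM is actually free, using standard stdlib/PyTorch calls:

```python
import glob, os, torch

# Total size of the downloaded .safetensors shards (should be ~7 GB for this model)
model_dir = "/path/to/Mixtral_11Bx2_MoE_19B-3.0bpw-h6-exl2"  # placeholder path
total_bytes = sum(os.path.getsize(f) for f in glob.glob(os.path.join(model_dir, "*.safetensors")))
print(f"safetensors on disk: {total_bytes / 1024**3:.2f} GiB")

# Free vs. total VRAM on device 0, as reported by the CUDA driver
free, total = torch.cuda.mem_get_info(0)
print(f"VRAM free: {free / 1024**3:.2f} GiB / {total / 1024**3:.2f} GiB")
```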

xu-jenny commented 3 months ago

Thanks for the reply. I tried again with an A6000 from Paperspace (48 GiB VRAM) and was able to get it to work with this model. I would love to try bigger models like LoneStriker/Smaug-34B-v0.1-3.0bpw-h6-exl2 (the lowest precision available for that model). However, I run into a CUDA OutOfMemory error when running test_inference.py. The safetensors files add up to only 13.83 GB, so I didn't expect an out-of-memory error on a 48 GiB GPU. Curious if there's something else I'm doing wrong.

Here's the GPU info right before & after I loaded the model:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:00:05.0 Off |                  Off |
| 30%   59C    P8    26W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here's what I did:

Upgrade GPU CUDA from 11.6 to 12.4

Using instructions from NVIDIA's official docs:

!sudo apt-get --purge remove "*cuda*" "*cublas*" "*cufft*" "*cufile*" "*curand*" "*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*" "*nvvm*"
!sudo apt-get --purge -y remove "*nvidia*" "libxnvctrl*"
!sudo apt-get -y autoremove
!wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
!sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
!wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda-repo-ubuntu2004-12-4-local_12.4.0-550.54.14-1_amd64.deb
!sudo dpkg -i cuda-repo-ubuntu2004-12-4-local_12.4.0-550.54.14-1_amd64.deb
!sudo cp /var/cuda-repo-ubuntu2004-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
!sudo apt-get update
!sudo apt-get -y install cuda-toolkit-12-4
!echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64' >> ~/.bashrc
!echo 'export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/root/.local/bin:/usr/local/cuda-12.4/bin' >> ~/.bashrc
!source ~/.bashrc
!nvcc --version  #output matches expected

Install exllamav2 & download the model

!pip install -r exllamav2/requirements.txt
!cd exllamav2 && pip install .
from huggingface_hub import snapshot_download
model_name="LoneStriker/Smaug-34B-v0.1-3.0bpw-h6-exl2"
model_dir = snapshot_download(repo_id=model_name)
print(model_dir)

Out of memory issue

!python exllamav2/test_inference.py -m /root/.cache/huggingface/hub/models--LoneStriker--Smaug-34B-v0.1-3.0bpw-h6-exl2/snapshots/3aa3086876e2f636d25944b5eae459543dc6b25b -p "Which country has the most and least population?"

Error:

 -- Model: /root/.cache/huggingface/hub/models--LoneStriker--Smaug-34B-v0.1-3.0bpw-h6-exl2/snapshots/3aa3086876e2f636d25944b5eae459543dc6b25b
 -- Options: []
 -- Loading model...
 -- Loaded model in 4.4778 seconds
 -- Loading tokenizer...
Traceback (most recent call last):
  File "/notebooks/exllamav2/test_inference.py", line 174, in <module>
    cache = ExLlamaV2Cache(model)
  File "/notebooks/exllamav2/exllamav2/cache.py", line 157, in __init__
    self.create_state_tensors(copy_from, lazy)
  File "/notebooks/exllamav2/exllamav2/cache.py", line 53, in create_state_tensors
    p_value_states = torch.zeros(self.batch_size, self.max_seq_len, self.num_key_value_heads, self.head_dim // self.weights_per_element, dtype = self.dtype, device = self.model.cache_map[i]).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 392.00 MiB. GPU 0 has a total capacity of 47.54 GiB of which 249.12 MiB is free. Process 3791038 has 47.29 GiB memory in use. Of the allocated memory 46.73 GiB is allocated by PyTorch, and 245.41 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management

Thanks so much for taking a look! This is my first time using exllamav2, so I would really appreciate some help. I was able to load the GPTQ and AWQ versions, and I saw that exl2 is supposed to be less memory-intensive with faster inference, so I was confused why this one couldn't be loaded.

turboderp commented 3 months ago

If you're not passing any extra arguments (-l specifically), it will try to allocate space for the model's full context length, which in this case, judging by the config, is 200k tokens. In general, the VRAM needed per cached token works out to:

head_dim (i.e. hidden_size / num_attention_heads) * num_key_value_heads * num_hidden_layers * 2 (for keys + values) * bytes_per_element (2 for FP16, 1 for FP8, ~0.56 for Q4)

So for this model, that's 128 * 8 * 60 * 2 * 2 = 245,760 bytes ≈ 240 kB per token.

At 200k tokens, that's around 46 GB in total, which is on top of the quantized weights and buffers for activations and stuff. You should be able to run the model with a reduced context length (-l 100000 or something) or in Q4 cache mode (-cq4).
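To make the arithmetic above concrete, here's a small sketch that reads those values out of the model's config.json (assuming the standard Hugging Face key names from the formula above; the path is a placeholder) and estimates the cache footprint for a given context length and cache precision:

```python
import json

def kv_cache_gib(config_path, max_seq_len, bytes_per_element = 2.0):
    # bytes_per_element: 2 for FP16, 1 for FP8, ~0.56 for the Q4 cache
    with open(config_path) as f:
        cfg = json.load(f)
    head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
    per_token = head_dim * cfg["num_key_value_heads"] * cfg["num_hidden_layers"] * 2 * bytes_per_element
    return per_token * max_seq_len / 1024**3

# Smaug-34B (head_dim 128, 8 KV heads, 60 layers): 240 kB per token in FP16, so
print(kv_cache_gib("config.json", 200_000))          # ~45.8 GiB at the full 200k context
print(kv_cache_gib("config.json", 100_000))          # ~22.9 GiB with -l 100000
print(kv_cache_gib("config.json", 200_000, 0.5625))  # ~12.9 GiB with the Q4 cache (-cq4)
```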