Closed xu-jenny closed 3 months ago
It's a bit hard to say. Perhaps something went wrong with the download? The safetensors file for that model should be around 7 GB, not 3.86. Even so when I run it locally it only uses a little over 9 GB of VRAM. Is it possible you have other stuff in VRAM somehow? Not sure exactly how it's all managed in Colab.
As for Paperspace, I've never tried that so I'm not sure what sort of interaction it might have with ExLlama.
Thanks for the reply, I tried again with an A6000 from paperspace (48GiB VRAM) and was able to get it to work with this model. I would love to try bigger models like LoneStriker/Smaug-34B-v0.1-3.0bpw-h6-exl2 (lowest precision for the model).
However, I run into a CUDA OutOfMemory error when running test_inference.py. There are only 13.83 GB in the safetensors, so I didn't expect an out-of-memory error to happen with a 48 GiB GPU. Curious if there's something else I'm doing wrong.
Here's the GPU info from right before and after I loaded the model:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04 Driver Version: 525.116.04 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:00:05.0 Off | Off |
| 30% 59C P8 26W / 300W | 1MiB / 49140MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Here's what I did:
Using instructions from nvidia's official docs:
!sudo apt-get --purge remove "*cuda*" "*cublas*" "*cufft*" "*cufile*" "*curand*" "*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*" "*nvvm*"
!sudo apt-get --purge -y remove "*nvidia*" "libxnvctrl*"
!sudo apt-get -y autoremove
!wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
!sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
!wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda-repo-ubuntu2004-12-4-local_12.4.0-550.54.14-1_amd64.deb
!sudo dpkg -i cuda-repo-ubuntu2004-12-4-local_12.4.0-550.54.14-1_amd64.deb
!sudo cp /var/cuda-repo-ubuntu2004-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
!sudo apt-get update
!sudo apt-get -y install cuda-toolkit-12-4
!echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64' >> ~/.bashrc
!echo 'export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/root/.local/bin:/usr/local/cuda-12.4/bin' >> ~/.bashrc
!source ~/.bashrc
!nvcc --version #output matches expected
!pip install -r exllamav2/requirements.txt
!cd exllamav2 && pip install .
from huggingface_hub import snapshot_download
model_name="LoneStriker/Smaug-34B-v0.1-3.0bpw-h6-exl2"
model_dir = snapshot_download(repo_id=model_name)
print(model_dir)
!python exllamav2/test_inference.py -m /root/.cache/huggingface/hub/models--LoneStriker--Mixtral_11Bx2_MoE_19B-3.0bpw-h6-exl2/snapshots/10b89eecc843b36fd59a2ed51363fbe6abb86b2a -p "Which country has the most and least population?"
Error:
-- Model: /root/.cache/huggingface/hub/models--LoneStriker--Smaug-34B-v0.1-3.0bpw-h6-exl2/snapshots/3aa3086876e2f636d25944b5eae459543dc6b25b
-- Options: []
-- Loading model...
-- Loaded model in 4.4778 seconds
-- Loading tokenizer...
Traceback (most recent call last):
File "/notebooks/exllamav2/test_inference.py", line 174, in <module>
cache = ExLlamaV2Cache(model)
File "/notebooks/exllamav2/exllamav2/cache.py", line 157, in __init__
self.create_state_tensors(copy_from, lazy)
File "/notebooks/exllamav2/exllamav2/cache.py", line 53, in create_state_tensors
p_value_states = torch.zeros(self.batch_size, self.max_seq_len, self.num_key_value_heads, self.head_dim // self.weights_per_element, dtype = self.dtype, device = self.model.cache_map[i]).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 392.00 MiB. GPU 0 has a total capacity of 47.54 GiB of which 249.12 MiB is free. Process 3791038 has 47.29 GiB memory in use. Of the allocated memory 46.73 GiB is allocated by PyTorch, and 245.41 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management
Thanks so much for taking a look! This is my first time using exllamav2, so I would really appreciate some help. I was able to load the GPTQ and AWQ versions and saw that exl2 is supposed to be less memory intensive with faster inference, so I was confused about why this couldn't be loaded.
If you're not passing any extra arguments (-l specifically), it will try to allocate space for the model's full context length, which in this case, judging by the config, is 200k tokens. In general, the VRAM needed per cached token works out to:
head_dim (i.e. hidden_size / num_attention_heads) * num_key_value_heads * num_hidden_layers * 2 (for keys + values) * bytes_per_element (2 for FP16, 1 for FP8, ~0.56 for Q4)
So for this model, that's 128 * 8 * 60 * 2 * 2 = 240 kB per token.
At 200k tokens, that's around 46 GB in total, which is on top of the quantized weights and buffers for activations and stuff. You should be able to run the model with a reduced context length (-l 100000 or something) or in Q4 cache mode (-cq4).
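The cache arithmetic above can be checked with a small standalone sketch. It uses only the numbers quoted in this thread (head_dim 128, 8 key/value heads, 60 layers, 2 bytes per FP16 element, roughly 0.56 for Q4); the function names and the 30 GiB cache budget are hypothetical, chosen for illustration, and nothing here calls into exllamav2 itself.

```python
GIB = 2 ** 30


def kv_cache_bytes_per_token(head_dim, num_key_value_heads, num_hidden_layers,
                             bytes_per_element=2.0):
    """Bytes of KV cache per cached token: keys + values across all layers."""
    return head_dim * num_key_value_heads * num_hidden_layers * 2 * bytes_per_element


def max_tokens_for_budget(budget_bytes, per_token_bytes):
    """Largest context length whose cache fits in the given byte budget."""
    return int(budget_bytes // per_token_bytes)


# Values quoted in the thread for this model.
per_token_fp16 = kv_cache_bytes_per_token(128, 8, 60, bytes_per_element=2.0)
per_token_q4 = kv_cache_bytes_per_token(128, 8, 60, bytes_per_element=0.56)

# Cache size at the full 200k-token context.
full_context = 200_000
total_fp16_gib = full_context * per_token_fp16 / GIB
total_q4_gib = full_context * per_token_q4 / GIB

# Hypothetical budget: assume about 30 GiB is left for the cache after
# weights and activation buffers (an assumption, not a measured figure).
budget = 30 * GIB
fits_fp16 = max_tokens_for_budget(budget, per_token_fp16)

print(f"FP16: {per_token_fp16 / 1024:.0f} KiB/token, {total_fp16_gib:.1f} GiB at 200k tokens")
print(f"Q4:   {per_token_q4 / 1024:.1f} KiB/token, {total_q4_gib:.1f} GiB at 200k tokens")
print(f"Tokens fitting in a 30 GiB FP16 cache budget: {fits_fp16}")
```

With the FP16 figures this reproduces the 240 kB per token and roughly 46 GB totals mentioned above, which is why the full-context cache alone nearly fills a 48 GiB card before the weights are counted.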
Hi, I just started using exllamav2 on a free-tier Colab and get this error when running the examples/inference.py script. I was able to run the examples/chat.py script using the same model (LoneStriker/Mixtral_11Bx2_MoE_19B-3.0bpw-h6-exl2). The safetensors file is only 3.86 GB and the Colab has 12 GB RAM, and I got the same error on an A6000 on Paperspace as well. I don't expect this to be a GPU RAM limitation; could it be a bug in the code when using a single GPU? I tried both an install from source and PyPI and had the same errors. Thanks in advance!