I am running Llama 3.1 70B at 6.0BPW with the ExLlamav2_HF loader, 64K context, no_flash_attn, and autosplit.
After the model is fully loaded with these parameters, I still have at least 20GB of VRAM left over.
I can send a few messages to the AI in the chat tab at first, but as soon as the context passes 6K - 7K tokens, I get an OOM error even though there appears to be more than enough VRAM.
To reproduce, send a single large message of around 7K - 8K tokens, which is well within the 64K context the model was loaded with.
Out of memory
Screenshot
rocm-smi once the model is fully loaded
Logs
16:36:40-789264 INFO Loading "Meta-Llama-3.1-70B-Instruct-6.0bpw-h6-exl2"
16:36:42-092387 WARNING Failed to load flash-attention due to the following error:
Traceback (most recent call last):
File "/home/rsa/text-generation-webui/modules/exllamav2_hf.py", line 23, in <module>
import flash_attn
ModuleNotFoundError: No module named 'flash_attn'
/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:600: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
warnings.warn(
/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py:575: UserWarning: expandable_segments not supported on this platform (Triggered internally at ../c10/hip/HIPAllocatorConfig.h:29.)
reserved_vram_tensors.append(torch.empty((b,), dtype = torch.int8, device = _torch_device(current_device)))
16:38:03-832021 INFO Loaded "Meta-Llama-3.1-70B-Instruct-6.0bpw-h6-exl2" in 83.04 seconds.
16:38:03-833268 INFO LOADER: "ExLlamav2_HF"
16:38:03-833786 INFO TRUNCATION LENGTH: 65536
16:38:03-834231 INFO INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
Output generated in 8.11 seconds (5.92 tokens/s, 48 tokens, context 83, seed 614014744)
Traceback (most recent call last):
File "/home/rsa/text-generation-webui/modules/callbacks.py", line 61, in gentask
ret = self.mfunc(callback=_callback, *args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/modules/text_generation.py", line 398, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/generation/utils.py", line 2215, in generate
result = self._sample(
^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/generation/utils.py", line 3206, in _sample
outputs = self(**model_inputs, return_dict=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/modules/exllamav2_hf.py", line 129, in __call__
self.ex_model.forward(seq_tensor[longest_prefix:-1].view(1, -1), ex_cache, preprocess_only=True, loras=self.loras)
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py", line 878, in forward
r = self.forward_chunk(
^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py", line 984, in forward_chunk
x = module.forward(x, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/attn.py", line 1102, in forward
attn_output = attn_func(batch_size, q_len, q_states, k_states, v_states, attn_params, cfg)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/attn.py", line 856, in _attn_torch
attn_output = F.scaled_dot_product_attention(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/nn/attention/bias.py", line 281, in __torch_function__
return cls._dispatch(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/nn/attention/bias.py", line 258, in _dispatch
return scaled_dot_product_attention(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 31.98 GiB of which 82.00 MiB is free. Of the allocated memory 30.73 GiB is allocated by PyTorch, and 792.56 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Output generated in 0.68 seconds (0.00 tokens/s, 0 tokens, context 8527, seed 1232530018)
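A rough back-of-the-envelope check on the 512.00 MiB request in the traceback above. This is only a sketch under assumptions that are not in the logs: that the non-flash `F.scaled_dot_product_attention` path materializes a full fp16 score matrix, that the model has 64 query heads (as Llama 3.1 70B does), and that the prompt is prefilled in 2048-token chunks. `attn_scores_mib` is a hypothetical helper, not part of ExLlamaV2.

```python
# Estimate the size of a materialized attention-score matrix.
# Assumptions (not taken from the logs): fp16 scores (2 bytes each),
# 64 query heads, and a 2048-token chunk attending to 2048 keys.

def attn_scores_mib(num_heads: int, q_len: int, kv_len: int,
                    bytes_per_elem: int = 2) -> float:
    """Size in MiB of a [num_heads, q_len, kv_len] score tensor."""
    return num_heads * q_len * kv_len * bytes_per_elem / 2**20


print(attn_scores_mib(64, 2048, 2048))  # 512.0
```

Under those assumptions the figure matches the failed allocation, which would suggest that the attention scratch buffer on GPU 0, not the weights or the cache, is what no longer fits once the prompt gets long.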
System Info
OS: Ubuntu 22.04
GPU: 3x Radeon Instinct MI100 (32GB VRAM each)
CPU: AMD Epyc 9334
ROCM 6.1.2
Text Gen Web UI V1.16
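Since the model is split across three cards, it may help to read the OOM line per device: the error only talks about GPU 0, and free memory on the other cards cannot satisfy an allocation made on GPU 0. Below is a minimal sketch that pulls the numbers out of the message from my logs. `parse_oom` is a hypothetical helper written for this report, and the regexes only assume the wording seen in this one message.

```python
import re

# The relevant part of the error message from the logs above.
OOM_MESSAGE = (
    "HIP out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total "
    "capacity of 31.98 GiB of which 82.00 MiB is free. Of the allocated "
    "memory 30.73 GiB is allocated by PyTorch, and 792.56 MiB is reserved "
    "by PyTorch but unallocated."
)

_MIB_PER_UNIT = {"MiB": 1, "GiB": 1024}


def _size_mib(pattern: str, message: str) -> float:
    """Return the size matched by `pattern`, converted to MiB."""
    match = re.search(pattern, message)
    if match is None:
        raise ValueError(f"pattern not found: {pattern}")
    return float(match.group(1)) * _MIB_PER_UNIT[match.group(2)]


def parse_oom(message: str) -> dict:
    """Extract the device index and key sizes (in MiB) from an OOM message."""
    gpu = re.search(r"GPU (\d+) has", message)
    if gpu is None:
        raise ValueError("no GPU index found")
    return {
        "gpu": int(gpu.group(1)),
        "requested_mib": _size_mib(r"Tried to allocate ([\d.]+) (MiB|GiB)", message),
        "total_mib": _size_mib(r"total capacity of ([\d.]+) (MiB|GiB)", message),
        "free_mib": _size_mib(r"of which ([\d.]+) (MiB|GiB) is free", message),
    }


info = parse_oom(OOM_MESSAGE)
shortfall = info["requested_mib"] - info["free_mib"]
print(f"GPU {info['gpu']}: requested {info['requested_mib']} MiB, "
      f"free {info['free_mib']} MiB, short by {shortfall} MiB")
```

So the 20GB I see as left over is presumably sitting on the other cards, while GPU 0 itself is almost completely full when the request comes in.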