oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Constant Out Of Memory errors despite having plenty of VRAM - AMD #6497

Open RSAStudioGames opened 1 month ago

RSAStudioGames commented 1 month ago

Describe the bug

I am running Llama 3.1 70B at 6.0 bpw with the ExLlamav2_HF loader, 64K context, no_flash_attn, and autosplit. After the model is fully loaded with these settings, I still have at least 20 GB of VRAM left over.

I can send a few messages in the chat tab at first, but as soon as the context passes 6-7K tokens, generation fails with an OOM error even though there is still more than enough VRAM.
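As a sanity check on the "plenty of VRAM" claim, here is a rough budget for this configuration. It is a sketch under stated assumptions, not a measurement: the layer and head counts are the published Llama 3.1 70B architecture, the parameter count is rounded to 70.6 billion, the cache is assumed to be unquantized FP16 (the report does not mention a quantized cache), and the cards are treated as exactly 32 GiB each.

```python
# Back-of-the-envelope VRAM budget for the setup described above.
# All defaults are assumptions labeled inline, not values read from the loader.

GIB = 1024 ** 3

def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Size of an unquantized (FP16) key/value cache for `tokens` positions."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

def weight_bytes(params=70.6e9, bits_per_weight=6.0):
    """Approximate size of the quantized weights at an average bits-per-weight."""
    return params * bits_per_weight / 8

cache = kv_cache_bytes(65536)      # 64K context
weights = weight_bytes()           # 6.0 bpw
total_vram = 3 * 32 * GIB          # assumption: three cards of 32 GiB each

print(f"KV cache : {cache / GIB:.1f} GiB")
print(f"weights  : {weights / GIB:.1f} GiB")
print(f"leftover : {(total_vram - cache - weights) / GIB:.1f} GiB across all GPUs")
```

Under these assumptions the cache alone is 20 GiB, and the leftover figure is an aggregate over all three cards. An allocation still has to fit on a single device, so a healthy total does not rule out one GPU running out.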

Reproduction

  1. Load the Llama 3.1 70B model with ExLlamav2_HF, 64K context, no_flash_attn, and autosplit enabled.
  2. Send a large message of around 7K-8K tokens, well within what the system should be able to handle.
  3. Generation fails with an out-of-memory error.
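With no_flash_attn, attention falls back to a path that can materialize the full score matrix, and its size grows with the product of chunk length and context length. The sketch below estimates that buffer for one layer. The 64 query heads are the published Llama 3.1 70B value; the 2048-token chunk size is an assumption about the loader's defaults and is not confirmed anywhere in this report.

```python
MIB = 1024 ** 2

def attn_scores_bytes(q_len, kv_len, heads=64, bytes_per_elem=2, batch=1):
    """Memory for one layer's attention score matrix when it is fully materialized."""
    return batch * heads * q_len * kv_len * bytes_per_elem

# One 2048-token chunk attending to 2048 cached positions (FP16 scores):
print(attn_scores_bytes(2048, 2048) / MIB)  # 512.0

# The same chunk attending to a longer context needs proportionally more:
print(attn_scores_bytes(2048, 8527) / MIB)
```

The first figure matches the 512.00 MiB allocation that fails in the traceback under Logs, which would be consistent with the OOM coming from an attention temporary rather than from the weights or the cache. That is an inference from the numbers, not something the log states.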

Screenshot

rocm-smi output once the model is fully loaded: [image]

Logs

16:36:40-789264 INFO     Loading "Meta-Llama-3.1-70B-Instruct-6.0bpw-h6-exl2"
16:36:42-092387 WARNING  Failed to load flash-attention due to the following error:

Traceback (most recent call last):
  File "/home/rsa/text-generation-webui/modules/exllamav2_hf.py", line 23, in <module>
    import flash_attn
ModuleNotFoundError: No module named 'flash_attn'
/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:600: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
  warnings.warn(
/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py:575: UserWarning: expandable_segments not supported on this platform (Triggered internally at ../c10/hip/HIPAllocatorConfig.h:29.)
  reserved_vram_tensors.append(torch.empty((b,), dtype = torch.int8, device = _torch_device(current_device)))
16:38:03-832021 INFO     Loaded "Meta-Llama-3.1-70B-Instruct-6.0bpw-h6-exl2" in 83.04 seconds.
16:38:03-833268 INFO     LOADER: "ExLlamav2_HF"
16:38:03-833786 INFO     TRUNCATION LENGTH: 65536
16:38:03-834231 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
Output generated in 8.11 seconds (5.92 tokens/s, 48 tokens, context 83, seed 614014744)
Traceback (most recent call last):
  File "/home/rsa/text-generation-webui/modules/callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rsa/text-generation-webui/modules/text_generation.py", line 398, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/generation/utils.py", line 2215, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/generation/utils.py", line 3206, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rsa/text-generation-webui/modules/exllamav2_hf.py", line 129, in __call__
    self.ex_model.forward(seq_tensor[longest_prefix:-1].view(1, -1), ex_cache, preprocess_only=True, loras=self.loras)
  File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py", line 878, in forward
    r = self.forward_chunk(
        ^^^^^^^^^^^^^^^^^^^
  File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py", line 984, in forward_chunk
    x = module.forward(x, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/attn.py", line 1102, in forward
    attn_output = attn_func(batch_size, q_len, q_states, k_states, v_states, attn_params, cfg)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/attn.py", line 856, in _attn_torch
    attn_output = F.scaled_dot_product_attention(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/nn/attention/bias.py", line 281, in __torch_function__
    return cls._dispatch(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rsa/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/nn/attention/bias.py", line 258, in _dispatch
    return scaled_dot_product_attention(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 31.98 GiB of which 82.00 MiB is free. Of the allocated memory 30.73 GiB is allocated by PyTorch, and 792.56 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Output generated in 0.68 seconds (0.00 tokens/s, 0 tokens, context 8527, seed 1232530018)
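The numbers in the final error line can be pulled out with a small parser to see how far GPU 0 was from satisfying the request. This is a minimal sketch that only handles the message wording shown in the log above; `parse_oom` is a hypothetical helper, not part of PyTorch or the web UI.

```python
import re

UNITS = {"MiB": 1, "GiB": 1024}  # convert everything to MiB

def parse_oom(message):
    """Extract the sizes (in MiB) that PyTorch reports in an out-of-memory error."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "capacity": r"total capacity of ([\d.]+) (MiB|GiB)",
        "free": r"of which ([\d.]+) (MiB|GiB) is free",
        "allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
        "reserved_unallocated": r"([\d.]+) (MiB|GiB) is reserved by PyTorch but unallocated",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[name] = float(match.group(1)) * UNITS[match.group(2)]
    return sizes

message = (
    "torch.OutOfMemoryError: HIP out of memory. Tried to allocate 512.00 MiB. "
    "GPU 0 has a total capacity of 31.98 GiB of which 82.00 MiB is free. "
    "Of the allocated memory 30.73 GiB is allocated by PyTorch, and 792.56 MiB "
    "is reserved by PyTorch but unallocated."
)
sizes = parse_oom(message)
shortfall = sizes["requested"] - sizes["free"]
print(f"GPU 0 is short by {shortfall:.0f} MiB for this allocation")
# prints "GPU 0 is short by 430 MiB for this allocation"
```

Two things stand out from the parsed values. The failure is specific to GPU 0, which is essentially full regardless of what the other two cards have free. Also, the reserved-but-unallocated pool (792.56 MiB) is larger than the 512 MiB request, which is the situation the error text associates with fragmentation, while the earlier UserWarning in the same log says expandable_segments is not supported on this platform.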

System Info

OS: Ubuntu 22.04
GPU: 3x Radeon Instinct MI100 (32GB VRAM each)
CPU: AMD EPYC 9334
ROCm: 6.1.2
text-generation-webui: v1.16