turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

OOM with gpu_split_auto, must specify split manually #166

Closed. cikkle closed this issue 4 months ago.

cikkle commented 11 months ago

I'm loading a 70B / 4.625 bpw model on a dual 7900 XTX system (ROCm 5.7.1, amdgpu driver 6.2.4). Manually setting gpu_split to 22, 24 always loads fine, but 'auto' always results in an OOM error:

Successfully preprocessed all matching files.
Loading model: /home/o0/ai/tabbyAPI/models/Euryale-1.3-L2-70B-4.65bpw-h6-exl2
Modules |██████████████████▎             | 93/163
Traceback (most recent call last):
  File "/home/o0/ai/tabbyAPI/main.py", line 227, in <module>
    for (module, modules) in load_status:
  File "/home/o0/ai/tabbyAPI/model.py", line 170, in load_gen
    yield from self.model.load_autosplit_gen(self.cache, reserve_vram = reserve, last_id_only = True, callback_gen = progress_callback)
  File "/home/o0/miniconda3/lib/python3.11/site-packages/exllamav2/model.py", line 365, in load_autosplit_gen
    hidden_state = module.forward(hidden_state, cache = cache, attn_mask = attn_mask, past_len = past_len, loras = loras)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/o0/miniconda3/lib/python3.11/site-packages/exllamav2/attn.py", line 376, in forward
    if attn_mask is not None: attn_weights = attn_weights + attn_mask
                                             ~~~~~~~~~~~~~^~~~~~~~~~~
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 23.98 GiB of which 0 bytes is free. Of the allocated memory 23.50 GiB is allocated by PyTorch, and 212.68 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF

This is on a headless server with nothing else occupying VRAM. The settings are applied through tabbyAPI's config file, but I get the same behavior in ooba too when I don't specify a split and it falls back to auto, which appears to be the default.

Not sure what other relevant information I can provide.

turboderp commented 11 months ago

It might be that ROCm has slightly more variable overhead than CUDA. The loader will reserve 96 MB of VRAM for that by default, but I've never had any ROCm devices to actually test it on, so it's quite possible that's simply not enough.

At the top of model.py, the reserved space is set by the variable auto_split_reserve_bytes. Could you try increasing that to, say, 256 * 1024**2 or more and see if that makes a difference? If it works I can make it the default behavior for ROCm, but sadly I have no way to test it myself at the moment.
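
For reference, a minimal sketch of the edit being suggested here; the variable name comes from the comment above, and which model.py it lives in depends on the frontend (see the clarification further down), so treat it as illustrative rather than a patch:

    # Sketch of the suggested edit, near the top of the relevant model.py.
    # The default reserve is 96 MB; the suggestion is to try 256 MB or more.
    auto_split_reserve_bytes = 256 * 1024**2   # reserve 256 MiB per GPU for overhead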

cikkle commented 11 months ago

No luck, pretty much the same error.

Successfully preprocessed all matching files.
Loading model: /home/o0/ai/tabbyAPI/models/Euryale-1.3-L2-70B-4.65bpw-h6-exl2
Modules |█████████████████▉              | 91/163
Traceback (most recent call last):
  File "/home/o0/ai/tabbyAPI/main.py", line 227, in <module>
    for (module, modules) in load_status:
  File "/home/o0/ai/tabbyAPI/model.py", line 171, in load_gen
    yield from self.model.load_autosplit_gen(self.cache, reserve_vram = reserve, last_id_only = True, callback_gen = progress_callback)
  File "/home/o0/miniconda3/lib/python3.11/site-packages/exllamav2/model.py", line 365, in load_autosplit_gen
    hidden_state = module.forward(hidden_state, cache = cache, attn_mask = attn_mask, past_len = past_len, loras = loras)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/o0/miniconda3/lib/python3.11/site-packages/exllamav2/attn.py", line 376, in forward
    if attn_mask is not None: attn_weights = attn_weights + attn_mask
                                             ~~~~~~~~~~~~~^~~~~~~~~~~
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 23.98 GiB of which 342.00 MiB is free. Of the allocated memory 23.16 GiB is allocated by PyTorch, and 204.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF

I tried some higher values as well, but I'm still OOMing.

turboderp commented 11 months ago

Ooooh, hmm! Actually it might just be down to the OoM exception looking different for ROCm.

The exception is caught during auto-loading to detect when one GPU is full and it's time to proceed to the next one. That seems to be quite reliable, except the code looks like this:

                    except Exception as e:

                        test = 0
                        if "CUDA out of memory" in str(e):
                            fail = True  # Exception object will hold references to tensors so we can't free them here
                        else:
                            raise

If you're compiling from source, you could try changing "CUDA" to "HIP" on line 371 of exllamav2/model.py, and that might do the trick.
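
For reference, a hedged sketch of that edit; broadening the check to match both wordings (rather than just swapping the string) is my own generalization, based on the HIP message shown in the tracebacks above:

                    except Exception as e:

                        test = 0
                        # Match the ROCm/HIP wording as well as the CUDA one
                        if "CUDA out of memory" in str(e) or "HIP out of memory" in str(e):
                            fail = True  # Exception object will hold references to tensors so we can't free them here
                        else:
                            raise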

cikkle commented 11 months ago

Changing the string and reinstalling from source apparently lets it finish loading the model, but then it still throws an OOM exception anyway:

Loading model: /home/o0/ai/tabbyAPI/models/Euryale-1.3-L2-70B-4.65bpw-h6-exl2
Modules |████████████████████████████████| 163/163
Traceback (most recent call last):
  File "/home/o0/ai/tabbyAPI/main.py", line 227, in <module>
    for (module, modules) in load_status:
  File "/home/o0/ai/tabbyAPI/model.py", line 175, in load_gen
    self.model.forward(input_ids, cache = self.cache, preprocess_only = True)
  File "/home/o0/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/o0/.local/lib/python3.11/site-packages/exllamav2-0.0.8-py3.11-linux-x86_64.egg/exllamav2/model.py", line 582, in forward
    r, ls = self._forward(input_ids = input_ids[:, chunk_begin : chunk_end],
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/o0/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/o0/.local/lib/python3.11/site-packages/exllamav2-0.0.8-py3.11-linux-x86_64.egg/exllamav2/model.py", line 655, in _forward
    x = module.forward(x, cache = cache, attn_mask = attn_mask, past_len = past_len, loras = loras)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/o0/.local/lib/python3.11/site-packages/exllamav2-0.0.8-py3.11-linux-x86_64.egg/exllamav2/attn.py", line 376, in forward
    if attn_mask is not None: attn_weights = attn_weights + attn_mask
                                             ~~~~~~~~~~~~~^~~~~~~~~~~
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 23.98 GiB of which 136.00 MiB is free. Of the allocated memory 23.29 GiB is allocated by PyTorch, and 265.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF

turboderp commented 11 months ago

Would this also be the case if you set auto_split_reserve_bytes to some very high value like 1024**3 (1 GB), with the change to exception handling?

cikkle commented 10 months ago

auto_split_reserve_bytes in tabbyAPI's model.py, right?

This is working, and I still have the other change above applied. It's loading and splitting the model properly now without any errors, and I tested a few completions successfully.

getorca commented 10 months ago

Experiencing the same issue with CUDA; the only thing that seems to work is setting gpu_split manually. I can't find the auto_split_reserve_bytes arg in the repo...

turboderp commented 10 months ago

@getorca Sorry, that was referring to the model.py file in ExUI.

When calling model.load_autosplit elsewhere, you can set the reserve_vram argument, which is a list of sizes, in bytes, to reserve on each GPU. You can set it to something like [256*1024**2] * num_devices to reserve 256 MB on each device, or [512*1024**2] + [64*1024**2] * (num_devices - 1) to reserve 512 MB on the first device and 64 MB on the rest.
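
To make that concrete, here is a minimal sketch following the usual exllamav2 loading pattern; the model path and the two-GPU reserve values are placeholders, and only the reserve_vram argument itself is taken from the comment above:

    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

    config = ExLlamaV2Config()
    config.model_dir = "/path/to/exl2-model"   # placeholder
    config.prepare()

    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy = True)   # cache is allocated as the model is split

    num_devices = 2
    # 512 MB reserved on the first device, 64 MB on the rest
    reserve = [512 * 1024**2] + [64 * 1024**2] * (num_devices - 1)
    model.load_autosplit(cache, reserve_vram = reserve)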

I've tested it a lot, but there's still a bit of fudging that goes into predicting exactly how much VRAM will be used on top of the weights. It could change slightly with a new release of flash-attn, for instance.