Closed: cikkle closed this issue 4 months ago
It might be that ROCm has slightly more variable overhead than CUDA. The loader will reserve 96 MB of VRAM for that by default, but I've never had any ROCm devices to actually test it on, so it's quite possible that's simply not enough.
At the top of `model.py`, the reserved space is set by the variable `auto_split_reserve_bytes`. Could you try increasing that to, say, `256 * 1024**2` or more and see if that makes a difference? If it works I can make it the default behavior for ROCm, but sadly I have no way to test it myself at the moment.
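A minimal sketch of that change, assuming the variable sits at module level in tabbyAPI's `model.py` as described:

```python
# Per-GPU VRAM reservation used during autosplit loading.
# The default is 96 MB; raising it leaves more headroom for
# ROCm's allocation overhead (256 MiB shown as a starting point).
auto_split_reserve_bytes = 256 * 1024**2
```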
No luck, pretty much the same error.
Successfully preprocessed all matching files.
Loading model: /home/o0/ai/tabbyAPI/models/Euryale-1.3-L2-70B-4.65bpw-h6-exl2
Modules |█████████████████▉ | 91/163
Traceback (most recent call last):
File "/home/o0/ai/tabbyAPI/main.py", line 227, in <module>
for (module, modules) in load_status:
File "/home/o0/ai/tabbyAPI/model.py", line 171, in load_gen
yield from self.model.load_autosplit_gen(self.cache, reserve_vram = reserve, last_id_only = True, callback_gen = progress_callback)
File "/home/o0/miniconda3/lib/python3.11/site-packages/exllamav2/model.py", line 365, in load_autosplit_gen
hidden_state = module.forward(hidden_state, cache = cache, attn_mask = attn_mask, past_len = past_len, loras = loras)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/o0/miniconda3/lib/python3.11/site-packages/exllamav2/attn.py", line 376, in forward
if attn_mask is not None: attn_weights = attn_weights + attn_mask
~~~~~~~~~~~~~^~~~~~~~~~~
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 23.98 GiB of which 342.00 MiB is free. Of the allocated memory 23.16 GiB is allocated by PyTorch, and 204.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
I tried some higher values as well, but I'm still OOMing.
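Separately, the error message above points at an allocator knob worth trying; a hedged sketch, assuming ROCm builds of PyTorch honor `PYTORCH_HIP_ALLOC_CONF` with the same key:value syntax CUDA builds use for `PYTORCH_CUDA_ALLOC_CONF`:

```python
import os

# Must be set before torch makes its first allocation
# (ideally before `import torch`). Assumption: ROCm builds parse
# PYTORCH_HIP_ALLOC_CONF the same way CUDA builds parse
# PYTORCH_CUDA_ALLOC_CONF, as the error message implies.
os.environ["PYTORCH_HIP_ALLOC_CONF"] = "max_split_size_mb:512"
```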
Ooooh, hmm! Actually it might just be down to the OoM exception looking different for ROCm.
The exception is caught during auto-loading to detect when one GPU is full and it's time to proceed to the next one. Which seems to be quite reliable, except the code looks like this:
except Exception as e:
    test = 0
    if "CUDA out of memory" in str(e):
        fail = True  # Exception object will hold references to tensors so we can't free them here
    else:
        raise
If you're compiling from source, you could try changing "CUDA" to "HIP" on line 371 of exllamav2/model.py, and that might do the trick.
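A sketch of a backend-agnostic version of that check (the `is_oom` helper name is hypothetical, not something in ExLlamaV2):

```python
def is_oom(e: Exception) -> bool:
    """Match the out-of-memory message from either the CUDA or HIP backend."""
    msg = str(e)
    return "CUDA out of memory" in msg or "HIP out of memory" in msg
```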
Changing the string and reinstalling from source apparently lets it finish loading the model, but then it still throws an OOM exception anyway:
Loading model: /home/o0/ai/tabbyAPI/models/Euryale-1.3-L2-70B-4.65bpw-h6-exl2
Modules |████████████████████████████████| 163/163
Traceback (most recent call last):
File "/home/o0/ai/tabbyAPI/main.py", line 227, in <module>
for (module, modules) in load_status:
File "/home/o0/ai/tabbyAPI/model.py", line 175, in load_gen
self.model.forward(input_ids, cache = self.cache, preprocess_only = True)
File "/home/o0/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/o0/.local/lib/python3.11/site-packages/exllamav2-0.0.8-py3.11-linux-x86_64.egg/exllamav2/model.py", line 582, in forward
r, ls = self._forward(input_ids = input_ids[:, chunk_begin : chunk_end],
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/o0/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/o0/.local/lib/python3.11/site-packages/exllamav2-0.0.8-py3.11-linux-x86_64.egg/exllamav2/model.py", line 655, in _forward
x = module.forward(x, cache = cache, attn_mask = attn_mask, past_len = past_len, loras = loras)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/o0/.local/lib/python3.11/site-packages/exllamav2-0.0.8-py3.11-linux-x86_64.egg/exllamav2/attn.py", line 376, in forward
if attn_mask is not None: attn_weights = attn_weights + attn_mask
~~~~~~~~~~~~~^~~~~~~~~~~
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 23.98 GiB of which 136.00 MiB is free. Of the allocated memory 23.29 GiB is allocated by PyTorch, and 265.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
Would this also be the case if you set `auto_split_reserve_bytes` to some very high value like `1024**3` (1 GB), with the change to exception handling?
`auto_split_reserve_bytes` in tabbyAPI's `model.py`, right?
This is working, and I still have the other change above applied. It's loading and splitting the model properly now without any errors, and I tested a few completions successfully.
Experiencing the same issues with CUDA; the only thing that seems to work is setting the gpu_split manually. I can't find the `auto_split_reserve_bytes` arg in the repo...
@getorca Sorry, that was referring to the `model.py` file in ExUI.

When calling `model.load_autosplit` elsewhere, you can set the `reserve_vram` argument, which is a list of sizes, in bytes, to reserve on each GPU. You can set it to something like `[256*1024**2] * num_devices` to reserve 256 MB on each device, or `[512*1024**2] + [64*1024**2] * (num_devices - 1)` to reserve 512 MB on the first device and 64 MB on the rest.
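A runnable sketch of those two reservation schemes (the `num_devices` value is illustrative, and the `load_autosplit` call is left commented out since it needs a live model and cache):

```python
num_devices = 2  # illustrative: a dual-GPU system

# 256 MB reserved on every device:
reserve_uniform = [256 * 1024**2] * num_devices

# 512 MB on the first device, 64 MB on each of the rest:
reserve_weighted = [512 * 1024**2] + [64 * 1024**2] * (num_devices - 1)

# model.load_autosplit(cache, reserve_vram=reserve_weighted)
```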
I've tested it a lot, but there's still a bit of fudging that goes into predicting exactly how much VRAM will be used on top of the weights. It could change slightly with a new release of flash-attn, for instance.
Loading a 70B / 4.625 bpw model on a dual 7900 XTX system (ROCm 5.7.1, amdgpu driver 6.2.4). Manually setting `gpu_split` to `22,24` always loads fine, but `auto` always results in an OOM error:
This is on a headless server with nothing else occupying VRAM. The settings are applied through the config file in tabbyAPI but I get the same behavior in ooba too when I don't specify a split and it falls back on auto, apparently by default.
Not sure what other relevant information I can provide.