turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Really high system RAM + slow load #371

Closed: NinjaPerson24119 closed this issue 3 months ago

NinjaPerson24119 commented 6 months ago

Using 2x 7900 XTX on EndeavourOS + pytorch nightly for ROCm 6.0

When I try to load a 70B model (~40 GB), my system stalls out. I see system RAM max out at ~30 of 32 GB, which doesn't make a lot of sense.

Based on the high system RAM usage, it looks like I'd basically need system memory equal to the VRAM just to load the model.

But I thought the loader was supposed to transfer directly to the GPU.

A slightly smaller model (Euryale) loads in ~172s from an SSD. I tried to load XWin 4bpw, which is slightly larger, and it just hangs forever.

turboderp commented 6 months ago

Yeah, safetensors is weird sometimes.

It doesn't transfer directly to the GPU but it uses memory mapping and that should leave it to the OS to not map in too much of the file at once. I guess something could be going wrong in coordinating everything between ROCm, Torch and the OS, but it's really hard to say. Do you notice any swap usage while it's loading?

One thing you could try would be using the fasttensors option in the model config. This uses a different code path, avoiding the safetensors library and loading as directly as possible via direct IO and pinned memory. It's a bit experimental, though.
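
Concretely that just means setting the attribute on the config before you load. A minimal sketch, assuming a standard autosplit load (the model path is just a placeholder):

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/models/XWin-70B-4.0bpw-exl2"  # placeholder path
config.prepare()
config.fasttensors = True  # experimental loader: direct IO + pinned memory

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)  # lazy cache so autosplit can allocate as it loads
model.load_autosplit(cache)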

I can't tell if 172s is abnormal. For a 70B model and a SATA SSD it sounds about right, but then a slightly larger model should only take slightly longer.

NinjaPerson24119 commented 6 months ago

No swap usage, just steady increasing system RAM usage.

It's worth noting that llama.cpp also hangs when loading unless memory mapping is turned off.

OK, I'll give that a try. I'd rather avoid having to upgrade to 64 GB of system RAM.

edit: actually, the swap might be rising a tad, but it's hard to tell because it loads so slowly

NinjaPerson24119 commented 6 months ago

--fast_safetensors arg seems to fix it. I'll have to figure out where Oobabooga sets that and add a shim for now, thanks :tada:
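
Something like this is roughly what I have in mind for the shim, in case it helps anyone else. The exact place to hook into Oobabooga may differ, so treat it as a sketch:

# Force fasttensors on every ExLlamaV2Config the webui builds,
# without editing the installed exllamav2 package on disk.
from exllamav2 import ExLlamaV2Config

_orig_prepare = ExLlamaV2Config.prepare

def _prepare(self, *args, **kwargs):
    result = _orig_prepare(self, *args, **kwargs)
    self.fasttensors = True  # same effect as passing --fast_safetensors
    return result

ExLlamaV2Config.prepare = _prepare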

NinjaPerson24119 commented 6 months ago

The same machine loads in about 10s with a larger RAM kit on the normal loading code path, so ironically fast_safetensors isn't nearly as fast.

turboderp commented 6 months ago

It depends on your bandwidth and other factors. fast_safetensors uses direct I/O and bypasses any caching normally done by the system. The main purpose is to stream quickly from very fast storage since I'm loading a lot of models very often from an NVMe array that can theoretically push 14 GB/s. fast_safetensors achieves about 10 GB/s on this array with a cold cache (likely bottlenecked by my GPUs), while regular memory-mapped I/O is limited by the kernel to less than half that speed.

YMMV on other hardware, but at the very least it gives you a way to bypass the safetensors library when it's acting up.
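
If you want to compare the two code paths on your own hardware, a rough timing script like the one below works. The model path is a placeholder, and for a fair cold-cache number you'd run it once per setting in a fresh process, dropping the page cache in between:

import time
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

def timed_load(model_dir: str, fast: bool):
    config = ExLlamaV2Config()
    config.model_dir = model_dir
    config.prepare()
    config.fasttensors = fast  # flip between the two loaders

    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy = True)

    t0 = time.time()
    model.load_autosplit(cache)
    print(f"fasttensors={fast}: loaded in {time.time() - t0:.1f}s")

timed_load("/models/Euryale-70B-exl2", fast = False)  # placeholder path; rerun with fast = True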

lufixSch commented 6 months ago

I'm observing the same thing on my AMD GPU during model loading. It even goes so far that the process is sometimes killed by the OS because it fills up all my RAM.

I'm pretty sure something changed in the last few weeks, because it only started crashing one or two weeks ago. Is it possible that ROCm 6.0 or the related PyTorch updates increased the RAM usage?

@NinjaPerson24119 did you find out where to set --fast_safetensors in Oobabooga? I would like to try that too.

turboderp commented 6 months ago

I think the underlying issue is either in the safetensors library or in some interaction with ROCm. There are other reports of it using too much system memory in some situations and not others. I'll look into a third option for loading tensors that avoids memory mapping altogether, though perhaps without direct I/O and pinned memory.
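
Roughly, the idea would be plain buffered reads into a CPU tensor followed by a single copy to the device. Just a sketch of the direction (the header parsing that supplies offset, dtype and shape is left out), not a promise of how it will actually land:

import torch

def read_tensor(f, offset: int, length: int, dtype: torch.dtype, shape: tuple, device: str):
    # Plain buffered read: no mmap, no direct IO, no pinned staging buffer.
    f.seek(offset)
    buf = bytearray(f.read(length))  # writable buffer for torch.frombuffer
    t = torch.frombuffer(buf, dtype = dtype).reshape(shape)
    return t.to(device)  # one copy, CPU -> device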

@lufixSch If you can find the Python sources for ExLlamaV2 in TGW's env, you can try forcing the option on by editing exllamav2/config.py. Change line 21 from:

fasttensors: bool = False

to

fasttensors: bool = True

lufixSch commented 6 months ago

@turboderp Thanks! That works. I finally don't have to close my browser anymore to load Mixtral xD.

For me, I see no difference in loading speed between fasttensors = False and fasttensors = True.

ghost commented 6 months ago

I had a similar issue on Linux Mint 21.3 with a 7900 XTX after I updated ooba. But after setting fasttensors = True it was completely fixed, thankfully.

I'm really wondering what caused this issue, since I've heard other people reporting it around the time quadratic sampling picked up steam.

NinjaPerson24119 commented 6 months ago

Can confirm. My earlier result was probably a one-off: with fast safetensors off, the model seems to stay cached in system RAM after being loaded to the GPU and unloaded. Speed is about the same either way.

rsoika commented 5 months ago

I may have a similar problem. I'm running in a Docker container, and the call to model.load_autosplit(cache) always leads to a RuntimeError:

app-1  | ==========
app-1  | == CUDA ==
app-1  | ==========
app-1  | 
app-1  | CUDA Version 12.1.1
app-1  | 
app-1  | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
app-1  | 
app-1  | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
app-1  | By pulling and using the container, you accept the terms and conditions of this license:
app-1  | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
app-1  | 
app-1  | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
app-1  | 
app-1  | Loading model:1 /models/Mistral-7B-Instruct-v0.2-5.0-bpw-exl2/
app-1  | Traceback (most recent call last):
app-1  |   File "/app/./main.py", line 29, in <module>
app-1  |     model.load_autosplit(cache)
app-1  |   File "/usr/local/lib/python3.10/dist-packages/exllamav2/model.py", line 349, in load_autosplit
app-1  |     for item in f: x = item
app-1  |   File "/usr/local/lib/python3.10/dist-packages/exllamav2/model.py", line 476, in load_autosplit_gen
app-1  |     raise RuntimeError("Insufficient VRAM for model and cache")
app-1  | RuntimeError: Insufficient VRAM for model and cache

I installed exllamav2-0.0.17+cu121-cp310-cp310-linux_x86_64.whl directly and did not build it from source. Do I still have a way to test the 'fasttensors' option?

rsoika commented 5 months ago

Sorry, in my case the model I tried was probably just too big (4.5 GB). I've now downloaded a smaller one (2.5 GB) and the app works: https://huggingface.co/turboderp/Mistral-7B-instruct-exl2/tree/2.5bpw