theroyallab / tabbyAPI

An OAI compatible exllamav2 API that's both lightweight and fast
GNU Affero General Public License v3.0

Very strange OOM errors across multiple GPUs: OOMs, BSODs, and extreme driver crashes, all stemming from TabbyAPI #187

Open SytanSD opened 2 weeks ago

SytanSD commented 2 weeks ago

OS

Windows

GPU Library

CUDA 12.x

Python version

3.11

Describe the bug

I have been dealing with this issue for a while. A brief synopsis: when loading things into VRAM on my GPU, I can go all the way to 23.5 GB with no problems. I can allocate 23.5 GB of tensors, I can train LoRAs to 23.5 GB, I can render things in Blender to 23.5 GB. I would like to add that exl2 in ooba does not seem to have this issue. I am back to ooba for the time being (but I MUCH prefer Tabby).

But when I try to load large models in Tabby, it will randomly "forget" how much VRAM I have. For example: I have a model that uses 21 GB of VRAM when loaded. I can load it, and it will OOM anywhere from 3 GB to 20 GB in, saying I have no more VRAM left. I can verify this in Task Manager, nvidia-smi, and NVIDIA Nsight: they show only that amount of VRAM is used, and the rest is free. But here is the weird thing: it will cause my PC to start purging VRAM from applications to make more room. So say it OOMs at 16.4 GB; it will start closing Discord, Firefox, VS Code, everything, to try and squeeze out more room. Sometimes it does this so hard it bluescreens my PC with error MEMORY_MANAGEMENT. Some of these OOMs will literally say "Tried to allocate 20MiB, 12.9GiB free, insufficient space".

Longer explanation:

Whenever this error happens and I keep Tabby open with a model loaded in exl2, my whole PC loses the ability to allocate more VRAM. It will show 19 GB used, and when I try to allocate 3 GB more, it will say insufficient space, as long as Tabby is left open after this issue is triggered. The moment I close it, I can use the full 23.5 GB again. Here it is saying I have insufficient space to allocate 3 GiB when Task Manager/SMI/Nsight all say I have 5 GB free:

[screenshots attached]

The moment I close Tabby, I can load the full 23.5 GB again.
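For anyone trying to reproduce this, a quick way to compare the driver's view of free VRAM with what a CUDA process actually sees is a few lines of Python (a sketch only, assuming the pynvml package and PyTorch are installed):

```python
import pynvml
import torch

# Driver's view (roughly what nvidia-smi / Task Manager report)
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"driver:  {info.free / 2**30:.2f} GiB free / {info.total / 2**30:.2f} GiB total")

# CUDA runtime's view from inside this process
free, total = torch.cuda.mem_get_info(0)
print(f"process: {free / 2**30:.2f} GiB free / {total / 2**30:.2f} GiB total")

pynvml.nvmlShutdown()
```

If the two readings disagree while Tabby is holding a model, that would match the behavior described above.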

Here are some of the solutions I have tried:

- Different GPU: happens on all GPUs
- Driver update: still happens
- DDU and driver update: still happens
- Uninstall and reinstall CUDA: still happens
- Uninstall and reinstall Python: still happens
- Windows repair: still happens
- Drive repair: still happens
- Mem test: memory is fine
- VRAM test: VRAM is fine
- Uninstall/reinstall Tabby: still happens (though this will likely fix it for a load or two)
- Restarting the GPU driver: still happens
- Restarting the PC: still happens (unless I fully power off at the switch, then back on, for some reason)
- Clearing tensor allocations with a command: it will clear my whole GPU, and still OOM early
- Rolling back EXL2: still happens
- Loading without fast tensors: still happens
- Loading small models: works just fine, no problems until the OOM error gets bad enough that even they can't load

Weird Behavior:

Filling VRAM with a dummy tensor and then loading a small LLM to max VRAM will not OOM at the same threshold. If I am getting OOM at 12 GB, and I fill the card to 12 GB with a dummy tensor, then load a small LLM up to 19 GB, it works fine. If I clear that dummy tensor and try to load a 13 GB model, it will OOM. This suggests that the issue is Tabby specifically, and not CUDA/drivers/something else.
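The dummy-tensor experiment described above is easy to reproduce in plain PyTorch (an illustrative sketch; the sizes are placeholders):

```python
import torch

def fill_vram(gib, device="cuda:0"):
    # Hold roughly `gib` GiB of VRAM with a throwaway float16 tensor (2 bytes per element)
    n_elements = int(gib * (1 << 30)) // 2
    return torch.empty(n_elements, dtype=torch.float16, device=device)

free, total = torch.cuda.mem_get_info(0)
print(f"before:     {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")

dummy = fill_vram(12)                               # reserve ~12 GiB, as in the experiment above
free, total = torch.cuda.mem_get_info(0)
print(f"with dummy: {free / 2**30:.1f} GiB free")   # the remainder can still be filled by a small model

del dummy
torch.cuda.empty_cache()                            # hand the reservation back to the driver
```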

At one point, it crashed so hard that all of my GPUs showed error signs, I lost all my VRAM, my PC fell back onto Windows basic drivers, and it was detecting no displays. I had to do a full DDU for NVIDIA and Intel after this instance in order for my displays to work at all. Images of that happening: [screenshots attached]

Reproduction steps

Loading any exl2 model over about 12 GB in size can cause this issue, sometimes smaller. I have loaded a 6 GB LLM and had it tell me I had insufficient space at 4 GB, when I still had 20 GiB free. This issue has also happened on my 3060 Ti at as little as 2.3 GB.

Expected behavior

It's expected that the LLM will load. When it OOMs, it will say that there are several GB of space left in all programs, including nvidia-smi, Task Manager, and NVIDIA Nsight. The models I am loading have worked for weeks, and will still randomly work from time to time. I have been running local LLMs for almost 4 years at this point, and I have never had this sort of issue before.

Logs

- [Screenshot] OOMing at only 17 GB.
- [Screenshot] Memory hitting 17 GB, erroring, and then clearing.
- [Screenshot] It happening at only 16 GB, where it would then instantly suspend any program I tried to load for being out of memory. The moment I closed Tabby, everything worked fine.
- [Screenshot] OOMing at 18 GB this time, two times in a row. The third time it BSOD'd my PC with error MEMORY_MANAGEMENT.

Additional context

This issue happened to me a few times about 1.5 months ago, and then mysteriously disappeared; I didn't do anything to remedy it. It came back about a month later, and I have been dealing with it daily. Turning my PC off fully (as in unplugging it) seems to fix the issue. I think it might be clearing something building up on the GPUs, maybe? Deleting the venv and reinstalling it will also fix it for a single load, before it starts to OOM again.

Every time it OOMs in succession without causing a BSOD, it runs out of more and more memory. For example, last night I got it to do this 4 times in a row, and these were the values at which it OOM'd: 19.6 GB, 18.2 GB, 17.6 GB, 16.4 GB, then a BSOD. Fully unplugging my PC fixed it for one go, until it BSOD'd immediately after at 14.4 GB.

I have had this issue on Python 3.9, 3.10, and 3.11, as well as CUDA 11.8, 12.1, and 12.6.

Acknowledgements

turboderp commented 2 weeks ago

Thank you for being verbose.

What stands out to me is the behavior you describe with other applications closing, which I've never seen happen in response to running out of VRAM, but which is kind of expected if you start running out of system RAM. It looks like you have 64 GB, which should be more than enough for the models you're talking about, but there might be a memory leak somewhere?
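One way to check the memory-leak theory is to watch the Tabby process's resident memory while a model loads (a sketch assuming psutil is installed; TABBY_PID is a placeholder for the real process ID):

```python
import time
import psutil

TABBY_PID = 12345                     # placeholder: find the real PID in Task Manager

proc = psutil.Process(TABBY_PID)
while True:
    rss_gib = proc.memory_info().rss / 2**30
    print(f"tabbyAPI RSS: {rss_gib:.2f} GiB")
    time.sleep(5)                     # RSS climbing steadily during/after loads would point to a system-RAM leak
```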

Have you tried enabling fasttensors in Tabby's config.yml? There are sometimes issues on Windows related to safetensors and the way it handles memory-mapping. There was a bug in how that setting was applied in 0.2.0, but it's fixed in the dev branch if you're able to build from source. Otherwise there's an update coming soon with that and a few other fixes.
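For context on why that setting can matter, the two loading strategies differ roughly like this (an illustrative sketch using the safetensors Python API; this is not Tabby's actual loading code, and the filename is a placeholder):

```python
from safetensors import safe_open
from safetensors.torch import load as load_from_bytes

path = "model-00001-of-00002.safetensors"   # placeholder filename

# Memory-mapped path: tensor data is paged in from the file on demand
with safe_open(path, framework="pt", device="cpu") as f:
    mmapped = {name: f.get_tensor(name) for name in f.keys()}

# Read-then-parse path: the whole file is read into RAM first, no mmap involved
with open(path, "rb") as f:
    in_memory = load_from_bytes(f.read())
```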

SytanSD commented 2 weeks ago

> Have you tried enabling fasttensors in Tabby's config.yml? There are sometimes issues on Windows related to safetensors and the way it handles memory-mapping. There was a bug in how that setting was applied in 0.2.0, but it's fixed in the dev branch if you're able to build from source. Otherwise there's an update coming soon with that and a few other fixes.

I have had this issue on 0.1.8, 0.1.9, and 0.2.0, and one of the first things I tried was not loading with fast tensors. I am not sure if this is a Tabby or an EXL2 issue at the moment. Oobabooga works just fine on 0.2.0, and I think it's just a Tabby issue, though of course I have no way to know for sure.

I have been using oobabooga just fine, and I did a stress test of 200 loads and unloads, and it has been fine with no issues... so it really does make me think it's just Tabby, or something specifically causing issues with it. If it were a minor issue, I wouldn't mind, but literally locking up my PC and fully crashing my GPUs with a BSOD is a pretty huge issue.
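For reference, a load/unload stress loop along those lines can be run against exllamav2 directly (a rough sketch only; MODEL_DIR is a placeholder and the calls reflect the exllamav2 Python API as of the 0.2.x releases):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config

MODEL_DIR = "/path/to/exl2/model"     # placeholder path to an exl2-quantized model

for i in range(200):
    config = ExLlamaV2Config()
    config.model_dir = MODEL_DIR
    config.prepare()

    model = ExLlamaV2(config)
    model.load()                      # a gpu_split list can be passed for multi-GPU setups
    model.unload()
    print(f"load/unload cycle {i + 1} OK")
```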

In response to the out-of-RAM theory: I have been nowhere near my system RAM limit, so I also don't know. For me, when I hit the OOM limit, I can't allocate anything else. It will start unloading programs like Spotify/Firefox/Discord/VS Code, and they will all freeze up, saying they are suspended. Closing them and trying to load them again will make them repeatedly crash, and I need to close Tabby for them to be able to load again. It gets so aggressive that it can sometimes unload my driver, or core Windows programs, to try to make room (Device Manager, Task Manager, Event Log, and even Snipping Tool are all examples of core Windows programs that have been unloaded by this OOM error).

SytanSD commented 1 week ago

@turboderp So, after a few days of using oobabooga with no updates, it has developed the same problem, but only with exl2. This leads me to reasonably assume that it IS an EXL2 issue specifically. I am getting the exact same error as seen in Tabby, but with a much lower BSOD chance, it seems. It behaves identically to the issue seen in Tabby. Here is a screenshot of its behavior from inside oobabooga:

[screenshot attached]

As you can see, it literally says I have 18.41 GiB free, but not enough to allocate 122 MiB. It causes all the exact same issues as Tabby, with full OOM errors, VRAM purging, and the inability to access the rest of my VRAM. It's a nightmare, and it is identical to Tabby in all of its issues and behavior.
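If it helps with triage, the disagreement between the reported free memory and the failed allocation can be captured at the moment of failure (a sketch that assumes the error surfaces as a PyTorch CUDA OOM, which matches the message quoted above):

```python
import torch

def try_alloc(n_mib, device="cuda:0"):
    # Attempt a small allocation and dump both memory views if it fails
    try:
        return torch.empty(n_mib * 2**20, dtype=torch.uint8, device=device)
    except torch.cuda.OutOfMemoryError:
        free, total = torch.cuda.mem_get_info(device)
        print(f"runtime reports {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")
        print(torch.cuda.memory_summary(device))   # the allocator's own view, including fragmentation
        raise

try_alloc(122)   # e.g. the 122 MiB allocation that failed in the screenshot
```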

turboderp commented 1 week ago

Could you try 0.2.1? It has some minor fixes to the model loading code with fasttensors enabled.

SytanSD commented 5 days ago

@turboderp I wanted to give it some extra time, but I have since updated both Tabby and oobabooga to 0.2.1, and I can say that as of now, the issues have fully disappeared on both. I even stress-tested loading models back to back. Not a single issue so far, so I am gonna go out on a limb and say you might have fixed it! I am not sure if it's just a temporary fix, or if it's actually fully fixed, but I am happy! Thank you so much!

I will say, I updated my Tabby to the new 0.2.1 (it was on 0.2.0, and was confirmed to BSOD every time I used it) without even clearing the venv, so the update alone fixed whatever was up without having to rebuild the venv, which makes me really think it WAS whatever you fixed

I will monitor and stress test it more before I end up closing this, but it does seem to have been an Exl2 issue, specifically

turboderp commented 5 days ago

Well, I completely replaced the code now, and it won't use the safetensors library at all for loading files going forward. So that should be a more permanent fix at any rate. It's only on the dev branch for now, though, as it needs a little more testing.
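For anyone curious what loading .safetensors files without the safetensors library involves: the on-disk format itself is just an 8-byte little-endian header length, a JSON header, and then raw tensor data. A minimal header-parsing sketch (for illustration only; this is not the new exllamav2 loader, and the filename is a placeholder):

```python
import json
import struct

def read_safetensors_header(path):
    # Return the JSON header of a .safetensors file, without using the safetensors library
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))   # 8-byte little-endian header size
        return json.loads(f.read(header_len))

# Example: list tensor names, dtypes, and shapes
for name, meta in read_safetensors_header("model.safetensors").items():
    if name != "__metadata__":                           # optional metadata entry in the header
        print(name, meta["dtype"], meta["shape"])
```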