Are you on windows or Linux?
Linux, Ubuntu 22.10 with a local miniconda environment
This issue seems to be related to what you're experiencing. https://github.com/oobabooga/text-generation-webui/issues/256
It seems that --gpu-memory is bugged. I've also been having issues with --auto-devices. There might also be a memory leak somewhere.
I should feasibly be able to run the 13B model on my 1060 6GB with --auto-devices enabled, but I haven't had any luck with it.
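For reference, this is a minimal sketch of the kind of launch command being discussed; the model name and the memory cap are placeholders and would need to be adjusted for the card in question:

```sh
# Hypothetical invocation: cap GPU usage (value in GiB) with --gpu-memory and
# let --auto-devices offload the remaining layers to CPU RAM.
python server.py --model llama-13b --auto-devices --gpu-memory 4
```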
I'm now working around it by lowering "Maximum prompt size in tokens" to 1024 - I'm using 512 right now
@alexl83
Have you watched the behavior live while it is processing, either through the NVIDIA X Server settings panel or nvidia-smi (you would have to rerun it repeatedly, but it still works)? If you did, you would have more information about the behavior as it happens. Just a suggestion, and it could lead to more specific answers.
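For anyone who wants to watch this without retyping the command by hand, a simple polling loop does the job (assuming a standard Linux setup with the NVIDIA driver installed):

```sh
# Refresh the full nvidia-smi readout every second while text is being generated.
watch -n 1 nvidia-smi

# Or log just the memory figures to a file for later inspection.
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 1 >> vram.log
```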
@ImpossibleExchange - the model is consistently taking around 17.7 GB of VRAM, regardless of any command-line option. On top of that, on first launch it is loaded into RAM (around 33 GB) and then moved to VRAM; killing the model and reloading seems to skip the RAM step.
@alexl83
Okay, I just spun up the 4-bit model and ran some text through it. For reference, I am running on 2x Ampere cards for 48 GB of total VRAM.
What I found out by sitting and spamming nvidia-smi is that I was seeing around 22-23 GB of total VRAM used while the text was being generated. It would drop back down after it finished, but it was hitting numbers around where you were getting your "out of memory" error.
So I would assume that is perhaps "normal" behavior/usage for the time being. I also had the token length set to 200 tokens, not higher, which leads me to assume that with a higher token threshold for generation you could be going higher still.
I don't really know how much "help" this is, but I can confirm the VRAM usage seems to be normal. I wasn't getting the out-of-memory message because I have the 2x cards.
Trying a smaller model is likely the best suggestion I can give, sadly, as I didn't see anything different on my box (Xubuntu OS). =\
Peace and all the best.
Thanks @ImpossibleExchange I appreciate your support investigating this :) Let's see, things are moving fast!
@alexl83
Also just ran the 30B on a single card, and yeah got an out of memory error. So, I guess that is that.
For some reason, in 4-bit I am not getting OOM at all. Yes, it takes 17 GB, but in my experience that is enough for 2k context.
Sorry about the slow response; I was battling to get Clear Linux working so I could try it out.
Are you able to generate, or are you crashing?
I was getting similar usage at the beginning, but VRAM usage would spike during generation of outputs. This was my experience on both Manjaro and Xubuntu.
@alexl83
Did you try the --auto-devices start argument? If other flags aren't helping you, this one at least got rid of the OOM errors.
I'm seeing the same issue with llama-30b-4bit-128g, and it seems to be worse than with older 4-bit .pt models, so perhaps there was some recent change (from the batch that added .safetensors support) that causes increased VRAM use?
Okay, so this is down to group size. If you don't want to constantly run out of VRAM with llama-30b running on 24 GB, make sure that you use a model quantized without grouping (groupsize -1) rather than with groupsize 128. E.g. one of these: https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483891617
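For context, the group size is baked in when the checkpoint is quantized, not chosen at load time. With GPTQ-for-LLaMA-style tooling the difference comes down to the --groupsize argument passed to the quantization script; the script name, dataset argument, and paths below are illustrative and may differ between forks:

```sh
# Groupless quantization - the variant that fits more comfortably on a 24 GB card.
python llama.py /path/to/llama-30b-hf c4 --wbits 4 --groupsize -1 --save llama-30b-4bit.pt

# Group size 128 - slightly better quality, but higher VRAM use at inference time,
# which is what pushes a 24 GB card over the edge with a 30B model.
python llama.py /path/to/llama-30b-hf c4 --wbits 4 --groupsize 128 --save llama-30b-4bit-128g.pt
```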
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
Dear all, I'm running the 30B model in 4-bit on my 4090 (24 GB) + Ryzen 7700X with 64 GB of RAM.
After generating some tokens when asking it to produce code, I get out-of-memory errors; using --gpu-memory has no effect. Server command line:
python server.py --auto-devices --gpu-memory 20 --load-in-4bit --cai-chat --listen --extensions gallery llama_prompts --model llama-30b-4bit
Using a reddit-found character and preset:
I get an out-of-memory error.
Any help is appreciated. Thank you!