henk717 / KoboldAI

KoboldAI is generative AI software optimized for fictional use, but capable of much more!
http://koboldai.com
GNU Affero General Public License v3.0

Rapidly Increasing VRAM Use in Chat Mode #508

Closed tales59 closed 5 months ago

tales59 commented 5 months ago

I just used the henk717/KoboldAI Windows 10 installer Feb 15 and am new to this software. I'm using an A2000 12GB GPU with CUDA and loaded a few models available on the standard list (Pygmalion-2 13B, Tiefighter 13B, Mythalion 13B) and chatted for a bit with each, typically about KoboldAI use. As the models load, CUDA version 118 appears to be detected and my GPU is detected with compute capability 8.6. I get a warning that "the attention mask and the pad token id were not set."

After initially loading the model, approximately 8.5GB of VRAM is in use. Responding to prompts fully uses one core of my CPU and makes minimal use of the GPU. The bot responses are good, much better than expected. With each successive prompt, another 200-400MB of VRAM is used. Relatively quickly, VRAM becomes full and shared memory is used. When this occurs, my 3-4 tokens per second generation rate is cut by 90%, my GPU is obviously struggling (and heating up), and the CPU continues to fully use one core. My ancient Xeon X3480 CPU and 32GB of 1333MHz RAM can't help much - actually, I only tried KoboldAI because the alternatives I've used require AVX instruction sets, which my processor lacks.

Wondering if the rapidly increasing VRAM use is a feature or a bug related to the attention mask / pad token id. Is there any way to avoid using shared GPU memory with the 13B models on 12GB GPUs?

kobold_debug(3).json

henk717 commented 5 months ago

The attention mask and pad id message is normal; it automatically picks the correct values. Increased VRAM use is also normal: more VRAM is needed to fit the context as it grows, but it is not allocated up front, so people have the freedom to pick a max context size they can handle.
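
For a rough sense of scale, here is a back-of-the-envelope sketch of how much the KV cache alone grows with context for a LLaMA-2 13B class model (the layer count, hidden size, and fp16 cache are assumptions on my part; the real footprint is higher once activation buffers and allocator overhead are counted):

```python
# Back-of-the-envelope KV cache size for a LLaMA-2 13B class model
# (assumed: 40 layers, hidden size 5120, fp16 cache; real usage is higher
# because of activation buffers and allocator overhead).
def kv_cache_gib(context_len, n_layers=40, hidden_size=5120, dtype_bytes=2):
    # keys + values for every layer, one vector of hidden_size per token
    return 2 * n_layers * hidden_size * dtype_bytes * context_len / 1024**3

for ctx in (1024, 2048, 4096):
    print(f"max context {ctx:>4}: ~{kv_cache_gib(ctx):.2f} GiB of KV cache")
# max context 1024: ~0.78 GiB of KV cache
# max context 2048: ~1.56 GiB of KV cache
# max context 4096: ~3.12 GiB of KV cache
```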

In the Nvidia settings you can disable the CUDA sysmem fallback to prevent system RAM from being used, but you still need enough VRAM to fit everything.
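
If you want to see how close you are to spilling over, a small sketch with the pynvml bindings (assuming you have `pip install pynvml` available; KoboldAI itself does not need this) can report usage while you chat:

```python
# Minimal VRAM monitor using the NVML bindings (pip install pynvml).
# Run it in a separate terminal while chatting to see when usage approaches
# the 12GB mark, where the sysmem fallback would otherwise kick in.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

try:
    while True:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"VRAM: {mem.used / 1024**3:.1f} / {mem.total / 1024**3:.1f} GiB",
              end="\r", flush=True)
        time.sleep(2)
finally:
    pynvml.nvmlShutdown()
```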

For your system I recommend checking out our sister project https://koboldai.org/cpp, which can load GGUF models. You can find those on https://huggingface.co (in this case, search for "Tiefighter GGUF"). Once you have Tiefighter's GGUF page open, click on the Files tab and download the Q4_K_S file by clicking the small download icon next to it.
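
If you prefer scripting the download instead of clicking through the site, something like the following works with the huggingface_hub package; the repo id and filename below follow TheBloke's usual naming and are an assumption, so verify them against the model's Files tab:

```python
# Hedged sketch: fetch a Q4_K_S GGUF with huggingface_hub
# (pip install huggingface_hub). The repo id and filename are assumptions
# based on TheBloke's usual naming; confirm them on the model's Files tab.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/LLaMA2-13B-Tiefighter-GGUF",
    filename="llama2-13b-tiefighter.Q4_K_S.gguf",
)
print("Saved to", path)
```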

tales59 commented 5 months ago

Thanks henk717,

Unfortunately, your sister project (cpp) that loads GGUF models requires AVX instruction sets on the CPU, the same as two alternatives I've tried (LM Studio and another whose name I can't remember). My X3480 lacks those instruction sets. Still, a 10-prompt chat that increases GPU memory use by 6GB seems excessive. I'm not yet convinced that you don't have a memory leak. Since KoboldAI is the only project I've found that loads on this machine, I'll try a couple of things before giving up:

  1. Load the official version of KoboldAI built in Nov 2023 to see if it exhibits the same behavior with rapidly increasing VRAM use. If not, I'll report that back to you.
  2. Load your project with the same model on another machine with 24GB VRAM. I suspect I'll see the same behavior, rapidly increasing VRAM use which eventually exhausts VRAM then slows to a crawl. It'll take longer than with the 12GB VRAM GPU but I'll be shocked if a 30-40 prompt chat with a 13B model doesn't exhaust the 24GB VRAM.

The vast majority of people interested in running local LLMs still have 12GB (or less) of VRAM like mine. If KoboldAI can't run 13B models on 12GB of VRAM, you've lost a lot of potential users. You may want to think about optimizing memory management to avoid loading so much potentially worthless context into VRAM. When VRAM use exceeds 90%, consider ways to prioritize worthwhile context instead of jamming it into shared memory. The overall user experience would likely improve.

I've had good luck with the 70B Llama GGUF 5-bit K_M models on my 24GB VRAM / 256GB RAM machines. However, the 13B GGUF models haven't been functional for me. The Tiefighter and Pygmalion-2 13B models on KoboldAI are the only two 13Bs I've tried that were functional (until they ran out of VRAM).

Greg

henk717 commented 5 months ago

Ah I see. It's still not a bug, since it's just Huggingface being inefficient. For me, 4K context on 13B Tiefighter consumes 17.5GB instantly in my 4K test but then remains rock solid, which indicates there is no leak; if there had been a leak, usage would have kept increasing steadily instead of leveling off.

But fear not, we have other alternatives you can use on your particular setup. You simply need to switch to a better model format our software supports (Assuming your GPU is new enough for these formats).

Either use TheBloke/LLaMA2-13B-Tiefighter-AWQ, which has the most efficient memory use of the Tiefighter versions I tested (it consumes exactly 12GB of VRAM on my system in the same test), or use KoboldAI/LLaMA2-13B-Tiefighter-GPTQ, which runs on our ExllamaV2 backend. If you opt for the second one, you will only be able to use up to 2K context when first downloading the model; if you then load it from a local folder, there is a dropdown that says Huggingface. Change that to Exllama(V2) and you get faster speeds plus a slider that lets you turn the context back up to 4K.
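
Outside the KoboldAI UI, a minimal sketch like this can sanity-check whether the GPTQ weights themselves fit on a 12GB card, assuming transformers with the optimum/auto-gptq extras and accelerate installed (this is only a standalone check, not the path KoboldAI uses internally):

```python
# Hedged sketch: a standalone check that the GPTQ weights fit in VRAM using
# plain transformers (pip install transformers optimum auto-gptq accelerate).
# This is not the ExllamaV2 path KoboldAI itself uses; it only confirms the
# quantized weights load and generate on a 12GB card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KoboldAI/LLaMA2-13B-Tiefighter-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "You are standing in an open field west of a white house."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=40)

print(tokenizer.decode(output[0], skip_special_tokens=True))
print(f"VRAM allocated: {torch.cuda.memory_allocated() / 1024**3:.1f} GiB")
```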

As for the context optimization suggestions, this already exists in the form of the Max Context slider. Adjust it as needed.

tales59 commented 5 months ago

Thanks.

Reducing context to 1024 allows me to use your Tiefighter 13B with my 12GB of VRAM. After the model loads, 7.7GB is used, and it maxes out during chat at just under 10GB with 1024 context. Doubling the context to 2048 adds another 3-4GB of VRAM, exhausting my A2000 GPU. So you're right, there's no leak; the Huggingface backend just gradually fills the context at the start of a session until the context limit is reached. I never realized context made this much difference in memory use. According to the Tiefighter 13B chatbot, doubling context increases VRAM use by 50-70%. That's consistent with my observations. I was surprised the bot came up with that.

I was also able to use Pygmalion-2 7B with context set to 3072 and keep the model in VRAM. I was surprised at how good this model was for chatting. It showed similar behavior when increasing context: the model loaded with 4.7GB of VRAM, 2048 context tokens maxed out at about 9.6GB, and 4096 tokens used over 13GB, overfilling my VRAM. So now I understand how to make sure the models stay in VRAM.

Playing with the GGUF models on more powerful machines with 256GB RAM, 70-80GB/s RAM bandwidth, and 24GB VRAM, it wasn't so obvious that context made such a difference in memory use. I never worried about keeping them in VRAM since they were advertised to run on both CPU and GPU (and to some extent they do, but not always after the bot starts responding).

Unfortunately, I've not yet had much luck loading other models from PyTorch Huggingface repositories with KoboldAI. I'll try the ExllamaV2 backend with the GPTQ variant tonight.

TheBloke/LLaMA2-13B-Tiefighter-AWQ won't load with Huggingface. I downloaded the model using "load model from pytorch repositories in huggingface" in KoboldAI. The download apparently went smoothly, but the loading failed. The attached error is apparently fairly common, but the solution is still not evident to me:

"DLL load failed while importing awq_inference_engine: The specified procedure could not be found."

There are some relatively recent comments on the oobabooga forums about this error, so I'm not sure this is a problem unique to KoboldAI. AWQ fixed this late last year, according to one post.

Thanks,

Greg

henk717 commented 5 months ago

I know the AWQ backend itself works, so I suspect AWQ was compiled for AVX2 upstream. Hopefully one of the GPTQ versions is compatible with your system; otherwise you will indeed have to keep the context low. I suspect the BNB backend that automatically converts the 16-bit models to 4-bit (the one you got working) keeps the context in 16-bit format, which is why it's not proportional to the model itself.
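
A quick diagnostic sketch to confirm that suspicion, assuming the py-cpuinfo package is available (this is not something KoboldAI runs itself):

```python
# Hedged diagnostic: report which AVX instruction sets the CPU advertises
# (pip install py-cpuinfo). If the prebuilt AWQ extension assumes AVX2, a
# CPU without it (like the Xeon X3480) cannot use that wheel.
import cpuinfo

flags = cpuinfo.get_cpu_info().get("flags", [])
for feature in ("avx", "avx2", "avx512f"):
    print(f"{feature:>8}: {'yes' if feature in flags else 'no'}")
```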