oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

Run Llama 3 70B locally combining RAM and VRAM like with other apps? #5965

Open 311-code opened 5 months ago

311-code commented 5 months ago

Sorry, I am pretty novice here in the LLM space. I have noticed that some users are able to run the Llama 3 70B model as a quantized GGUF locally by offloading parts of the model across CPU, RAM, and VRAM in other programs, somehow with a larger context as well. I don't really see any information on how to do this for text-generation-webui (which I much prefer).

I have 24 GB of VRAM and 64 GB of RAM. Can anyone explain which model to download, e.g. from TheBloke (or a better version, uncensored, etc.)? Someone referred me to this one, which allows for a larger context length for Llama 3: https://huggingface.co/models?sort=modified&search=llama+gradient+exl2. What settings should I use, and is this even currently possible with text-generation-webui?

I am also unsure whether this can be done with EXL2 or if I should be using GGUF.

Edit: Apparently flash attention was added to llama.cpp today (https://github.com/ggerganov/llama.cpp/pull/5021) for larger contexts over 64k; not sure if this is relevant.

MTStrothers commented 5 months ago

So, a few things. First off, I was asking about this earlier in the discussions, but it takes a little while, a few weeks I guess, for updates to llama.cpp to trickle into this program. That's because text-generation-webui doesn't use https://github.com/ggerganov/llama.cpp directly; it uses abetlen/llama-cpp-python, which provides Python bindings for llama.cpp. So once llama.cpp updates, llama-cpp-python has to update, and THEN text-generation-webui has to bump its pinned version of llama-cpp-python. You can see in the requirements.txt file that they just bumped this program to llama-cpp-python 0.2.64, while the most recent release of llama-cpp-python is 0.2.68. I guess you could edit the requirements.txt of your local install, but there's a good chance you'd break something.
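If you want to check what your local install is actually running before touching requirements.txt, here is a minimal sketch. It assumes you run it with the same Python interpreter text-generation-webui uses (the one inside its bundled env); `llama-cpp-python` is the package's PyPI name.

```python
# Report which llama-cpp-python version is installed in this environment.
from importlib.metadata import version, PackageNotFoundError

try:
    print("llama-cpp-python:", version("llama-cpp-python"))
except PackageNotFoundError:
    print("llama-cpp-python is not installed in this environment")
```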

As to your main question, I'd recommend this version: https://huggingface.co/bartowski/Meta-Llama-3-70B-Instruct-GGUF

It wasn't quantized with the newest version of llama.cpp, but it's still pretty recent. The guy who makes them says he will have an even newer version of Llama 3 70B up today-ish, so keep an eye out for that. TheBloke is apparently retired, by the way. There are a ton of different versions with decensoring, extended context, etc.; it really depends on your use case. But I'm kind of skeptical of these finetunes at this stage, because with Llama 3 only recently coming out, I don't think many of them are really dialed in yet.

To my knowledge, the only way to properly use both your CPU and GPU together is to use GGUF, so that's what you want. You've got enough memory to run the Q6_K quant without too much trouble, I think; that's a pretty good sweet spot for reducing memory use without losing much accuracy. You will have to splice the two Q6_K files together, but that's pretty easy to do from the command line (see the sketch below).
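For reference, here is a rough sketch of that splicing step. The filenames are hypothetical, and this plain byte concatenation (the equivalent of `cat part1 part2 > merged.gguf`) only applies if the upload was split as raw file chunks; if the parts were produced with llama.cpp's gguf-split tool, use that tool's merge mode instead.

```python
# Hypothetical filenames; adjust to whatever the actual split parts are called.
# This simply concatenates the parts in order, like `cat` on Linux/macOS or
# `copy /b` on Windows. It is NOT the right approach for gguf-split shards.
parts = [
    "Meta-Llama-3-70B-Instruct-Q6_K.gguf.part1of2",
    "Meta-Llama-3-70B-Instruct-Q6_K.gguf.part2of2",
]

with open("Meta-Llama-3-70B-Instruct-Q6_K.gguf", "wb") as merged:
    for part in parts:
        with open(part, "rb") as f:
            # Copy in chunks so we don't hold ~50 GB in memory at once.
            while chunk := f.read(64 * 1024 * 1024):
                merged.write(chunk)
```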

To get it running in text-generation-webui, just drop it into your models folder and load it. It should default to 8k context automatically. The only thing you'll have to play with is n-gpu-layers in the Model tab. Try something like 20 to start, and keep an eye on your resource monitor and the text-generation-webui CLI. Every layer you add to n-gpu-layers increases VRAM usage on your GPU, so you just have to find the sweet spot.
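If it helps to see what those settings map to under the hood, here is a rough equivalent using llama-cpp-python directly; the model path and layer count are placeholders, and the webui normally sets all of this for you from the Model tab.

```python
from llama_cpp import Llama

# Partial offload: n_gpu_layers controls how many layers go to VRAM; the rest
# stay in system RAM and run on the CPU. Raise it until VRAM is nearly full.
llm = Llama(
    model_path="models/Meta-Llama-3-70B-Instruct-Q6_K.gguf",  # placeholder path
    n_gpu_layers=20,  # starting point; adjust while watching VRAM usage
    n_ctx=8192,       # Llama 3's default context length
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```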