johnsmith0031 / alpaca_lora_4bit


Vram usage in WebUI with --monkey-patch #93

Open TheDuckingDuck opened 1 year ago

TheDuckingDuck commented 1 year ago

Currently, it seems that starting the WebUI with your --monkey-patch argument uses a lot more VRAM, even without loading a LoRA (using Llama 30B, 128 group size).

Because of this, the max context size on 24 GB of VRAM drops from 1600 to 550.

Are there any ways or plans to improve this in the future? I really like having LoRA, but it's hard to use with such a small context :(

johnsmith0031 commented 1 year ago

Use model.half() and the AMPWrapper:

model.half()  # cast the non-quantized weights to fp16
from amp_wrapper import AMPWrapper
wrapper = AMPWrapper(model)
wrapper.apply_generate()  # patch model.generate() to run under the AMP wrapper

Or use:

model.half()

from model_attn_mlp_patch import make_quant_attn, make_fused_mlp, inject_lora_layers
make_quant_attn(model)  # swap in the fused quantized attention
make_fused_mlp(model)   # swap in the fused MLP

# LoRA
inject_lora_layers(model, lora_path)

which is about 50% faster.
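
(For anyone wiring this up outside the webui, here is a rough end-to-end sketch of how those calls fit together. It is only an illustration: the loader load_llama_model_4bit_low_ram from the repo's autograd_4bit module, the file paths, and the group size are assumptions, not values given in this thread.)

import torch
from autograd_4bit import load_llama_model_4bit_low_ram  # assumed repo loader
from model_attn_mlp_patch import make_quant_attn, make_fused_mlp, inject_lora_layers

# Placeholder paths -- substitute your own config dir, 4-bit checkpoint and LoRA.
config_path = './llama-30b-4bit/'
model_path = './llama-30b-4bit.safetensors'
lora_path = './my-lora/'

model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path, groupsize=128)

model.half()
make_quant_attn(model)                # fused quantized attention
make_fused_mlp(model)                 # fused MLP
inject_lora_layers(model, lora_path)  # attach the LoRA to the patched layers

prompt = 'Hello, how are you?'
input_ids = tokenizer(prompt, return_tensors='pt').input_ids.cuda()
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))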

TheDuckingDuck commented 1 year ago

I'd love to try that out, but after looking at the code for the webui for a bit, I have come to the conclusion that I'm in way over my head here.

tensiondriven commented 1 year ago

@TheDuckingDuck Try using llama30b-4bit with no groupsize and use text-generation-webui with the monkey patch. I could be mistaken, but I believe I'm getting the full context length in 24 GB (1 x 3090 with CUDA_VISIBLE_DEVICES=0). You may also need to set an environment variable to keep allocation size down; that env variable is shown in the OUT OF MEMORY error.

I am using alpaca_lora_4bit for fine-tuning via command line and it works like a champ. I then (currently) use text-generation-webui for inference. Just a few days ago, text-generation-webui got a new http/websockets interface, which may mean that using text-generation-webui for inference with a custom web front end may be the new way to go.
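
(Side note: the allocator variable referred to here is PYTORCH_CUDA_ALLOC_CONF, which PyTorch prints in its CUDA out-of-memory message. A minimal sketch of setting it before the first CUDA allocation; the 512 MB split size is only an example value, not a recommendation from this thread.)

import os
# Must be set before the first CUDA allocation in the process.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:512'  # example value

import torch
torch.cuda.init()                     # the caching allocator picks up the setting
print(torch.cuda.memory_allocated())  # 0 -- nothing allocated yet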

IdiotSandwichTheThird commented 1 year ago


@tensiondriven I'd love to hear more about that. Assuming you're talking about set 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512'?

May I ask which value works best for you?

wesleysanjose commented 1 year ago


How is the speed looking, @tensiondriven?