TheDuckingDuck opened 1 year ago
Use `model.half()` together with `AMPWrapper`:

```python
model.half()

from amp_wrapper import AMPWrapper
wrapper = AMPWrapper(model)
wrapper.apply_generate()
```
Or use:

```python
model.half()

from model_attn_mlp_patch import make_quant_attn, make_fused_mlp, inject_lora_layers
make_quant_attn(model)
make_fused_mlp(model)

# LoRA
inject_lora_layers(model, lora_path)
```
which is 50% faster
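The 50% figure will depend on your GPU, model, and sequence length; if you want to verify it on your own setup, a minimal throughput check could look like the sketch below. Here `generate_fn` is a hypothetical stand-in for whatever generation call you use (e.g. `wrapper`'s patched `model.generate`), not an API from this repo:

```python
import time

def tokens_per_second(generate_fn, prompt, n_runs=3):
    """Rough throughput benchmark: average generated tokens per second
    over n_runs calls to generate_fn(prompt)."""
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate_fn(prompt)  # expected to return the generated tokens
        elapsed = time.perf_counter() - start
        rates.append(len(tokens) / elapsed)
    return sum(rates) / len(rates)
```

Run it once before and once after applying the patches on the same prompt to get a comparable tokens/sec number.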
I'd love to try that out, but after looking at the code for the webui for a bit, I have come to the conclusion that I'm in way over my head here.
@TheDuckingDuck Try using llama30b-4bit with no groupsize and use text-generation-webui with the monkey patch. I could be mistaken, but I believe I'm getting full context length in 24GB (1 x 3090 with CUDA_VISIBLE_DEVICES=0). You may also need to specify an environment variable to keep allocation size down. That env variable is shown in the OUT OF MEMORY error.
I am using alpaca_lora_4bit for fine-tuning via command line and it works like a champ. I then (currently) use text-generation-webui for inference. Just a few days ago, text-generation-webui got a new http/websockets interface, which may mean that using text-generation-webui for inference with a custom web front end may be the new way to go.
> @TheDuckingDuck Try using llama30b-4bit with no groupsize and use text-generation-webui with the monkey patch. I could be mistaken, but I believe I'm getting full context length in 24GB (1 x 3090 with CUDA_VISIBLE_DEVICES=0). You may also need to specify an environment variable to keep allocation size down. That env variable is shown in the OUT OF MEMORY error.
>
> I am using alpaca_lora_4bit for fine-tuning via command line and it works like a champ. I then (currently) use text-generation-webui for inference. Just a few days ago, text-generation-webui got a new http/websockets interface, which may mean that using text-generation-webui for inference with a custom web front end may be the new way to go.
I'd love to hear more about that. I assume you're referring to `set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512`?
May I ask which value works best for you?
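For what it's worth, these settings only take effect if they are in the environment before PyTorch initializes CUDA; one way to set them from Python (the 512 value is just the example from above, not a recommendation):

```python
import os

# Must be set BEFORE importing torch / initializing CUDA, or it has no effect.
# A smaller max_split_size_mb caps the allocator's block size, which can reduce
# fragmentation-driven OOMs at some performance cost.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # pin to a single GPU
```

Setting them in the shell before launching (`export` on Linux, `set` on Windows) works the same way.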
> @TheDuckingDuck Try using llama30b-4bit with no groupsize and use text-generation-webui with the monkey patch. I could be mistaken, but I believe I'm getting full context length in 24GB (1 x 3090 with CUDA_VISIBLE_DEVICES=0). You may also need to specify an environment variable to keep allocation size down. That env variable is shown in the OUT OF MEMORY error.
>
> I am using alpaca_lora_4bit for fine-tuning via command line and it works like a champ. I then (currently) use text-generation-webui for inference. Just a few days ago, text-generation-webui got a new http/websockets interface, which may mean that using text-generation-webui for inference with a custom web front end may be the new way to go.
How does the speed look, @tensiondriven?
Currently, it seems that starting the WebUI with your --monkey-patch argument uses a lot more VRAM, even without loading a LoRA (using Llama 30B with group size 128).
Because of this, the max context size on 24GB of VRAM drops from 1600 to 550.
Are there any ways or plans to improve on this in the future? I really like having LoRA, but it's hard to use with such a small context :(