johnsmith0031 / alpaca_lora_4bit

MIT License

OOM on inference while I can finetune with more tokens #146

Closed nepeee closed 1 year ago

nepeee commented 1 year ago

Hi!

I finetuned a 30B LLaMA-1 model with the full 2048-token context using xformers on a 3090, but when I try to do inference I get a CUDA OOM after only 866 tokens (including generated ones). Is this normal?

I used the same code as in inference.py:

import torch  # needed for torch.no_grad() below
from alpaca_lora_4bit.amp_wrapper import AMPWrapper  # used further down; module path assumes the packaged layout
from alpaca_lora_4bit.autograd_4bit import load_llama_model_4bit_low_ram_and_offload, Autograd4bitQuantLinear
from alpaca_lora_4bit.monkeypatch.peft_tuners_lora_monkey_patch import replace_peft_model_with_int4_lora_model
replace_peft_model_with_int4_lora_model()

from alpaca_lora_4bit.monkeypatch.llama_attn_hijack_xformers import hijack_llama_attention
hijack_llama_attention()

model, tokenizer = load_llama_model_4bit_low_ram_and_offload(
    configPath, modelPath, lora_path=loraPath,
    groupsize=-1, seqlen=2048, max_memory=None,
    is_v1_model=False, bits=4,
)

wrapper = AMPWrapper(model)
wrapper.apply_generate()

...
with torch.no_grad():
    model.sample(...)
...
johnsmith0031 commented 1 year ago

Maybe it's a bug in the load_llama_model_4bit_low_ram_and_offload function: it does not call model.half(), so the KV cache stays in fp32. You need to call model.half() manually. Or you can use exllama for inference, which is much faster and needs less VRAM, so you can use the full 2048 context length.
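
For reference, a minimal sketch of that workaround against the snippet above (configPath, modelPath and loraPath are the same placeholders; the KV-cache numbers in the comments assume standard LLaMA-30B dimensions, 60 layers and hidden size 6656):

model, tokenizer = load_llama_model_4bit_low_ram_and_offload(
    configPath, modelPath, lora_path=loraPath,
    groupsize=-1, seqlen=2048, max_memory=None,
    is_v1_model=False, bits=4,
)
model.half()  # cast to fp16 so the KV cache is allocated as fp16 instead of fp32

# Rough KV-cache cost per 2048-token sequence for LLaMA-30B (60 layers, hidden size 6656):
#   fp32: 2 (K and V) * 60 * 2048 * 6656 * 4 bytes ~= 6.1 GiB
#   fp16: the same at 2 bytes per value            ~= 3.0 GiB
# A ~3 GiB difference matters when the 4-bit weights already fill most of a 24 GB card.

wrapper = AMPWrapper(model)
wrapper.apply_generate()

prompt = "..."  # placeholder
batch = tokenizer(prompt, return_tensors="pt").to("cuda")  # assumes the first layers sit on the GPU
with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=200)
print(tokenizer.decode(out[0], skip_special_tokens=True))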

nepeee commented 1 year ago

I switched to the AutoGPTQ/HF version; it seems to be OK now, thx ❤️
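
For anyone landing here later, a rough sketch of what that AutoGPTQ route can look like; this is an assumption about the setup, not code from this thread: modelDir is a placeholder for a local GPTQ checkpoint, use_safetensors assumes the weights are stored as .safetensors, and attaching the fine-tuned LoRA adapter is omitted because the thread doesn't say how it was done:

import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

tokenizer = AutoTokenizer.from_pretrained(modelDir)  # modelDir: placeholder for the quantized model directory
model = AutoGPTQForCausalLM.from_quantized(
    modelDir,
    device="cuda:0",
    use_safetensors=True,  # assumption: weights saved as .safetensors
)
model.eval()

batch = tokenizer("...", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=200)
print(tokenizer.decode(out[0], skip_special_tokens=True))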