Maybe it's because the current repo uses the v2 model format by default. If you can edit the files in the Docker container, I think you can set is_v1_model=True when calling load_llama_model_4bit_low_ram.
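If it helps, a minimal sketch of that edit inside the container's custom_monkey_patch.py could look like this (the variable names mirror the snippets later in this thread, and groupsize=-1 assumes an ungrouped v1 checkpoint):

```python
# Sketch of the suggested change: pass is_v1_model explicitly to the loader.
# config_path / model_path are whatever the monkey patch already defines.
model, tokenizer = load_llama_model_4bit_low_ram(
    config_path,
    model_path,
    groupsize=-1,      # assumed: v1 checkpoints are typically ungrouped
    is_v1_model=True,  # expect the old 'zeros' tensor layout instead of 'qzeros'
)
```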
Yes, it's a typo in the readme, sorry :/
The issue still happens even if you set is_v1_model=True, because the constructor of Autograd4bitQuantLinear is called twice in custom_monkey_patch.py:
1st call: model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path, groupsize=-1, is_v1_model=True)
2nd call: model = PeftModel.from_pretrained(model, lora_path, device_map={'': 0}, torch_dtype=torch.float32)
In the 2nd call, the constructor is called again without the is_v1_model argument, so it fails. Perhaps johnsmith0031 knows how to fix this.
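One possible workaround (a sketch only, and not the fix that was eventually adopted further down) would be to make every layer built during the second pass default to the v1 layout. This assumes the layer constructor accepts the same is_v1_model keyword as the loader and that it is looked up through the autograd_4bit module at call time:

```python
import autograd_4bit

_OrigQuantLinear = autograd_4bit.Autograd4bitQuantLinear

class _V1DefaultQuantLinear(_OrigQuantLinear):
    """Force the v1 layout for layers built during PeftModel.from_pretrained."""
    def __init__(self, *args, **kwargs):
        kwargs.setdefault("is_v1_model", True)  # assumed keyword, see note above
        super().__init__(*args, **kwargs)

# Rebind so the second construction pass also picks up the v1 default.
autograd_4bit.Autograd4bitQuantLinear = _V1DefaultQuantLinear
```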
I got the same error trying to run my new model with the monkey patch script. I based it on a v2 model and used the command below to finetune the LoRA. I've done this successfully before, so something has changed since then.
GPTQ_VERSION=2 python finetune.py --grad_chckpt --groupsize 128 --cutoff_len 2048 --llama_q4_model ./llama-7b-4bit-128g.safetensors --llama_q4_config_dir ./llama-7b-4bit/ --lora_out_dir ./alpaca-leet10k-lora-7b/ ./leet10k-alpaca-merged.json
the error:
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/./server.py", line 311, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/custom_monkey_patch.py", line 19, in load_model_llama
model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path, groupsize=-1)
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/autograd_4bit.py", line 132, in load_llama_model_4bit_low_ram
model = accelerate.load_checkpoint_and_dispatch(
File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/accelerate/big_modeling.py", line 479, in load_checkpoint_and_dispatch
load_checkpoint_in_model(
File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 946, in load_checkpoint_in_model
set_module_tensor_to_device(model, param_name, param_device, value=param, dtype=dtype)
File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 131, in set_module_tensor_to_device
raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_name}.")
ValueError: Autograd4bitQuantLinear() does not have a parameter or a buffer named qzeros.
I think you should reinstall peft from requirements.txt as well, because I've also made some adjustments to it.
A freshly built Docker image with the latest from main will now fail with this traceback when trying to run text generation:
Traceback (most recent call last):
File "/alpaca_lora_4bit/text-generation-webui/modules/callbacks.py", line 66, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/alpaca_lora_4bit/text-generation-webui/modules/text_generation.py", line 220, in generate_with_callback
shared.model.generate(**kwargs)
File "/root/.local/lib/python3.10/site-packages/peft/peft_model.py", line 581, in generate
outputs = self.base_model.generate(**kwargs)
File "/root/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1485, in generate
return self.sample(
File "/root/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2524, in sample
outputs = self(
File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
outputs = self.model(
File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 196, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/.local/lib/python3.10/site-packages/peft/tuners/lora.py", line 690, in forward
result = super().forward(x)
File "/alpaca_lora_4bit/text-generation-webui/autograd_4bit.py", line 132, in forward
out = matmul4bit_with_backend(x, self.qweight, self.scales,
File "/alpaca_lora_4bit/text-generation-webui/autograd_4bit.py", line 89, in matmul4bit_with_backend
return mm4b.matmul4bit(x, qweight, scales, qzeros, g_idx)
File "/alpaca_lora_4bit/text-generation-webui/matmul_utils_4bit.py", line 110, in matmul4bit
output = _matmul4bit_v1_recons(x.to(scales.dtype), qweight, scales, zeros)
File "/alpaca_lora_4bit/text-generation-webui/matmul_utils_4bit.py", line 79, in _matmul4bit_v1_recons
quant_cuda.vecquant4recons_v1(qweight, buffer, scales, zeros)
RuntimeError: expected scalar type Half but found Float
Not sure if it's related, but could it make sense to pin the repos we pull in the Dockerfile, like text-generation-webui, to specific Git hashes?
Try changing line 27 in custom_monkey_patch.py from model.groupsize == -1 to model.is_v1_model. If it works I'll update the repo.
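A rough sketch of the idea behind that change, assuming the condition controls whether the v1-style zeros buffer is cast to half precision along with the scales (the surrounding code below is a paraphrase for illustration, not the actual file contents). That would also line up with the "expected scalar type Half but found Float" error above:

```python
from autograd_4bit import Autograd4bitQuantLinear

def fit_v1_zeros_to_half(model):
    # Previously the check was `groupsize == -1`; keying on is_v1_model makes
    # sure v1 checkpoints get their float 'zeros' buffer cast to half so it
    # matches the half-precision scales used by the v1 recons kernel.
    for m in model.modules():
        if isinstance(m, Autograd4bitQuantLinear) and m.is_v1_model:
            m.zeros = m.zeros.half()
    return model
```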
Brilliant, that fixes it indeed, thanks! Added a PR for your convenience in #73.
I had to modify line 19 to:
model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path, groupsize=128, is_v1_model=False)
Is there any change to be made here, or should this issue be closed? (i.e. should the monkey patch figure out groupsize and is_v1_model on its own, without requiring us to set them manually here?)
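If the monkey patch were to detect these itself, one way would be to peek at the checkpoint's tensor names and shapes. Something like the sketch below, where guess_quant_params is a made-up helper and the key/shape conventions are the usual GPTQ v1/v2 ones, so treat it as an illustration rather than the repo's code:

```python
from safetensors import safe_open

def guess_quant_params(model_path):
    """Guess (groupsize, is_v1_model) from a .safetensors GPTQ checkpoint."""
    with safe_open(model_path, framework="pt") as f:
        keys = list(f.keys())
        # v2 checkpoints store packed 'qzeros'; v1 checkpoints store float 'zeros'.
        is_v1_model = not any(k.endswith(".qzeros") for k in keys)
        groupsize = -1
        if not is_v1_model:
            layer = next(k[:-len(".qzeros")] for k in keys if k.endswith(".qzeros"))
            # 4-bit qweight packs 8 rows per int32, so recover in_features first.
            in_features = f.get_tensor(layer + ".qweight").shape[0] * 8
            n_groups = f.get_tensor(layer + ".scales").shape[0]
            if n_groups > 1:
                groupsize = in_features // n_groups
    return groupsize, is_v1_model
```

The monkey patch could then call something like guess_quant_params(model_path) before load_llama_model_4bit_low_ram instead of hard-coding the arguments.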
Yes, I think we can close this issue. If there are any bugs related to this, we can reopen it.
I followed the readme today and got the following error during "docker run":
By the way, is "DOCCKER_BUILDKIT" a typo in the readme?