johnsmith0031 / alpaca_lora_4bit

MIT License

Autograd4bitQuantLinear() does not have a parameter or a buffer named zeros #71

Closed neuhaus closed 1 year ago

neuhaus commented 1 year ago

I followed the readme today and got the following error during "docker run":

By the way, is "DOCCKER_BUILDKIT" a typo in the readme?

$ git clone https://github.com/johnsmith0031/alpaca_lora_4bit.git
$ cd alpaca_lora_4bit
$ DOCCKER_BUILDKIT=1 docker build -t alpaca_lora_4bit . # build step can take 12 min
$ docker run --gpus=all -p 7860:7860 alpaca_lora_4bit

==========
== CUDA ==
==========

CUDA Version 11.7.0

Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
    https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

Triton not found. Please run "pip install triton".

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /root/.local/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
/root/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')}
  warn(msg)
/root/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
Monkey Patch Completed.
Loading ../llama-7b-4bit.pt ...
Loading Model ...
Traceback (most recent call last):
  File "/alpaca_lora_4bit/text-generation-webui/server.py", line 305, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/alpaca_lora_4bit/text-generation-webui/custom_monkey_patch.py", line 19, in load_model_llama
    model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path, groupsize=-1)
  File "/alpaca_lora_4bit/text-generation-webui/autograd_4bit.py", line 202, in load_llama_model_4bit_low_ram
    model = accelerate.load_checkpoint_and_dispatch(
  File "/root/.local/lib/python3.10/site-packages/accelerate/big_modeling.py", line 479, in load_checkpoint_and_dispatch
    load_checkpoint_in_model(
  File "/root/.local/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 946, in load_checkpoint_in_model
    set_module_tensor_to_device(model, param_name, param_device, value=param, dtype=dtype)
  File "/root/.local/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 131, in set_module_tensor_to_device
    raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_name}.")
ValueError: Autograd4bitQuantLinear() does not have a parameter or a buffer named zeros.
johnsmith0031 commented 1 year ago

Maybe it's because the current repo uses the v2 model format by default. If you can edit the files in the Docker container, I think you can set is_v1_model=True when calling load_llama_model_4bit_low_ram.
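
For reference, a minimal sketch of that edit in text-generation-webui/custom_monkey_patch.py, assuming the call signature shown in the traceback (only the is_v1_model keyword is added):

model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path, groupsize=-1, is_v1_model=True)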

andybarry commented 1 year ago

Yes it is a typo in the readme, sorry :/
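
For reference, the corrected variable name is DOCKER_BUILDKIT:

$ DOCKER_BUILDKIT=1 docker build -t alpaca_lora_4bit .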

The issue still happens even if you set is_v1_model=True, because the constructor of Autograd4bitQuantLinear ends up being called twice in custom_monkey_patch.py:

1st call: model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path, groupsize=-1, is_v1_model=True)

2nd call: model = PeftModel.from_pretrained(model, lora_path, device_map={'': 0}, torch_dtype=torch.float32)

The 2nd call rebuilds the layers without the is_v1_model argument, so it fails. Perhaps johnsmith0031 knows how to fix this.
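
One possible stopgap (a hypothetical sketch, not the repo's fix): wrap the class so the keyword is forced on every construction, including the one PEFT triggers. This assumes Autograd4bitQuantLinear accepts is_v1_model as a keyword and that the rebuild resolves the class through the module attribute:

import autograd_4bit

_Orig = autograd_4bit.Autograd4bitQuantLinear

class _ForcedV1QuantLinear(_Orig):
    def __init__(self, *args, **kwargs):
        # Force the v1 layout unless the caller already chose one.
        kwargs.setdefault('is_v1_model', True)
        super().__init__(*args, **kwargs)

# Later lookups through the module (e.g. during PeftModel.from_pretrained)
# now pick up the forced default.
autograd_4bit.Autograd4bitQuantLinear = _ForcedV1QuantLinear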

ehartford commented 1 year ago

I got the same error trying to run my new model with the monkey patch script. I based it on a v2 model and used the command below to finetune the LoRA. I have done this before successfully, so something has changed since then:

GPTQ_VERSION=2 python finetune.py --grad_chckpt --groupsize 128 --cutoff_len 2048 --llama_q4_model ./llama-7b-4bit-128g.safetensors --llama_q4_config_dir ./llama-7b-4bit/ --lora_out_dir ./alpaca-leet10k-lora-7b/ ./leet10k-alpaca-merged.json

the error:

  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/./server.py", line 311, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/custom_monkey_patch.py", line 19, in load_model_llama
    model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path, groupsize=-1)
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/autograd_4bit.py", line 132, in load_llama_model_4bit_low_ram
    model = accelerate.load_checkpoint_and_dispatch(
  File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/accelerate/big_modeling.py", line 479, in load_checkpoint_and_dispatch
    load_checkpoint_in_model(
  File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 946, in load_checkpoint_in_model
    set_module_tensor_to_device(model, param_name, param_device, value=param, dtype=dtype)
  File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 131, in set_module_tensor_to_device
    raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_name}.")
ValueError: Autograd4bitQuantLinear() does not have a parameter or a buffer named qzeros.
johnsmith0031 commented 1 year ago

I think you should reinstall peft from requirements.txt as well, because I've also made some adjustments to it.
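
For anyone retrying this, a plain reinstall sketch (assuming pip and that the repo's requirements.txt pins the patched peft):

$ pip uninstall -y peft
$ pip install -r requirements.txt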

dnouri commented 1 year ago

A freshly built Docker image with the latest from main will now fail with this traceback when trying to run text generation:

Traceback (most recent call last):
  File "/alpaca_lora_4bit/text-generation-webui/modules/callbacks.py", line 66, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/alpaca_lora_4bit/text-generation-webui/modules/text_generation.py", line 220, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/root/.local/lib/python3.10/site-packages/peft/peft_model.py", line 581, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/root/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1485, in generate
    return self.sample(
  File "/root/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2524, in sample
    outputs = self(
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 196, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/root/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/peft/tuners/lora.py", line 690, in forward
    result = super().forward(x)
  File "/alpaca_lora_4bit/text-generation-webui/autograd_4bit.py", line 132, in forward
    out = matmul4bit_with_backend(x, self.qweight, self.scales,
  File "/alpaca_lora_4bit/text-generation-webui/autograd_4bit.py", line 89, in matmul4bit_with_backend
    return mm4b.matmul4bit(x, qweight, scales, qzeros, g_idx)
  File "/alpaca_lora_4bit/text-generation-webui/matmul_utils_4bit.py", line 110, in matmul4bit
    output = _matmul4bit_v1_recons(x.to(scales.dtype), qweight, scales, zeros)
  File "/alpaca_lora_4bit/text-generation-webui/matmul_utils_4bit.py", line 79, in _matmul4bit_v1_recons
    quant_cuda.vecquant4recons_v1(qweight, buffer, scales, zeros)
RuntimeError: expected scalar type Half but found Float

Not sure if it's related, but could it make sense to pin the repos that the Dockerfile pulls (text-generation-webui among them) to specific Git hashes?
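
For illustration, a Dockerfile sketch of the pinning idea (the URL is the upstream text-generation-webui project, assumed to be what the Dockerfile clones; the hash is a placeholder for a known-good commit):

RUN git clone https://github.com/oobabooga/text-generation-webui.git \
    && cd text-generation-webui \
    && git checkout <known-good-commit-hash>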

johnsmith0031 commented 1 year ago

Try changing line 27 in the custom monkey patch from model.groupsize == -1 to model.is_v1_model. If it works, I'll update the repo.
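
Spelled out as a sketch (the surrounding if is an assumption; the comment only names the expression to swap):

# text-generation-webui/custom_monkey_patch.py, line 27
# before — the v1 path was inferred from the group size:
#     if model.groupsize == -1:
# after — use the explicit flag instead:
#     if model.is_v1_model: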

dnouri commented 1 year ago

> Try changing line 27 in the custom monkey patch from model.groupsize == -1 to model.is_v1_model. If it works, I'll update the repo.

Brilliant, that fixes it indeed, thanks! Added a PR for your convenience in #73.

ehartford commented 1 year ago

I had to modify line 19 to:
model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path, groupsize=128, is_v1_model=False)

ehartford commented 1 year ago

Is there any change to be made here, or should this issue be closed? (I.e., should the monkey patch figure out groupsize and is_v1_model on its own, without requiring us to set them manually?)
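
For what it's worth, a hypothetical sketch of such auto-detection, keying off the tensor names that distinguish the two formats in the errors above (v1 checkpoints carry zeros buffers, v2 carry qzeros). The helper, the torch-serialized checkpoint, and the GPTQ shape arithmetic are all assumptions, not repo API:

import torch

def detect_quant_format(checkpoint_path):
    # Hypothetical helper: infer (is_v1_model, groupsize) from a 4-bit
    # checkpoint saved with torch.save.
    state = torch.load(checkpoint_path, map_location='cpu')
    if any(k.endswith('.zeros') for k in state):
        return True, -1  # v1 layout: no grouping
    # v2 layout: assuming the usual GPTQ packing, qweight has
    # in_features // 8 rows and qzeros has in_features // groupsize rows.
    layer = next(k[:-len('.qzeros')] for k in state if k.endswith('.qzeros'))
    in_features = state[layer + '.qweight'].shape[0] * 8
    groupsize = in_features // state[layer + '.qzeros'].shape[0]
    return False, groupsize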

johnsmith0031 commented 1 year ago

Yes, I think we can close this issue. If there are any bugs related to this, we can reopen it.