tloen / alpaca-lora

Instruct-tune LLaMA on consumer hardware
Apache License 2.0

RuntimeError: "addmm_impl_cpu_" not implemented for 'Half' #308

Open davidenitti opened 1 year ago

davidenitti commented 1 year ago

When I try to run the model I get: RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

which should mean that the model is on CPU and therefore doesn't support half precision. However, I have CUDA, and the device is cuda at least for the model loaded with LlamaForCausalLM; the one loaded with PeftModel is on cpu, though, and I'm not sure whether that is related to the issue. (I added offload_folder because it was required):

model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=load_8bit,
    torch_dtype=torch.float16,
    device_map="auto",
    offload_folder=offload_folder,
)
print(model.device)  # the base model is on cuda

model = PeftModel.from_pretrained(
    model,
    lora_weights,
    torch_dtype=torch.float16,
    offload_folder=offload_folder,
)
print(model.device)  # the PEFT-wrapped model is on cpu

Any idea how to fix the issue? With float32 it works, but it's super slow. I don't want to use 8 bits; I also had issues with bitsandbytes for GPU, so I removed it for the moment (but that's another story).
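One way to see where the weights actually ended up is the quick sketch below (it just inspects the model object from the snippet above); any 'cpu' entries would explain the failing fp16 matmul:

import collections

# Per-module placement chosen by accelerate when device_map="auto" is used
# (may be None if no device map was set on this wrapper).
print(getattr(model, "hf_device_map", None))

# Count which devices the parameters actually live on, e.g.
# Counter({'cuda': 200, 'cpu': 91}) if some layers were offloaded.
device_counts = collections.Counter(p.device.type for p in model.parameters())
print(device_counts)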

full error:

Traceback (most recent call last):
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/gradio/routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1108, in process_api
    result = await self.call_function(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/gradio/blocks.py", line 929, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/gradio/utils.py", line 490, in async_iteration
    return next(iterator)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/gradio/interface.py", line 621, in fn
    for output in self.fn(*args):
  File "/home/davide/code/alpaca-lora/generate.py", line 159, in evaluate
    generation_output = model.generate(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/peft/peft_model.py", line 716, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1524, in generate
    return self.beam_search(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2810, in beam_search
    outputs = self(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 196, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/peft/tuners/lora.py", line 466, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
ElleLeonne commented 1 year ago

Try putting with torch.autocast("cuda"): at the start of your evaluate function

def generate_response(
    instruction,
    inputs=None,
    temperature=0.7,
    top_p=0.75,
    top_k=40,
    num_beams=4,
    max_new_tokens=128,
    length_penalty=1.0,
    **kwargs,
):
    prompt = prompter.generate_prompt(instruction, inputs)

    with torch.autocast("cuda"):  # useful if you receive the scalar/'Half' error
        inputs = tokenizer(prompt, return_tensors="pt")
        input_ids = inputs["input_ids"].to(device)
        generation_config = GenerationConfig(
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            num_beams=num_beams,
            length_penalty=length_penalty,
            **kwargs,
        )
        with torch.no_grad():
            generation_output = model.generate(
                input_ids=input_ids,
                generation_config=generation_config,
                return_dict_in_generate=True,
                output_scores=True,
                max_new_tokens=max_new_tokens,
            )
davidenitti commented 1 year ago

I still get the same error; the full traceback is in the first post.

ElleLeonne commented 1 year ago

Hmm, I'm not sure then. peft_model is just a wrapper for LoRA though; it should use the same mechanism for both.

davidenitti commented 1 year ago

@ElleLeonne can you tell me the versions of the packages you used, like torch and the other relevant ones?

Aditya0123456789 commented 1 year ago

@ElleLeonne I am trying to run the generate.py code on CPU but I am getting the same error. I changed float16 to float32. Can you please help?

ElleLeonne commented 1 year ago

On CPU? I have no experience there, but llama.cpp is a CPU pipeline for LLaMA; you could check out their repo (or maybe try autocasting to CPU? I don't know, I've never done it) @Aditya0123456789 https://github.com/ggerganov/llama.cpp

Though, I am a bit leery that you said you simply "changed" the precision.
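For what it's worth, a minimal CPU-only loading sketch (with placeholder model paths, not the repo's exact generate.py) would keep everything in float32 and never call model.half(), since fp16 matmuls are exactly what is not implemented on CPU:

import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

base_model = "path-or-hub-id-of-base-llama"      # placeholder
lora_weights = "path-or-hub-id-of-lora-adapter"  # placeholder

# Load in full precision and pin everything to CPU; do not call model.half() here.
model = LlamaForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float32,
    device_map={"": "cpu"},
)
model = PeftModel.from_pretrained(
    model,
    lora_weights,
    torch_dtype=torch.float32,
)
model.eval()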

davidenitti commented 1 year ago

It seems PeftModel is the cause of the RuntimeError: "addmm_impl_cpu_" error: without it, it works, but I guess it's needed to load the full model.

cologne-12 commented 1 year ago

@davidenitti could you upload the code without PeftModel that is working for you?

davidenitti commented 1 year ago

@davidenitti could you upload the code without PeftModel that is working for you?

It was only a test, because PeftModel is necessary to load the alpaca-lora model. However, I managed to fix it by upgrading transformers and peft: first uninstall both of them, then run

pip install git+https://github.com/huggingface/peft.git
pip install git+https://github.com/huggingface/transformers.git
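
As a quick sanity check (just a sketch), you can confirm which versions the virtualenv actually imports afterwards:

import peft, torch, transformers

print("peft:", peft.__version__)
print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())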

For the code, I use generate.py, changing the part below so that it doesn't run out of CUDA memory. You can switch to float32 to confirm it works, and adjust max_memory if you have more or less GPU memory:

dtype = torch.float16
if device == "cuda":
    model = LlamaForCausalLM.from_pretrained(
        base_model,
        load_in_8bit=load_8bit,
        torch_dtype=dtype,
        device_map="auto",
        offload_folder=offload_folder,
        max_memory={0: "5GiB"},
    )
    model = PeftModel.from_pretrained(
        model,
        lora_weights,
        torch_dtype=dtype,
        device_map="auto",
        offload_folder=offload_folder,
        max_memory={0: "5GiB"},
    )
firefly2442 commented 1 year ago

I ran into the same error message when launching. In my case, I commented out this section in generate.py since I'm purely on a CPU.

# if not load_8bit:
        # model.half()  # seems to fix bugs for some users.
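
An alternative sketch, if you still want fp16 for GPU users (assuming the device and load_8bit names from generate.py), is to guard the call instead of deleting it:

# Only halve the weights when they actually live on the GPU, where fp16 kernels exist.
if device == "cuda" and not load_8bit:
    model.half()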
ElleLeonne commented 1 year ago

I ran into the same error message when launching. In my case, I commented out this section in generate.py since I'm purely on a CPU.

# if not load_8bit:
        # model.half()  # seems to fix bugs for some users.

If you're using the latest version of PEFT, this line of code is actually completely redundant now anyway.

awsl-dbq commented 1 year ago

It was only a test, because PeftModel is necessary to load the alpaca-lora model. However, I managed to fix it by upgrading transformers and peft [full solution quoted above]

Thanks, it works for me.

ia2cobrit commented 1 year ago

Did anyone manage to solve it?

When I try to run the model I get: RuntimeError: "addmm_impl_cpu_" not implemented for 'Half' [original post and full traceback quoted from the top of this issue]

DID YOU SOLVE YOUR PROBLEM? Can you share your solution?

sev7n4 commented 10 months ago

I have the same issue, RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'. What is the final solution? I need your advice.

sev7n4 commented 10 months ago

I have successfully solved this problem (in my case in stable-diffusion-webui, which hits the same error).

Start the webUI with the following macOS Terminal commands:

cd stable-diffusion-webui
./webui.sh --precision full --no-half

Special thanks to https://stable-diffusion-art.com/install-mac/comment-page-1/

https://stackoverflow.com/questions/75641074/i-run-stable-diffusion-its-wrong-runtimeerror-layernormkernelimpl-not-implem/75859574#75859574?newreg=4188d469fd6b411b9837743e51b57a78