davidenitti opened this issue 1 year ago
Try putting with torch.autocast("cuda"):
at the start of your evaluate function
def generate_response(
    instruction,
    inputs=None,
    temperature=0.7,
    top_p=0.75,
    top_k=40,
    num_beams=4,
    max_new_tokens=128,
    length_penalty=1.0,
    **kwargs,
):
    prompt = prompter.generate_prompt(instruction, inputs)
    with torch.autocast("cuda"):  # useful if you receive the scalar error
        inputs = tokenizer(prompt, return_tensors="pt")
        input_ids = inputs["input_ids"].to(device)
        generation_config = GenerationConfig(
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            num_beams=num_beams,
            length_penalty=length_penalty,
            **kwargs,
        )
        with torch.no_grad():
            generation_output = model.generate(
                input_ids=input_ids,
                generation_config=generation_config,
                return_dict_in_generate=True,
                output_scores=True,
                max_new_tokens=max_new_tokens,
            )
I have the same error, I put the full error in the first post
Hmm, I'm not sure then. peft_model is just a wrapper for LoRA though; it should use the same mechanism for both.
@ElleLeonne can you tell me the versions of the packages you used, like torch or other relevant ones?
@ElleLeonne I am trying to run the generate.py code on CPU but I am getting the same error. I changed float16 to float32. Can you please help?
On CPU? I have no experience with that, but llama.cpp is a CPU pipeline for llama, so you could check out their repo (or try autocasting to CPU maybe? Idk, never done it) @Aditya0123456789 https://github.com/ggerganov/llama.cpp
Though, I am a bit leery that you said you simply "changed" the precision.
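If you want to try the autocast-to-CPU idea, something like this might work (a rough sketch of my own, untested; model, tokenizer, and prompt are assumed to exist as in generate.py, and on CPU autocast only supports bfloat16, so the weights need to stay in float32 rather than half):
import torch

# Rough sketch (assumption, not from the repo): keep the weights in float32 on CPU
# and let autocast run the matmuls in bfloat16; fp16 weights on CPU are what trigger
# the "addmm_impl_cpu_" not implemented for 'Half' error.
model = model.float()  # undo any model.half() call

inputs = tokenizer(prompt, return_tensors="pt")
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    with torch.no_grad():
        output = model.generate(
            input_ids=inputs["input_ids"],
            max_new_tokens=128,
        )
print(tokenizer.decode(output[0], skip_special_tokens=True))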
Is PeftModel the cause of the RuntimeError: "addmm_impl_cpu_" error? Without it, it works, but I guess it's needed to load the full model.
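To check whether the PEFT wrapper is actually what lands on the CPU, here is a small diagnostic sketch (my own addition, assuming model is the variable from generate.py after PeftModel.from_pretrained):
import torch

# Diagnostic sketch: show where accelerate placed the modules, and flag any
# fp16 parameters that ended up on the CPU (those cause the 'Half' addmm error).
print(model.device)
print(getattr(model, "hf_device_map", None))  # per-module map when device_map="auto" is used
for name, param in model.named_parameters():
    if param.device.type == "cpu" and param.dtype == torch.float16:
        print("fp16 parameter on CPU:", name)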
@davidenitti could you upload the code without PeftModel that is working for you?
It was only a test, because PeftModel is necessary to load the alpaca-lora model. However, I managed to fix it somehow by upgrading transformers and peft: first uninstall both of them, then use
pip install git+https://github.com/huggingface/peft.git
pip install git+https://github.com/huggingface/transformers.git
For the code, I use generate.py, changing the part below to make sure it doesn't go CUDA out of memory. You can change the dtype to float32 to make sure it works, and change max_memory if you have more or less GPU memory:
dtype = torch.float16
if device == "cuda":
    model = LlamaForCausalLM.from_pretrained(
        base_model,
        load_in_8bit=load_8bit,
        torch_dtype=dtype,
        device_map="auto",
        offload_folder=offload_folder,
        max_memory={0: "5GiB"},
    )
    model = PeftModel.from_pretrained(
        model,
        lora_weights,
        torch_dtype=dtype,
        device_map="auto",
        offload_folder=offload_folder,
        max_memory={0: "5GiB"},
    )
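As a small follow-up (my own sketch, not part of generate.py): instead of hard-coding "5GiB", the memory budget can be derived from the GPU that is actually present. gpu_budget below is a hypothetical helper.
import torch

def gpu_budget(reserve_gib: int = 2) -> dict:
    # Hypothetical helper: leave `reserve_gib` of headroom for activations/beam search
    # and hand the rest to device_map="auto" via max_memory.
    total_gib = torch.cuda.get_device_properties(0).total_memory // (1024 ** 3)
    return {0: f"{max(total_gib - reserve_gib, 1)}GiB"}

# e.g. pass max_memory=gpu_budget() instead of {0: "5GiB"} in both from_pretrained calls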
I ran into the same error message when launching. In my case, I commented out this section in generate.py, since I'm purely on a CPU:
# if not load_8bit:
#     model.half()  # seems to fix bugs for some users.
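For reference, here is a minimal CPU-only load along those lines (a sketch, not a verbatim copy of generate.py; base_model and lora_weights are the script's usual arguments): keep everything in float32 and never call model.half().
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM

# Sketch of a CPU-only load: float32 everywhere, no model.half(), since fp16
# matmuls ("addmm_impl_cpu_") are not implemented on the CPU.
model = LlamaForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
)
model = PeftModel.from_pretrained(
    model,
    lora_weights,
    torch_dtype=torch.float32,
)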
If you're using the latest version of PEFT, that line of code is actually completely redundant now anyway.
Thanks ,it works for me.
Did anyone manage to solve it?
When I try to run the model I get: RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
which should mean that the model is on the CPU and therefore doesn't support half precision. However, I have CUDA, and the device is cuda at least for the model loaded with LlamaForCausalLM, but the one loaded with PeftModel is on the CPU; I'm not sure if this is related to the problem. (I added offload_folder because it was required):
model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=load_8bit,
    torch_dtype=torch.float16,
    device_map="auto",
    offload_folder=offload_folder,
)
print(model.device)  # model is in cuda
model = PeftModel.from_pretrained(
    model,
    lora_weights,
    torch_dtype=torch.float16,
    offload_folder=offload_folder,
)
print(model.device)  # model is in cpu
Any idea how to fix the problem? With float32 it works but it is super slow. I don't want to use 8-bit; I also had issues with bitsandbytes for GPU, so I removed it for the moment. (But this is another story.)
Full error:
Traceback (most recent call last):
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/gradio/routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1108, in process_api
    result = await self.call_function(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/gradio/blocks.py", line 929, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/gradio/utils.py", line 490, in async_iteration
    return next(iterator)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/gradio/interface.py", line 621, in fn
    for output in self.fn(*args):
  File "/home/davide/code/alpaca-lora/generate.py", line 159, in evaluate
    generation_output = model.generate(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/peft/peft_model.py", line 716, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1524, in generate
    return self.beam_search(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2810, in beam_search
    outputs = self(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 196, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/davide/code/alpaca-lora/venv/lib/python3.10/site-packages/peft/tuners/lora.py", line 466, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
Did you solve your problem? Can you share your solution?
I have the same issue, RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'. What is the final solution? I need your advice.
I successfully solved this problem by starting the webUI with the following macOS Terminal command:
cd stable-diffusion-webui
./webui.sh --precision full --no-half
Special thanks >> https://stable-diffusion-art.com/install-mac/comment-page-1/
When I try to run the model I get: RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
which should mean that the model is on the CPU and thus doesn't support half precision. However, I have CUDA and the device is cuda, at least for the model loaded with LlamaForCausalLM, but the one loaded with PeftModel is on the CPU; I'm not sure if this is related to the issue. (I added offload_folder because it was required.)
Any idea how to fix the issue? With float32 it works but it's super slow. I don't want to use 8-bit; I also had issues with bitsandbytes for GPU, so I removed it for the moment. (But this is another story.)
full error: