nomic-ai / gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
https://nomic.ai/gpt4all
MIT License

Trying to Run gpt4all on GPU, Windows 11: RuntimeError: "addmm_impl_cpu_" not implemented for 'Half' #292

Closed. Aunxfb closed this issue 1 year ago.

Aunxfb commented 1 year ago

Summary: Can't get past RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

Since the error seems to be due to things not being run on the GPU(?), here is what I tried.

I tried to run gpt4all on the GPU with the following code from the README:

from nomic.gpt4all import GPT4AllGPU
from transformers import LlamaTokenizer

m = GPT4AllGPU(".\\alpaca-lora-7b")
config = {'num_beams': 2,
          'min_new_tokens': 10,
          'max_length': 100,
          'repetition_penalty': 2.0}
out = m.generate('write me a story about a lonely computer', config)
print(out)

I also downloaded some models from https://huggingface.co/models.

I tried llama-7b-hf and alpaca-lora-7b; both hit the same error. The following is the full console output:

│ <my path>\test.py:9 in <module>                                                                 │
│                                                                                                  │
│    6 │   │     'min_new_tokens': 10,                                                             │
│    7 │   │     'max_length': 100,                                                                │
│    8 │   │     'repetition_penalty': 2.0}                                                        │
│ ❱  9 out = m.generate('write me a story about a lonely computer', config)                        │
│   10 print(out)                                                                                  │
│   11                                                                                             │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\nomic\gpt4all\gpt4all.py:47 in generate                       │
│                                                                                                  │
│    44 │   │   │   generate_config = {}                                                           │
│    45 │   │                                                                                      │
│    46 │   │   input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.to(self.model.   │
│ ❱  47 │   │   outputs = self.model.generate(input_ids=input_ids,                                 │
│    48 │   │   │   │   │   │   │   │   │     **generate_config)                                   │
│    49 │   │                                                                                      │
│    50 │   │   decoded = self.tokenizer.decode(outputs[0], skip_special_tokens=True).strip()      │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\peft\peft_model.py:581 in generate                            │
│                                                                                                  │
│   578 │   │   self.base_model.prepare_inputs_for_generation = self.prepare_inputs_for_generati   │
│   579 │   │   try:                                                                               │
│   580 │   │   │   if not isinstance(self.peft_config, PromptLearningConfig):                     │
│ ❱ 581 │   │   │   │   outputs = self.base_model.generate(**kwargs)                               │
│   582 │   │   │   else:                                                                          │
│   583 │   │   │   │   if "input_ids" not in kwargs:                                              │
│   584 │   │   │   │   │   raise ValueError("input_ids must be provided for Peft model generati   │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\torch\utils\_contextlib.py:115 in decorate_context            │
│                                                                                                  │
│   112 │   @functools.wraps(func)                                                                 │
│   113 │   def decorate_context(*args, **kwargs):                                                 │
│   114 │   │   with ctx_factory():                                                                │
│ ❱ 115 │   │   │   return func(*args, **kwargs)                                                   │
│   116 │                                                                                          │
│   117 │   return decorate_context                                                                │
│   118                                                                                            │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\transformers\generation\utils.py:1504 in generate             │
│                                                                                                  │
│   1501 │   │   │   │   **model_kwargs,                                                           │
│   1502 │   │   │   )                                                                             │
│   1503 │   │   │   # 13. run beam search                                                         │
│ ❱ 1504 │   │   │   return self.beam_search(                                                      │
│   1505 │   │   │   │   input_ids,                                                                │
│   1506 │   │   │   │   beam_scorer,                                                              │
│   1507 │   │   │   │   logits_processor=logits_processor,                                        │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\transformers\generation\utils.py:2763 in beam_search          │
│                                                                                                  │
│   2760 │   │   │                                                                                 │
│   2761 │   │   │   model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)  │
│   2762 │   │   │                                                                                 │
│ ❱ 2763 │   │   │   outputs = self(                                                               │
│   2764 │   │   │   │   **model_inputs,                                                           │
│   2765 │   │   │   │   return_dict=True,                                                         │
│   2766 │   │   │   │   output_attentions=output_attentions,                                      │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\torch\nn\modules\module.py:1501 in _call_impl                 │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\accelerate\hooks.py:165 in new_forward                        │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\transformers\models\llama\modeling_llama.py:710 in forward    │
│                                                                                                  │
│   707 │   │   return_dict = return_dict if return_dict is not None else self.config.use_return   │
│   708 │   │                                                                                      │
│   709 │   │   # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)    │
│ ❱ 710 │   │   outputs = self.model(                                                              │
│   711 │   │   │   input_ids=input_ids,                                                           │
│   712 │   │   │   attention_mask=attention_mask,                                                 │
│   713 │   │   │   position_ids=position_ids,                                                     │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\torch\nn\modules\module.py:1501 in _call_impl                 │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\accelerate\hooks.py:165 in new_forward                        │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\transformers\models\llama\modeling_llama.py:598 in forward    │
│                                                                                                  │
│   595 │   │   │   │   │   None,                                                                  │
│   596 │   │   │   │   )                                                                          │
│   597 │   │   │   else:                                                                          │
│ ❱ 598 │   │   │   │   layer_outputs = decoder_layer(                                             │
│   599 │   │   │   │   │   hidden_states,                                                         │
│   600 │   │   │   │   │   attention_mask=attention_mask,                                         │
│   601 │   │   │   │   │   position_ids=position_ids,                                             │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\torch\nn\modules\module.py:1501 in _call_impl                 │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\accelerate\hooks.py:165 in new_forward                        │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\transformers\models\llama\modeling_llama.py:313 in forward    │
│                                                                                                  │
│   310 │   │   hidden_states = self.input_layernorm(hidden_states)                                │
│   311 │   │                                                                                      │
│   312 │   │   # Self Attention                                                                   │
│ ❱ 313 │   │   hidden_states, self_attn_weights, present_key_value = self.self_attn(              │
│   314 │   │   │   hidden_states=hidden_states,                                                   │
│   315 │   │   │   attention_mask=attention_mask,                                                 │
│   316 │   │   │   position_ids=position_ids,                                                     │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\torch\nn\modules\module.py:1501 in _call_impl                 │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\accelerate\hooks.py:165 in new_forward                        │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\transformers\models\llama\modeling_llama.py:214 in forward    │
│                                                                                                  │
│   211 │   ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:       │
│   212 │   │   bsz, q_len, _ = hidden_states.size()                                               │
│   213 │   │                                                                                      │
│ ❱ 214 │   │   query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.   │
│   215 │   │   key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.he   │
│   216 │   │   value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.   │
│   217                                                                                            │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\torch\nn\modules\module.py:1501 in _call_impl                 │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\accelerate\hooks.py:165 in new_forward                        │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ <my path>\.venv\lib\site-packages\peft\tuners\lora.py:357 in forward                            │
│                                                                                                  │
│   354 │   │   │                                                                                  │
│   355 │   │   │   return F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bi   │
│   356 │   │   elif self.r > 0 and not self.merged:                                               │
│ ❱ 357 │   │   │   result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.   │
│   358 │   │   │   if self.r > 0:                                                                 │
│   359 │   │   │   │   result += self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling    │
│   360 │   │   │   return result                                                                  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

Not sure if it matters, but my specs: 64 GB RAM, RTX 3080 (10 GB), i7-12700K.
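For context, "addmm_impl_cpu_" not implemented for 'Half' means a float16 (half-precision) matrix multiply was dispatched to PyTorch's CPU backend, which does not implement it; in other words, at least some of the model's linear layers stayed on the CPU in fp16 instead of being placed on the GPU. A minimal diagnostic sketch, assuming the GPT4AllGPU wrapper exposes the underlying transformers model as m.model (an assumption about its internals):

import torch

# If this prints False, the CUDA build of torch is not installed and
# everything will silently fall back to the CPU.
print(torch.cuda.is_available(), torch.version.cuda)

# List any weights that stayed on the CPU in float16 -- these are the
# ones that trigger the 'Half' kernel errors.
# NOTE: m.model is an assumed attribute of the GPT4AllGPU wrapper.
for name, param in m.model.named_parameters():
    if param.device.type == "cpu" and param.dtype == torch.float16:
        print("fp16 weight left on CPU:", name)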

GoMightyAlgorythmGo commented 1 year ago

I got it running on Windows 11, so it's definitely possible; not sure what went wrong in your case. Maybe ask ChatGPT-4 on the chat.openai.com website; there were many things it had me do. It also depends on whether I right-click and open a command prompt inside the folder, or where I launch it from.

Aunxfb commented 1 year ago

I got it running on Windows 11, so it's definitely possible; not sure what went wrong in your case. Maybe ask ChatGPT-4 on the chat.openai.com website; there were many things it had me do. It also depends on whether I right-click and open a command prompt inside the folder, or where I launch it from.

What did you ask ChatGPT?

vbwyrde commented 1 year ago

I also got it running on Windows 11 with the following hardware: Intel(R) Core(TM) i5-6500 CPU @ 3.20 GHz and 15.9 GB of installed RAM.

I used the Visual Studio download, put the model in the chat folder and voilà, I was able to run it. This was even before I had Python installed (required for the GPT4All-UI). The model I used was gpt4all-lora-quantized.bin ... it worked out of the box for me. My setup took about 10 minutes. Maybe try deleting everything, starting over from scratch, and doing nothing other than following the instructions exactly? You probably did, but who knows? Maybe something went wrong during the download. I'd try again.

However, note: on my hardware the model works, but after a few minutes it pegs my CPU at 100% and then gets very sketchy. Eventually it crashes PowerShell and I have to start again. So keep that in mind for lower-end Windows machines... like mine.

Aunxfb commented 1 year ago

I also got it running on Windows 11 with the following hardware: Intel(R) Core(TM) i5-6500 CPU @ 3.20 GHz and 15.9 GB of installed RAM.

I used the Visual Studio download, put the model in the chat folder and voilà, I was able to run it. This was even before I had Python installed (required for the GPT4All-UI). The model I used was gpt4all-lora-quantized.bin ... it worked out of the box for me. My setup took about 10 minutes. Maybe try deleting everything, starting over from scratch, and doing nothing other than following the instructions exactly? You probably did, but who knows? Maybe something went wrong during the download. I'd try again.

However, note: on my hardware the model works, but after a few minutes it pegs my CPU at 100% and then gets very sketchy. Eventually it crashes PowerShell and I have to start again. So keep that in mind for lower-end Windows machines... like mine.

That did not sound like you ran it on the GPU, to be honest... (the use of gpt4all-lora-quantized.bin gave it away). Thanks for trying to help, but that's not what I'm trying to do, and I did follow the instructions exactly, specifically the "GPU Interface" section.

Edit: I did manage to run it the normal/CPU way, but it's quite slow, so I want to utilize my GPU instead.

Slowly-Grokking commented 1 year ago

Anyone figure this out?

Slowly-Grokking commented 1 year ago

Tried https://huggingface.co/nomic-ai/gpt4all-j and got: RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'. Full output: error.txt
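"LayerNormKernelImpl" not implemented for 'Half' is the same class of failure as the original error: a half-precision op (LayerNorm this time, instead of a matmul) running on the CPU. If a model has already been loaded and part of it is sitting on the CPU in fp16, casting it to float32 is a slow, RAM-hungry but workable fallback; a sketch only, assuming model is the loaded transformers model:

# Cast every parameter and buffer to float32 so the CPU kernels exist.
# Roughly doubles the memory footprint relative to fp16 and runs at CPU speed.
model = model.float()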

FREQ-EE commented 1 year ago

I also got it running on Windows 11 with the following hardware: Intel(R) Core(TM) i5-6500 CPU @ 3.20 GHz and 15.9 GB of installed RAM. I used the Visual Studio download, put the model in the chat folder and voilà, I was able to run it. This was even before I had Python installed (required for the GPT4All-UI). The model I used was gpt4all-lora-quantized.bin ... it worked out of the box for me. My setup took about 10 minutes. Maybe try deleting everything, starting over from scratch, and doing nothing other than following the instructions exactly? You probably did, but who knows? Maybe something went wrong during the download. I'd try again. However, note: on my hardware the model works, but after a few minutes it pegs my CPU at 100% and then gets very sketchy. Eventually it crashes PowerShell and I have to start again. So keep that in mind for lower-end Windows machines... like mine.

That did not sound like you ran it on the GPU, to be honest... (the use of gpt4all-lora-quantized.bin gave it away). Thanks for trying to help, but that's not what I'm trying to do, and I did follow the instructions exactly, specifically the "GPU Interface" section.

Edit: I did manage to run it the normal/CPU way, but it's quite slow, so I want to utilize my GPU instead.

I'm also in the same situation, with similar specs to yours; I want to take advantage of my GPU, having successfully run on the CPU.

Did you end up finding a solution? Which model have you used? I'm trying with llama-7b-hf.

mbutodembuti commented 1 year ago

Hi, I am a total newbie and I am having the same issue. I followed all the instructions in the nomic repo but can't find a way to fix this. Where do you change the datatype to float32? I am using the sample .py script in the repo to start, with the model "decapoda-research/llama-7b-hf".

I tried on a 24 GB A5500 and an AMD Radeon Pro W6800 32 GB.

Thanks
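For later readers: the dtype is normally decided when the weights are loaded, not (only) in the model's config.json. A minimal sketch that bypasses the GPT4AllGPU wrapper and loads the base model directly with transformers, fp16 on the GPU when one is available and float32 on the CPU otherwise (decapoda-research/llama-7b-hf is just the checkpoint mentioned above; any Llama checkpoint works the same way):

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

model_id = "decapoda-research/llama-7b-hf"  # checkpoint mentioned above; substitute your own

tokenizer = LlamaTokenizer.from_pretrained(model_id)

if torch.cuda.is_available():
    # fp16 is fine as long as every layer actually ends up on the GPU.
    model = LlamaForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
else:
    # On the CPU, load in float32 -- fp16 matmuls/LayerNorm are what trigger
    # the "not implemented for 'Half'" errors in this thread.
    model = LlamaForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

inputs = tokenizer("write me a story about a lonely computer", return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
out = model.generate(**inputs, max_length=100, num_beams=2, repetition_penalty=2.0)
print(tokenizer.decode(out[0], skip_special_tokens=True))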

toobashahid210 commented 1 year ago

Tried https://huggingface.co/nomic-ai/gpt4all-j and got: RuntimeError: "LayerNormKernelImpl" not implemented for 'Half' full ouput: error.txt

I am facing the same issue. Did you find any solution?

Slowly-Grokking commented 1 year ago

I haven't found a working model for the GPU yet. I haven't troubleshot much more than that, though; I've only been trying to download various gpt4all and lora/llama models, hoping to get lucky. There's probably a simple solution, either updating the code or converting a model for it, but I haven't taken the time to understand what the model config requirements are and what ggml, quantized, etc. mean in this context.

I'd look into how the peft and transformers wheels were compiled; it might be as simple as updating those. Guessing because of:

\.venv\lib\site-packages\peft\tuners\lora.py:357 in forward

The other thing I've thought about doing is checking what the code looked like when the GPU instructions were added to the README, and seeing what changes have been made to the code and dependency versions since. The earliest mention of GPU I've found so far is here: https://github.com/nomic-ai/gpt4all/tree/e8c6aeeea27fe786b2bc6c3c32c2720c9660660e
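A quick sketch for snapshotting the versions that matter when comparing against that older commit (pip show <package> gives the same information from the command line):

import torch, transformers, peft, accelerate

# Print the versions of the packages that appear in the traceback above.
print("torch       :", torch.__version__,
      "| built for CUDA:", torch.version.cuda,
      "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("peft        :", peft.__version__)
print("accelerate  :", accelerate.__version__)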

mbutodembuti commented 1 year ago

I tried changing float16 to float32 in the transformers JSON config file and in gpt4all... but no luck, same error. Other models are also not working, with the same or other errors. I am a newbie at coding, so I haven't got the time right now to catch up on 10 years of knowledge and debug 😄

niansa commented 1 year ago

Stale, please open a new issue if this is still relevant.