RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

realSAH commented 1 year ago

System Info

I'm passing the low-memory and half-precision flags to AutoModelForCausalLM.from_pretrained('bigscience/bloomz-7b1') and I'm receiving the error above.

I'm not sure if this is a bug. Are those flags only meant to be passed to specific models for which half precision is implemented? If so, how can one tell in a graceful way?

Those low memory flags seem to work like a dream with other models like EleutherAI/gpt-j-6B.

Thanks

Who can help?

No response

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

as above.

Expected behavior

model loaded in half precision.

sgugger commented 1 year ago

You need to execute a model loaded in half precision on a GPU, as the operations are not implemented in half precision on the CPU.
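A minimal sketch of how the advice above could be applied: choose float16 only when a CUDA GPU is available and fall back to float32 on CPU. The helper names `pick_device_and_dtype` and `build_generator` are hypothetical and not part of transformers. The `device=0` / `device=-1` convention is the pipeline's device index (first GPU / CPU). Note that float32 roughly doubles the memory needed compared to float16.

```python
def pick_device_and_dtype(cuda_available: bool):
    """Return (pipeline device index, torch dtype name).

    Hypothetical helper: half precision only when a GPU is present,
    full precision on CPU.
    """
    if cuda_available:
        return 0, "float16"   # first CUDA device, half precision
    return -1, "float32"      # CPU, full precision


def build_generator(model_id: str = "bigscience/bloomz-7b1"):
    """Build a text-generation pipeline with a dtype that matches the device."""
    # Heavy imports are kept local so the helper above stays importable
    # on machines without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, pipeline

    device, dtype_name = pick_device_and_dtype(torch.cuda.is_available())
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=getattr(torch, dtype_name),
        low_cpu_mem_usage=True,  # requires the accelerate library
    )
    return pipeline(
        task="text-generation", model=model, tokenizer=model_id, device=device
    )


if __name__ == "__main__":
    g = build_generator()
    print(g("Hi, ", max_new_tokens=20))
```

With this, the same script should run on a CPU-only machine (in float32) and on a GPU machine (in float16) without changing any flags by hand.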

realSAH commented 1 year ago

@sgugger Then how come this example works on CPU?

from transformers import GPTJForCausalLM
import torch

model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
)

sgugger commented 1 year ago

What code are you using exactly to get the error?

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('bigscience/bloomz-7b1', torch_dtype=torch.float16)

works perfectly fine.

realSAH commented 1 year ago

@sgugger

Yes, it loads up perfectly fine, but if you proceed to build the pipeline and generate text, you get the Half implementation error. I just tried your code again:


import torch
from transformers import AutoModelForCausalLM, pipeline
model = AutoModelForCausalLM.from_pretrained('bigscience/bloomz-7b1', torch_dtype=torch.float16, low_cpu_mem_usage=True)
g = pipeline(task='text-generation', model=model, tokenizer='bigscience/bloomz-7b1')
g("Hi, ")

I got this traceback:


C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\generation\utils.py:1273: UserWarning: Neither `max_length` nor `max_new_tokens` has been set, `max_length` will default to 20 (`generation_config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>                                                                                      │
│                                                                                                  │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\pipelines\text_generation.py:210 in   │
│ __call__                                                                                         │
│                                                                                                  │
│   207 │   │   │   - **generated_token_ids** (`torch.Tensor` or `tf.Tensor`, present when `retu   │
│   208 │   │   │     ids of the generated text.                                                   │
│   209 │   │   """                                                                                │
│ ❱ 210 │   │   return super().__call__(text_inputs, **kwargs)                                     │
│   211 │                                                                                          │
│   212 │   def preprocess(self, prompt_text, prefix="", handle_long_generation=None, **generate   │
│   213 │   │   inputs = self.tokenizer(                                                           │
│                                                                                                  │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\pipelines\base.py:1084 in __call__    │
│                                                                                                  │
│   1081 │   │   │   │   )                                                                         │
│   1082 │   │   │   )                                                                             │
│   1083 │   │   else:                                                                             │
│ ❱ 1084 │   │   │   return self.run_single(inputs, preprocess_params, forward_params, postproces  │
│   1085 │                                                                                         │
│   1086 │   def run_multi(self, inputs, preprocess_params, forward_params, postprocess_params):   │
│   1087 │   │   return [self.run_single(item, preprocess_params, forward_params, postprocess_par  │
│                                                                                                  │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\pipelines\base.py:1091 in run_single  │
│                                                                                                  │
│   1088 │                                                                                         │
│   1089 │   def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):  │
│   1090 │   │   model_inputs = self.preprocess(inputs, **preprocess_params)                       │
│ ❱ 1091 │   │   model_outputs = self.forward(model_inputs, **forward_params)                      │
│   1092 │   │   outputs = self.postprocess(model_outputs, **postprocess_params)                   │
│   1093 │   │   return outputs                                                                    │
│   1094                                                                                           │
│                                                                                                  │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\pipelines\base.py:992 in forward      │
│                                                                                                  │
│    989 │   │   │   │   inference_context = self.get_inference_context()                          │
│    990 │   │   │   │   with inference_context():                                                 │
│    991 │   │   │   │   │   model_inputs = self._ensure_tensor_on_device(model_inputs, device=se  │
│ ❱  992 │   │   │   │   │   model_outputs = self._forward(model_inputs, **forward_params)         │
│    993 │   │   │   │   │   model_outputs = self._ensure_tensor_on_device(model_outputs, device=  │
│    994 │   │   │   else:                                                                         │
│    995 │   │   │   │   raise ValueError(f"Framework {self.framework} is not supported")          │
│                                                                                                  │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\pipelines\text_generation.py:252 in   │
│ _forward                                                                                         │
│                                                                                                  │
│   249 │   │   │   in_b = input_ids.shape[0]                                                      │
│   250 │   │   prompt_text = model_inputs.pop("prompt_text")                                      │
│   251 │   │   # BS x SL                                                                          │
│ ❱ 252 │   │   generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=att   │
│   253 │   │   out_b = generated_sequence.shape[0]                                                │
│   254 │   │   if self.framework == "pt":                                                         │
│   255 │   │   │   generated_sequence = generated_sequence.reshape(in_b, out_b // in_b, *genera   │
│                                                                                                  │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\torch\autograd\grad_mode.py:27 in decorate_context │
│                                                                                                  │
│    24 │   │   @functools.wraps(func)                                                             │
│    25 │   │   def decorate_context(*args, **kwargs):                                             │
│    26 │   │   │   with self.clone():                                                             │
│ ❱  27 │   │   │   │   return func(*args, **kwargs)                                               │
│    28 │   │   return cast(F, decorate_context)                                                   │
│    29 │                                                                                          │
│    30 │   def _wrap_generator(self, func):                                                       │
│                                                                                                  │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\generation\utils.py:1391 in generate  │
│                                                                                                  │
│   1388 │   │   │   │   )                                                                         │
│   1389 │   │   │                                                                                 │
│   1390 │   │   │   # 11. run greedy search                                                       │
│ ❱ 1391 │   │   │   return self.greedy_search(                                                    │
│   1392 │   │   │   │   input_ids,                                                                │
│   1393 │   │   │   │   logits_processor=logits_processor,                                        │
│   1394 │   │   │   │   stopping_criteria=stopping_criteria,                                      │
│                                                                                                  │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\generation\utils.py:2179 in           │
│ greedy_search                                                                                    │
│                                                                                                  │
│   2176 │   │   │   model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)  │
│   2177 │   │   │                                                                                 │
│   2178 │   │   │   # forward pass to get next token                                              │
│ ❱ 2179 │   │   │   outputs = self(                                                               │
│   2180 │   │   │   │   **model_inputs,                                                           │
│   2181 │   │   │   │   return_dict=True,                                                         │
│   2182 │   │   │   │   output_attentions=output_attentions,                                      │
│                                                                                                  │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\torch\nn\modules\module.py:1194 in _call_impl      │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\models\bloom\modeling_bloom.py:900 in │
│ forward                                                                                          │
│                                                                                                  │
│    897 │   │                                                                                     │
│    898 │   │   return_dict = return_dict if return_dict is not None else self.config.use_return  │
│    899 │   │                                                                                     │
│ ❱  900 │   │   transformer_outputs = self.transformer(                                           │
│    901 │   │   │   input_ids,                                                                    │
│    902 │   │   │   past_key_values=past_key_values,                                              │
│    903 │   │   │   attention_mask=attention_mask,                                                │
│                                                                                                  │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\torch\nn\modules\module.py:1194 in _call_impl      │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\models\bloom\modeling_bloom.py:729 in │
│ forward                                                                                          │
│                                                                                                  │
│    726 │   │   if inputs_embeds is None:                                                         │
│    727 │   │   │   inputs_embeds = self.word_embeddings(input_ids)                               │
│    728 │   │                                                                                     │
│ ❱  729 │   │   hidden_states = self.word_embeddings_layernorm(inputs_embeds)                     │
│    730 │   │                                                                                     │
│    731 │   │   presents = () if use_cache else None                                              │
│    732 │   │   all_self_attentions = () if output_attentions else None                           │
│                                                                                                  │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\torch\nn\modules\module.py:1194 in _call_impl      │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\torch\nn\modules\normalization.py:190 in forward   │
│                                                                                                  │
│   187 │   │   │   init.zeros_(self.bias)                                                         │
│   188 │                                                                                          │
│   189 │   def forward(self, input: Tensor) -> Tensor:                                            │
│ ❱ 190 │   │   return F.layer_norm(                                                               │
│   191 │   │   │   input, self.normalized_shape, self.weight, self.bias, self.eps)                │
│   192 │                                                                                          │
│   193 │   def extra_repr(self) -> str:                                                           │
│                                                                                                  │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\torch\nn\functional.py:2515 in layer_norm          │
│                                                                                                  │
│   2512 │   │   return handle_torch_function(                                                     │
│   2513 │   │   │   layer_norm, (input, weight, bias), input, normalized_shape, weight=weight, b  │
│   2514 │   │   )                                                                                 │
│ ❱ 2515 │   return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.c  │
│   2516                                                                                           │
│   2517                                                                                           │
│   2518 def group_norm(                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

By the way, it also complained that the accelerate library was not installed, saying that it's crucial for low_cpu_mem_usage and half precision. After installing it, loading works fine, but text generation still fails.

So the question is: why does it still work with GPT-J, as per the official example in the Hugging Face docs?

baiwan-chenhao commented 1 year ago

This also fails when following the Transformers blog post at https://mp.weixin.qq.com/s/k8rE9GrF97E-0TKJhih9kw.

sgugger commented 1 year ago

As I said before, you need to run your model on the GPU as the operations are not all implemented on the CPU in float16. On CPU you can only run models in float32.
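One way to check whether a given operation is available in float16 on the CPU of the current PyTorch build is to try it on a tiny tensor and catch the RuntimeError. This is only a sketch: `runs_without_runtime_error` and `cpu_half_layernorm_works` are hypothetical helper names, and the result depends on the installed PyTorch version, since CPU support for half precision varies between releases.

```python
def runs_without_runtime_error(fn) -> bool:
    """Call fn() and report whether it completed without a RuntimeError."""
    try:
        fn()
        return True
    except RuntimeError:
        # e.g. "LayerNormKernelImpl" not implemented for 'Half'
        return False


def cpu_half_layernorm_works() -> bool:
    """Probe LayerNorm in float16 on CPU using a tiny input."""
    import torch  # local import: only needed when the probe is actually run

    layer_norm = torch.nn.LayerNorm(8).half()  # float16 weights, on CPU
    x = torch.randn(2, 8).half()               # float16 input, on CPU
    return runs_without_runtime_error(lambda: layer_norm(x))


if __name__ == "__main__":
    if cpu_half_layernorm_works():
        print("LayerNorm runs in float16 on this CPU build.")
    else:
        print("LayerNorm is not implemented for float16 on CPU; use float32 or a GPU.")
```

A passing probe for one operation does not guarantee that every operation a model uses is implemented in float16 on CPU, so float32 remains the safe choice there.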

realSAH commented 1 year ago

Okay, thanks for explaining that. I think an update for docs would be appropriate.

https://huggingface.co/docs/transformers/model_doc/gptj

The docs could indicate that the low-precision example working on CPU is just a coincidence, as the operations it uses happen to be implemented for CPU. In general, this requires an accelerator device.

I'm not sure whether PyTorch has a CPU implementation on its agenda.

sgugger commented 1 year ago

Thanks for pointing this example out! It indeed needs a GPU to work. cc @stevhliu or @MKhalusova if you want to fix it (it's the example just before GPTJConfig on the page linked above that loads the model in float16).

akdcelcopr77 commented 1 year ago

The Tesla P40 does not support Half...
