Closed realSAH closed 1 year ago
You need to execute a model loaded in half precision on a GPU; the operations are not implemented in half precision on the CPU.
@sgugger Then how come this example works on CPU?
from transformers import GPTJForCausalLM
import torch
model = GPTJForCausalLM.from_pretrained(
"EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
)
What code are you using exactly to get the error?
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('bigscience/bloomz-7b1', torch_dtype=torch.float16)
works perfectly fine.
@sgugger
Yes, it loads perfectly fine, but if you proceed to build the pipeline and generate text, you get the Half implementation error.
I just tried your code again
import torch
from transformers import AutoModelForCausalLM, pipeline
model = AutoModelForCausalLM.from_pretrained('bigscience/bloomz-7b1', torch_dtype=torch.float16, low_cpu_mem_usage=True)
g = pipeline(task='text-generation', model=model, tokenizer='bigscience/bloomz-7b1')
g("Hi, ")
I got this traceback:
In [1]:
...: import torch
...: from transformers import AutoModelForCausalLM, pipeline
...: model = AutoModelForCausalLM.from_pretrained('bigscience/bloomz-7b1', torch_dtype=torch.float16, low_cpu_mem_usage=True)
...: g = pipeline(task='text-generation', model=model, tokenizer='bigscience/bloomz-7b1')
...: g("Hi, ")
...:
C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\generation\utils.py:1273: UserWarning: Neither `max_length` nor `max_new_tokens` has been set, `max_length` will default to 20 (`generation_config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module> │
│ │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\pipelines\text_generation.py:210 in │
│ __call__ │
│ │
│ 207 │ │ │ - **generated_token_ids** (`torch.Tensor` or `tf.Tensor`, present when `retu │
│ 208 │ │ │ ids of the generated text. │
│ 209 │ │ """ │
│ ❱ 210 │ │ return super().__call__(text_inputs, **kwargs) │
│ 211 │ │
│ 212 │ def preprocess(self, prompt_text, prefix="", handle_long_generation=None, **generate │
│ 213 │ │ inputs = self.tokenizer( │
│ │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\pipelines\base.py:1084 in __call__ │
│ │
│ 1081 │ │ │ │ ) │
│ 1082 │ │ │ ) │
│ 1083 │ │ else: │
│ ❱ 1084 │ │ │ return self.run_single(inputs, preprocess_params, forward_params, postproces │
│ 1085 │ │
│ 1086 │ def run_multi(self, inputs, preprocess_params, forward_params, postprocess_params): │
│ 1087 │ │ return [self.run_single(item, preprocess_params, forward_params, postprocess_par │
│ │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\pipelines\base.py:1091 in run_single │
│ │
│ 1088 │ │
│ 1089 │ def run_single(self, inputs, preprocess_params, forward_params, postprocess_params): │
│ 1090 │ │ model_inputs = self.preprocess(inputs, **preprocess_params) │
│ ❱ 1091 │ │ model_outputs = self.forward(model_inputs, **forward_params) │
│ 1092 │ │ outputs = self.postprocess(model_outputs, **postprocess_params) │
│ 1093 │ │ return outputs │
│ 1094 │
│ │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\pipelines\base.py:992 in forward │
│ │
│ 989 │ │ │ │ inference_context = self.get_inference_context() │
│ 990 │ │ │ │ with inference_context(): │
│ 991 │ │ │ │ │ model_inputs = self._ensure_tensor_on_device(model_inputs, device=se │
│ ❱ 992 │ │ │ │ │ model_outputs = self._forward(model_inputs, **forward_params) │
│ 993 │ │ │ │ │ model_outputs = self._ensure_tensor_on_device(model_outputs, device= │
│ 994 │ │ │ else: │
│ 995 │ │ │ │ raise ValueError(f"Framework {self.framework} is not supported") │
│ │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\pipelines\text_generation.py:252 in │
│ _forward │
│ │
│ 249 │ │ │ in_b = input_ids.shape[0] │
│ 250 │ │ prompt_text = model_inputs.pop("prompt_text") │
│ 251 │ │ # BS x SL │
│ ❱ 252 │ │ generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=att │
│ 253 │ │ out_b = generated_sequence.shape[0] │
│ 254 │ │ if self.framework == "pt": │
│ 255 │ │ │ generated_sequence = generated_sequence.reshape(in_b, out_b // in_b, *genera │
│ │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\torch\autograd\grad_mode.py:27 in decorate_context │
│ │
│ 24 │ │ @functools.wraps(func) │
│ 25 │ │ def decorate_context(*args, **kwargs): │
│ 26 │ │ │ with self.clone(): │
│ ❱ 27 │ │ │ │ return func(*args, **kwargs) │
│ 28 │ │ return cast(F, decorate_context) │
│ 29 │ │
│ 30 │ def _wrap_generator(self, func): │
│ │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\generation\utils.py:1391 in generate │
│ │
│ 1388 │ │ │ │ ) │
│ 1389 │ │ │ │
│ 1390 │ │ │ # 11. run greedy search │
│ ❱ 1391 │ │ │ return self.greedy_search( │
│ 1392 │ │ │ │ input_ids, │
│ 1393 │ │ │ │ logits_processor=logits_processor, │
│ 1394 │ │ │ │ stopping_criteria=stopping_criteria, │
│ │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\generation\utils.py:2179 in │
│ greedy_search │
│ │
│ 2176 │ │ │ model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) │
│ 2177 │ │ │ │
│ 2178 │ │ │ # forward pass to get next token │
│ ❱ 2179 │ │ │ outputs = self( │
│ 2180 │ │ │ │ **model_inputs, │
│ 2181 │ │ │ │ return_dict=True, │
│ 2182 │ │ │ │ output_attentions=output_attentions, │
│ │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\torch\nn\modules\module.py:1194 in _call_impl │
│ │
│ 1191 │ │ # this function, and just call forward. │
│ 1192 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │
│ 1193 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1194 │ │ │ return forward_call(*input, **kwargs) │
│ 1195 │ │ # Do not call functions when jit is used │
│ 1196 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1197 │ │ if self._backward_hooks or _global_backward_hooks: │
│ │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\models\bloom\modeling_bloom.py:900 in │
│ forward │
│ │
│ 897 │ │ │
│ 898 │ │ return_dict = return_dict if return_dict is not None else self.config.use_return │
│ 899 │ │ │
│ ❱ 900 │ │ transformer_outputs = self.transformer( │
│ 901 │ │ │ input_ids, │
│ 902 │ │ │ past_key_values=past_key_values, │
│ 903 │ │ │ attention_mask=attention_mask, │
│ │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\torch\nn\modules\module.py:1194 in _call_impl │
│ │
│ 1191 │ │ # this function, and just call forward. │
│ 1192 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │
│ 1193 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1194 │ │ │ return forward_call(*input, **kwargs) │
│ 1195 │ │ # Do not call functions when jit is used │
│ 1196 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1197 │ │ if self._backward_hooks or _global_backward_hooks: │
│ │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\transformers\models\bloom\modeling_bloom.py:729 in │
│ forward │
│ │
│ 726 │ │ if inputs_embeds is None: │
│ 727 │ │ │ inputs_embeds = self.word_embeddings(input_ids) │
│ 728 │ │ │
│ ❱ 729 │ │ hidden_states = self.word_embeddings_layernorm(inputs_embeds) │
│ 730 │ │ │
│ 731 │ │ presents = () if use_cache else None │
│ 732 │ │ all_self_attentions = () if output_attentions else None │
│ │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\torch\nn\modules\module.py:1194 in _call_impl │
│ │
│ 1191 │ │ # this function, and just call forward. │
│ 1192 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │
│ 1193 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1194 │ │ │ return forward_call(*input, **kwargs) │
│ 1195 │ │ # Do not call functions when jit is used │
│ 1196 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1197 │ │ if self._backward_hooks or _global_backward_hooks: │
│ │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\torch\nn\modules\normalization.py:190 in forward │
│ │
│ 187 │ │ │ init.zeros_(self.bias) │
│ 188 │ │
│ 189 │ def forward(self, input: Tensor) -> Tensor: │
│ ❱ 190 │ │ return F.layer_norm( │
│ 191 │ │ │ input, self.normalized_shape, self.weight, self.bias, self.eps) │
│ 192 │ │
│ 193 │ def extra_repr(self) -> str: │
│ │
│ C:\Users\aalsaf01\venvs\nlp\lib\site-packages\torch\nn\functional.py:2515 in layer_norm │
│ │
│ 2512 │ │ return handle_torch_function( │
│ 2513 │ │ │ layer_norm, (input, weight, bias), input, normalized_shape, weight=weight, b │
│ 2514 │ │ ) │
│ ❱ 2515 │ return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.c │
│ 2516 │
│ 2517 │
│ 2518 def group_norm( │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'
By the way, it also complained about the accelerate library not being installed, saying that it's crucial for low_cpu and half precision. After installing it, loading works fine, but text generation still fails.
So the question would be: why does it still work with GPT-J, as per the official example in the Hugging Face docs?
I also hit this error when trying the Transformers blog post at https://mp.weixin.qq.com/s/k8rE9GrF97E-0TKJhih9kw.
As I said before, you need to run your model on the GPU as the operations are not all implemented on the CPU in float16. On CPU you can only run models in float32.
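A minimal sketch of that rule, assuming a PyTorch backend: use float16 only when a CUDA device is available, and fall back to float32 on CPU. The helper names `select_precision` and `load_for_generation` are hypothetical and are not part of transformers. Only `from_pretrained(..., torch_dtype=...)` and `pipeline(..., device=...)` are library calls.

```python
def select_precision(has_cuda):
    """Return (device, dtype name): float16 on GPU, float32 on CPU."""
    return ("cuda", "float16") if has_cuda else ("cpu", "float32")


def load_for_generation(model_id):
    """Hypothetical helper: build a text-generation pipeline with a dtype that suits the device."""
    # Heavy imports stay inside the function so the selection logic can be used on its own.
    import torch
    from transformers import AutoModelForCausalLM, pipeline

    device, dtype_name = select_precision(torch.cuda.is_available())
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=getattr(torch, dtype_name)
    )
    # pipeline() takes a device index: 0 is the first GPU, -1 is the CPU.
    return pipeline(
        task="text-generation",
        model=model,
        tokenizer=model_id,
        device=0 if device == "cuda" else -1,
    )
```

With this sketch, `load_for_generation('bigscience/bloomz-7b1')` would load the model in float32 on a CPU-only machine, which needs roughly twice the memory of the float16 weights.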
Okay, thanks for explaining that. I think an update to the docs would be appropriate.
https://huggingface.co/docs/transformers/model_doc/gptj
The docs could note that the low-precision example working on CPU is just a coincidence, since the operations it uses happen to be implemented for CPU. In general, half precision requires an accelerator device.
I'm not sure whether PyTorch has CPU implementations on its agenda.
Thanks for pointing this example out! It indeed needs a GPU to work. cc @stevhliu or @MKhalusova if you want to fix it (it's the example just before GPTJConfig on the page linked above that loads the model in float16).
The Tesla P40 does not support Half...
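Whether a given operation has a half-precision CPU kernel depends on the installed PyTorch build, so one way to find out is to probe it directly. The sketch below is an illustration under that assumption: `half_precision_supported` and `layer_norm_half_on_cpu` are hypothetical helper names, and the check relies on the error message format shown in the traceback above.

```python
def half_precision_supported(run_op):
    """Return False if run_op fails with a 'not implemented for Half' error, True if it runs."""
    try:
        run_op()
        return True
    except RuntimeError as err:
        if "not implemented for 'Half'" in str(err):
            return False
        raise  # Any other runtime error is a real problem, so re-raise it.


def layer_norm_half_on_cpu():
    """Run the operation that failed in the traceback: LayerNorm on a float16 CPU tensor."""
    import torch

    x = torch.ones(2, 8).half()  # float16 tensor on the CPU
    torch.nn.functional.layer_norm(x, (8,))
```

Calling `half_precision_supported(layer_norm_half_on_cpu)` reports whether the local build can run this particular operation. A passing probe for one operation says nothing about the other operations a model uses.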
System Info
I'm passing the low-memory and half-precision flags to
AutoModelForCausalLM.from_pretrained('bigscience/bloomz-7b1')
and I'm receiving the error above. I'm not sure if this is a bug. Are those flags only meant to be passed for specific models for which half precision is implemented? If so, how can one tell in a graceful way?
Those low-memory flags seem to work like a dream with other models like EleutherAI/gpt-j-6B. Thanks.
Who can help?
No response
Reproduction
As above.
Expected behavior
The model loads in half precision and can generate text.