huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

'topk_cpu not implemented for half' when using topk with bitsandbytes 8-bit quant #18703

Closed zaptrem closed 1 year ago

zaptrem commented 2 years ago

System Info

Who can help?

@patrickvonplaten @Lysandre

Information

Tasks

Reproduction

When running the example code below, I get RuntimeError: "topk_cpu" not implemented for 'Half'. I'm using device_map="auto" and the latest public version of bitsandbytes along with load_in_8bit=True. It works fine when using greedy decoding instead of top-k/top-p sampling.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", device_map="auto", load_in_8bit=True)

# the fast tokenizer currently does not work correctly
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=False)

prompt = "Hello, I am conscious and"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

generated_ids = model.generate(
    input_ids, 
    do_sample=True,
    max_length=200, 
    top_k=50, 
    top_p=0.95, 
    num_return_sequences=3)

result = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

print(result)

Expected behavior

Inference should progress correctly and result should be printed to console.

zaptrem commented 2 years ago

Running input_ids.cuda() before feeding it into model.generate() resolves this error (a sketch of that workaround is included after the pipeline snippet below), but that isn't possible when using this equivalent pipeline route:

import pprint
from transformers import pipeline, logging

logging.set_verbosity_info()

name = "facebook/opt-30b"
text = "test prompt"

pipe = pipeline(model=name, model_kwargs= {"device_map": "auto", "load_in_8bit": True})

result = pipe(text, do_sample=True,
    max_length=200, 
    top_k=50, 
    top_p=0.95, 
    num_return_sequences=3)

pprint.pprint(result)
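For reference, here is a minimal sketch of the non-pipeline workaround mentioned above: the original repro with input_ids moved to the GPU before calling generate(). It assumes a CUDA device is available and bitsandbytes is installed.

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=False)

prompt = "Hello, I am conscious and"

# Move the input ids onto the GPU so the top-k/top-p sampling ops run on CUDA tensors
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

generated_ids = model.generate(
    input_ids,
    do_sample=True,
    max_length=200,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))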
github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

patrickvonplaten commented 2 years ago

Hmm, I sadly won't have time to look into this anytime soon. cc @gante @ArthurZucker here maybe

lesniewski commented 2 years ago

FYI, I was able to work around this by passing device=0 to pipeline(). Obviously not an ideal solution, but acceptable when also using device_map={"": 0}.
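A minimal sketch of that workaround, assuming a single-GPU machine with bitsandbytes installed (model and prompt reused from the original repro; later comments in this thread note that mixing device and device_map is ambiguous, so treat this as a stopgap):

from transformers import pipeline

# Pin both the pipeline device and the model placement to GPU 0 so the
# tokenized inputs land on the same device as the quantized weights.
pipe = pipeline(
    model="facebook/opt-350m",
    model_kwargs={"device_map": {"": 0}, "load_in_8bit": True},
    device=0,
)

result = pipe("Hello, I am conscious and", do_sample=True, max_length=50, top_k=50, top_p=0.95)
print(result)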

gante commented 2 years ago

Added to the list of generate tasks -- @ArthurZucker lmk if you'd be interested in checking this issue!

ArthurZucker commented 1 year ago

Hey, gonna mark this as closed as #19468 fixes it! Placing the input_ids on "cuda" solves the issue, and a warning was added!

gante commented 1 year ago

(this should not be closed as it is not documented :D )

ArthurZucker commented 1 year ago

cc @younesbelkada I think you fixed this a long time ago, no? It was a bitsandbytes issue (I might be wrong, as it seems someone had the same problem), but I have seen similar issues.

younesbelkada commented 1 year ago

No, this is not a bitsandbytes issue, as you can also reproduce it with float16 models. To reproduce you can just run:


import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_NEW_TOKENS = 128
model_name = 'gpt2'

text = """
Q: On average Joe throws 25 punches per minute. A fight lasts 5 rounds of 3 minutes. 
How many punches did he throw?\n
A: Let’s think step by step.\n"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_ids = tokenizer(text, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(
  model_name,
  device_map='auto',
  torch_dtype=torch.float16
)
# input_ids is still on CPU, so the sampling ops in generate() run on CPU half tensors and fail
generated_ids = model.generate(input_ids, max_length=len(input_ids[0])+25, do_sample=True, top_p=0.7)
print(tokenizer.decode(generated_ids[0]))

from: https://github.com/huggingface/transformers/pull/19468

This happens in edge cases where users pass a tensor that is on CPU to a model that has been converted to float16. Some operations, such as the top-k/top-p filtering in generate, are then called on a CPU half tensor, because accelerate places the output on the same device as the input.
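A standalone sketch of the underlying limitation (not the transformers code path itself, just the op that fails): on the PyTorch versions current at the time of this issue, top-k on a half-precision CPU tensor raises the error from the title.

import torch

# CPU tensor in half precision, standing in for the logits that generate() samples from
logits = torch.randn(1, 50257, dtype=torch.float16)

try:
    torch.topk(logits, k=50)
except RuntimeError as err:
    print(err)  # "topk_cpu" not implemented for 'Half'

# The same op works once the tensor lives on the GPU
if torch.cuda.is_available():
    torch.topk(logits.to("cuda"), k=50)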

steve-marmalade commented 1 year ago

Given how convenient the pipeline utilities are, it would be great if the input_ids.to(0) workaround could be included there, so that code like in the comment above would run out of the box. I found this issue after trying to run a very similar script for FlanT5.

younesbelkada commented 1 year ago

Hi @steve-marmalade, thanks for the message! Indeed, this is something we are trying to fix in https://github.com/huggingface/transformers/pull/21479
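Until that lands, a hedged sketch (not the implementation in that PR) of handling the placement manually: look up the device of the model's parameters and move the tokenized inputs there before calling generate(). It assumes a CUDA device and reuses the gpt2 repro from above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16)

inputs = tokenizer("test prompt", return_tensors="pt")

# Under device_map="auto" the embeddings sit on the first device in the map,
# so the device of the first parameter is the right target for the inputs here.
device = next(model.parameters()).device
inputs = {k: v.to(device) for k, v in inputs.items()}

generated_ids = model.generate(**inputs, max_new_tokens=25, do_sample=True, top_p=0.7)
print(tokenizer.decode(generated_ids[0]))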

ArthurZucker commented 1 year ago

I think that the following should be supported.

import pprint
from transformers import pipeline, logging

name = "facebook/opt-30b"
text = "test prompt"

pipe = pipeline(model=name, model_kwargs={"device_map": "auto", "load_in_8bit": True}, device=0)

result = pipe(text, do_sample=True,
    max_length=200,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3)

younesbelkada commented 1 year ago

@ArthurZucker, not really, as forcing device=0 together with device_map="auto" will lead to some unexpected behaviors that are described in #21479. Also, as stated by @Narsil:

Imo using device_map and device should be an error (ambiguous intent)

ArthurZucker commented 1 year ago

Yes, but I mean for small models, it should work without device_map="auto".

steve-marmalade commented 1 year ago

Hi @younesbelkada, that's awesome that you are already working on it :superhero:

I will follow along on #21479

ArthurZucker commented 1 year ago

Closing as #21479 was merged