Running `input_ids.cuda()` before feeding the ids into `model.generate()` resolves this error (see the sketch after the snippet below), but that isn't possible when using this equivalent pipeline route:
```python
import pprint

from transformers import pipeline, logging

logging.set_verbosity_info()

name = "facebook/opt-30b"
text = "test prompt"

pipe = pipeline(model=name, model_kwargs={"device_map": "auto", "load_in_8bit": True})
result = pipe(
    text,
    do_sample=True,
    max_length=200,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
)
pprint.pprint(result)
```
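For reference, a minimal sketch of the `generate()` route with the workaround applied; the model and generation parameters mirror the pipeline call above, and a CUDA GPU is assumed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch only: moving input_ids to the GPU with .cuda() is the workaround;
# with device_map="auto" the first model shard typically sits on GPU 0 too.
name = "facebook/opt-30b"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_8bit=True)

input_ids = tokenizer("test prompt", return_tensors="pt").input_ids.cuda()
output_ids = model.generate(
    input_ids, do_sample=True, max_length=200, top_k=50, top_p=0.95, num_return_sequences=3
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```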
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hmm, I sadly won't have time to look into this anytime soon. cc @gante @ArthurZucker here maybe
FYI, I was able to work around this by passing `device=0` to `pipeline()`. Obviously not an ideal solution, but acceptable when also using `device_map={"": 0}`.
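For concreteness, a sketch of that workaround (the model name and `load_in_8bit` are carried over from the snippet above):

```python
from transformers import pipeline

# device=0 moves the pipeline's input tensors to GPU 0, while
# device_map={"": 0} pins all model weights to that same GPU, so the
# inputs and the weights agree on a device.
pipe = pipeline(
    model="facebook/opt-30b",
    device=0,
    model_kwargs={"device_map": {"": 0}, "load_in_8bit": True},
)
```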
Added to the list of `generate` tasks -- @ArthurZucker lmk if you'd be interested in checking this issue!
Hey, gonna mark this as closed as #19468 fixes it!
Placing the `input_ids` on `"cuda"` solves the issue. A warning was added!
(this should not be closed as it is not documented :D )
cc @younesbelkada, I think you fixed this a long time ago, no? It was a bitsandbytes issue (I might be wrong, as it seems someone had the same problem), but I have seen similar issues.
No, this is not a `bitsandbytes` issue, as you can also reproduce it with `float16` models. To reproduce you can just run:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_NEW_TOKENS = 128
model_name = "gpt2"

text = """
Q: On average Joe throws 25 punches per minute. A fight lasts 5 rounds of 3 minutes.
How many punches did he throw?\n
A: Let's think step by step.\n"""

tokenizer = AutoTokenizer.from_pretrained(model_name)
# The input_ids stay on CPU here, while the model weights are dispatched
# by accelerate and cast to float16.
input_ids = tokenizer(text, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)

# Sampling with top_p runs the filtering ops on a CPU half-precision
# tensor, which triggers the RuntimeError from this issue.
generated_ids = model.generate(input_ids, max_length=len(input_ids[0]) + 25, do_sample=True, top_p=0.7)
print(tokenizer.decode(generated_ids[0]))
```
from: https://github.com/huggingface/transformers/pull/19468
This happens in edge cases where users pass a tensor that is on CPU to a model that has been converted to `float16`. Some operations, such as the `top_p` filtering, are then called on a CPU half-precision tensor, since `accelerate` sets the output on the same device as the input.
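The failure is easy to see in isolation. A minimal sketch, assuming a PyTorch build from the era of this issue (newer releases may have since added the CPU half-precision kernel):

```python
import torch

# top-k on a CPU float16 tensor: on the affected PyTorch builds this raises
# RuntimeError: "topk_cpu" not implemented for 'Half'.
logits = torch.randn(1, 50257, dtype=torch.float16)  # CPU half tensor
try:
    torch.topk(logits, k=50)
except RuntimeError as e:
    print(e)

# Casting to float32 (or moving the tensor to a GPU) sidesteps the missing kernel:
torch.topk(logits.float(), k=50)
```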
Given how convenient the `pipeline` utilities are, it would be great if the `input_ids.to(0)` workaround could be included there, so that code like in the comment above would run out of the box. I found this issue after trying to run a very similar script for FlanT5.
Hi @steve-marmalade, thanks for the message! Indeed, this is something we are trying to fix in https://github.com/huggingface/transformers/pull/21479. I think that the following should be supported:
```python
import pprint

from transformers import pipeline

name = "facebook/opt-30b"
text = "test prompt"

pipe = pipeline(model=name, model_kwargs={"device_map": "auto", "load_in_8bit": True}, device=0)
result = pipe(text, do_sample=True, max_length=200, top_k=50, top_p=0.95, num_return_sequences=3)
pprint.pprint(result)
```
@ArthurZucker, not really, as forcing `device=0` and `device_map="auto"` will lead to some unexpected behaviors that are described in #21479. Also, as stated by @Narsil:

> Imo using device_map and device should be an error (ambiguous intent)
Yes, but I mean that for small models, without `device_map="auto"`, it should work.
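For instance, something like this should just work for a model that fits on a single GPU (a sketch; the model and prompt are illustrative):

```python
from transformers import pipeline

# Small model, single GPU: device=0 alone is unambiguous, no device_map needed.
pipe = pipeline(model="gpt2", device=0)
print(pipe("test prompt", do_sample=True, max_length=50, top_k=50, top_p=0.95))
```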
Hi @younesbelkada, that's awesome that you are already working on it :superhero: I will follow along on #21479.
Closing as #21479 was merged
System Info
`transformers` version: 4.22.0.dev0

Who can help?
@patrickvonplaten @Lysandre
Information

Tasks
- `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
When running the example code above, I get:

`RuntimeError: "topk_cpu" not implemented for 'Half'`

I'm using `device_map="auto"` and the latest public version of `bitsandbytes`, along with `load_in_8bit=True`. It works fine when using greedy decoding instead of top-k/top-p sampling.
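For comparison, greedy decoding skips the top-k/top-p filtering ops entirely, which is why it succeeds. A sketch reusing the names from the reproduction snippet earlier in the thread:

```python
# Greedy decoding (do_sample=False) never calls the top-k/top-p filtering
# ops, so the missing CPU half-precision kernel is never hit.
generated_ids = model.generate(input_ids, max_new_tokens=MAX_NEW_TOKENS, do_sample=False)
print(tokenizer.decode(generated_ids[0]))
```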
Expected behavior

Inference should progress correctly and the result should be printed to the console.