model.generate(**inputs) breaks when inputs are batched on GPU

vitalyshalumov commented 1 year ago

System Info

transformers version: 4.34.1
Platform: Linux-5.15.0-1050-azure-x86_64-with-glibc2.29
Python version: 3.8.10
Huggingface_hub version: 0.18.0
Safetensors version: 0.4.0
Accelerate version: 0.24.0
Accelerate config: not found
PyTorch version (GPU?): 2.1.0+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:

Who can help?

No response

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction

I'm calling the generate function on inputs that I put on the GPU. I'm using an NLLB model.

When everything works:

when using a string as an input on cpu
when using a string as an input on gpu
when using a batch as an input on cpu

When it breaks: when using a batch as an input on GPU.

Example code: Translation from English to English

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-1.3B", src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-1.3B")
article = "This does not work"

# works: batched inputs left on CPU
# inputs = tokenizer([article, article, article, article, article], return_tensors="pt")
inputs = tokenizer.batch_encode_plus([article, article, article, article, article], return_tensors="pt")

# does not work: batched inputs moved to GPU
# inputs = tokenizer([article, article, article, article, article], return_tensors="pt").to("cuda")
# inputs = tokenizer.batch_encode_plus([article, article, article, article, article], return_tensors="pt").to("cuda")

# translate to the target language given by forced_bos_token_id, then decode the whole batch
translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["eng_Latn"])
translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)

The error given is:
Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
  File ..... in <module>
    translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["eng_Latn"])
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

Expected behavior

Inference on batched inputs that are on GPU.

ArthurZucker commented 1 year ago

It doesn't seem like the model was put on the device when you did inputs.to("cuda"). Did you try calling model.to("cuda") as well?
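
One way to check this is to compare the device of the model's parameters with the device of each input tensor. The snippet below is only a minimal sketch: the helper name devices_in_use is hypothetical (it is not part of transformers), and it assumes model is a torch.nn.Module and inputs behaves like a dict of tensors, as the tokenizer output does.

```python
def devices_in_use(model, inputs):
    """Collect the device names used by the model parameters and the input tensors.

    Hypothetical helper for illustration; assumes `model` exposes `.parameters()`
    and `inputs` is a dict-like mapping of names to tensors.
    """
    found = {str(param.device) for param in model.parameters()}
    found.update(str(tensor.device) for tensor in inputs.values())
    return found


# Usage with the model and inputs from the report (commented out: needs the downloaded model):
# print(devices_in_use(model, inputs))
```

If the returned set contains more than one device, for example both cpu and cuda:0, the model and the inputs are not on the same device, which matches the RuntimeError in the report.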

vitalyshalumov commented 1 year ago

model.to('cuda') resolves the issue. Thanks!
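
For reference, a minimal sketch of keeping the two in sync: instead of hard-coding "cuda" for the inputs, read the device from the model and move the batch there before calling generate. The helper name generate_on_model_device is hypothetical, and the sketch assumes inputs is a dict-like mapping of tensors, as returned by the tokenizer with return_tensors="pt".

```python
def generate_on_model_device(model, inputs, **generate_kwargs):
    """Move every input tensor to the model's device, then call model.generate.

    Hypothetical helper for illustration; it does not move the model itself.
    """
    # Use the device of the first parameter as the model's device.
    device = next(model.parameters()).device
    inputs_on_device = {name: tensor.to(device) for name, tensor in inputs.items()}
    return model.generate(**inputs_on_device, **generate_kwargs)


# Usage with the objects from the report (commented out: needs the downloaded model and a GPU):
# model.to("cuda")
# translated_tokens = generate_on_model_device(
#     model, inputs, forced_bos_token_id=tokenizer.lang_code_to_id["eng_Latn"]
# )
```

With this pattern the inputs follow the model, so the same script runs whether or not model.to("cuda") was called.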
