huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (HF/Accelerate) #31504

Open sajastu opened 2 weeks ago

sajastu commented 2 weeks ago

Who can help?

@SunMarc , @ArthurZucker , @younesbelkada and @muellerzr

Reproduction

I'm trying to run Big Model Inference with HF's accelerate package using the following code (in a multi-GPU setting), but I keep getting the CUDA-related error attached below.

Code:

from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from accelerate import load_checkpoint_and_dispatch
from accelerate import Accelerator, init_empty_weights
from torch.utils.data import Dataset, DataLoader
import torch

# A simple dataset class
class TextDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        inputs = self.tokenizer(text, return_tensors="pt", max_length=self.max_length, truncation=True, padding="max_length")
        return inputs

# Some random text
input_texts = [
    "Once upon a time, in a land far, far away...",
    "In the beginning, there was darkness, and then there was light.",
    "The quick brown fox jumps over the lazy dog.",
    "To be or not to be, that is the question.",
    "A journey of a thousand miles begins with a single step."
]

accelerator = Accelerator()
checkpoint = "microsoft/Phi-3-medium-4k-instruct"
weights_location = snapshot_download(repo_id=checkpoint)

model_config = AutoConfig.from_pretrained(checkpoint, trust_remote_code=True)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config=model_config)

model = load_checkpoint_and_dispatch(
    model, checkpoint=weights_location, device_map="auto", no_split_module_classes=['Block']
)

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
dataset = TextDataset(input_texts, tokenizer)
data_loader = DataLoader(dataset, batch_size=1)

model, data_loader = accelerator.prepare(model, data_loader)

for batch in data_loader:
    # Generate text (the RuntimeError reported below is raised on this call)
    outputs = model.generate(batch['input_ids'][0], max_new_tokens=50)

Error (raised at the line model.generate(batch['input_ids'][0].to(device), max_new_tokens=50)):

Traceback (most recent call last):
  File "test.py", line 65, in <module>
    outputs = model.generate(batch['input_ids'][0].to(device), max_new_tokens=50)
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/transformers/generation/utils.py", line 1758, in generate
    result = self._sample(
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/transformers/generation/utils.py", line 2397, in _sample
    outputs = self(
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/disk1/sasha/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-medium-4k-instruct/d194e4e74ffad5a5e193e26af25bcfc80c7f1ffc/modeling_phi3.py", line 1286, in forward
    outputs = self.model(
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/disk1/sasha/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-medium-4k-instruct/d194e4e74ffad5a5e193e26af25bcfc80c7f1ffc/modeling_phi3.py", line 1164, in forward
    layer_outputs = decoder_layer(
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/disk1/sasha/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-medium-4k-instruct/d194e4e74ffad5a5e193e26af25bcfc80c7f1ffc/modeling_phi3.py", line 894, in forward
    hidden_states = residual + self.resid_attn_dropout(attn_outputs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
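
For reference (not part of the original report), a minimal diagnostic sketch to see how the checkpoint was dispatched across GPUs. It assumes model has been loaded as above and that parameters follow the "model.layers.N...." naming used by the remote modeling_phi3.py:

from collections import defaultdict

layer_devices = defaultdict(set)
for name, param in model.named_parameters():
    parts = name.split(".")
    # Group parameters per decoder layer,
    # e.g. "model.layers.17.self_attn.qkv_proj.weight" -> "model.layers.17"
    prefix = ".".join(parts[:3]) if parts[:2] == ["model", "layers"] else parts[0]
    layer_devices[prefix].add(str(param.device))

for prefix, devices in sorted(layer_devices.items()):
    marker = "  <-- split across devices" if len(devices) > 1 else ""
    print(prefix, sorted(devices), marker)

A layer that prints more than one device is a layer whose internal ops (like the residual addition in the traceback) will mix cuda:0 and cuda:1.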

Expected behavior

Generation of output text from the big model, without any CUDA-related error!

younesbelkada commented 2 weeks ago

Hi @sajastu, I looked at the traceback of the issue as well as the code on the Hub. Can you also add Phi3DecoderLayer to no_split_modules? The error seems to happen here: https://huggingface.co/microsoft/Phi-3-medium-4k-instruct/blob/main/modeling_phi3.py#L899
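
Applied to the snippet above, the suggested change would look something like this (the class name Phi3DecoderLayer comes from the remote modeling_phi3.py; keeping 'Block' from the original call):

model = load_checkpoint_and_dispatch(
    model,
    checkpoint=weights_location,
    device_map="auto",
    # Keep each decoder block intact on a single GPU so intra-layer ops
    # (e.g. residual + self.resid_attn_dropout(attn_outputs)) never cross devices.
    no_split_module_classes=['Block', 'Phi3DecoderLayer'],
)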

sajastu commented 2 weeks ago

Hey @younesbelkada, I added Phi3DecoderLayer to the no_split_module_classes list, but I'm still getting essentially the same error, apparently at a different spot:

flash-attention package not found, consider installing for better performance: /home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/flash_attn_2_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl3cow11cow_deleterEPv.
Current flash-attenton does not support window_size. Either upgrade or use attn_implementation='eager'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You are not running the flash-attention implementation, expect numerical differences.

Traceback (most recent call last):
  File "test.py", line 55, in <module>
    outputs = model.generate(batch['input_ids'][0], max_new_tokens=50)
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/transformers/generation/utils.py", line 1758, in generate
    result = self._sample(
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/transformers/generation/utils.py", line 2397, in _sample
    outputs = self(
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/disk1/sasha/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-medium-4k-instruct/d194e4e74ffad5a5e193e26af25bcfc80c7f1ffc/modeling_phi3.py", line 1286, in forward
    outputs = self.model(
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/disk1/sasha/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-medium-4k-instruct/d194e4e74ffad5a5e193e26af25bcfc80c7f1ffc/modeling_phi3.py", line 1164, in forward
    layer_outputs = decoder_layer(
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/disk1/sasha/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-medium-4k-instruct/d194e4e74ffad5a5e193e26af25bcfc80c7f1ffc/modeling_phi3.py", line 885, in forward
    attn_outputs, self_attn_weights, present_key_value = self.self_attn(
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/disk1/sasha/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-medium-4k-instruct/d194e4e74ffad5a5e193e26af25bcfc80c7f1ffc/modeling_phi3.py", line 383, in forward
    key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/transformers/cache_utils.py", line 155, in update
    self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_cat)
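
Editorial note, not from the thread: this second failure happens inside the KV cache (cache_utils.py), i.e. the cached key/value tensors and the incoming key_states ended up on different devices. A commonly suggested setup that avoids mixing accelerator.prepare() with an already-dispatched model is to let from_pretrained do the sharding and move inputs to the model's first device. A minimal sketch, assuming the GPUs have enough memory for device_map="auto":

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

checkpoint = "microsoft/Phi-3-medium-4k-instruct"

# Shard the model across available GPUs directly; do not pass the
# dispatched model through accelerator.prepare() afterwards.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
    attn_implementation="eager",  # matches the flash-attention warning above
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# model.device is the device of the first shard (typically cuda:0),
# which is where generate() expects the input ids.
inputs = tokenizer("Once upon a time...", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))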