huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (HF/Accelerate) #31504

Open sajastu opened 2 weeks ago

sajastu commented 2 weeks ago

Who can help?

@SunMarc , @ArthurZucker , @younesbelkada and @muellerzr

Reproduction

I'm trying to run Big Model Inference with HF's accelerate package using the following code (in a multi-GPU setting), but I keep getting the CUDA-related error attached below.

Code:

from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from accelerate import load_checkpoint_and_dispatch
from accelerate import Accelerator, init_empty_weights
from torch.utils.data import Dataset, DataLoader
import torch

# A simple dataset class
class TextDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        inputs = self.tokenizer(text, return_tensors="pt", max_length=self.max_length, truncation=True, padding="max_length")
        return inputs

# Some random text
input_texts = [
    "Once upon a time, in a land far, far away...",
    "In the beginning, there was darkness, and then there was light.",
    "The quick brown fox jumps over the lazy dog.",
    "To be or not to be, that is the question.",
    "A journey of a thousand miles begins with a single step."
]

accelerator = Accelerator()
checkpoint = "microsoft/Phi-3-medium-4k-instruct"
weights_location = snapshot_download(repo_id=checkpoint)

model_config = AutoConfig.from_pretrained(checkpoint, trust_remote_code=True)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config=model_config)

model = load_checkpoint_and_dispatch(
    model, checkpoint=weights_location, device_map="auto", no_split_module_classes=['Block']
)

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
dataset = TextDataset(input_texts, tokenizer)
data_loader = DataLoader(dataset, batch_size=1)

model, data_loader = accelerator.prepare(model, data_loader)

for batch in data_loader:
    # Generate text (the RuntimeError reported below is raised on this call)
    outputs = model.generate(batch['input_ids'][0], max_new_tokens=50)

Error (raised at the line model.generate(batch['input_ids'][0].to(device), max_new_tokens=50)):

Traceback (most recent call last):
  File "test.py", line 65, in <module>
    outputs = model.generate(batch['input_ids'][0].to(device), max_new_tokens=50)
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/transformers/generation/utils.py", line 1758, in generate
    result = self._sample(
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/transformers/generation/utils.py", line 2397, in _sample
    outputs = self(
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/disk1/sasha/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-medium-4k-instruct/d194e4e74ffad5a5e193e26af25bcfc80c7f1ffc/modeling_phi3.py", line 1286, in forward
    outputs = self.model(
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/disk1/sasha/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-medium-4k-instruct/d194e4e74ffad5a5e193e26af25bcfc80c7f1ffc/modeling_phi3.py", line 1164, in forward
    layer_outputs = decoder_layer(
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/disk1/sasha/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-medium-4k-instruct/d194e4e74ffad5a5e193e26af25bcfc80c7f1ffc/modeling_phi3.py", line 894, in forward
    hidden_states = residual + self.resid_attn_dropout(attn_outputs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
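
For reference (not part of the original report), a minimal diagnostic sketch to see how the checkpoint was dispatched across GPUs. It assumes model has been loaded as above and that parameters follow the "model.layers.N...." naming used by the remote modeling_phi3.py:

from collections import defaultdict

layer_devices = defaultdict(set)
for name, param in model.named_parameters():
    parts = name.split(".")
    # Group parameters per decoder layer,
    # e.g. "model.layers.17.self_attn.qkv_proj.weight" -> "model.layers.17"
    prefix = ".".join(parts[:3]) if parts[:2] == ["model", "layers"] else parts[0]
    layer_devices[prefix].add(str(param.device))

for prefix, devices in sorted(layer_devices.items()):
    marker = "  <-- split across devices" if len(devices) > 1 else ""
    print(prefix, sorted(devices), marker)

A layer that prints more than one device is a layer whose internal ops (like the residual addition in the traceback) will mix cuda:0 and cuda:1.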

Expected behavior

Generation of output text from the big model, without any CUDA-related error!

younesbelkada commented 2 weeks ago

Hi @sajastu, I looked at the traceback of the issue as well as the code on the Hub. Can you also add Phi3DecoderLayer to no_split_modules? The error seems to happen here: https://huggingface.co/microsoft/Phi-3-medium-4k-instruct/blob/main/modeling_phi3.py#L899
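
Applied to the snippet above, the suggested change would look something like this (the class name Phi3DecoderLayer comes from the remote modeling_phi3.py; keeping 'Block' from the original call):

model = load_checkpoint_and_dispatch(
    model,
    checkpoint=weights_location,
    device_map="auto",
    # Keep each decoder block intact on a single GPU so intra-layer ops
    # (e.g. residual + self.resid_attn_dropout(attn_outputs)) never cross devices.
    no_split_module_classes=['Block', 'Phi3DecoderLayer'],
)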

sajastu commented 2 weeks ago

Hey @younesbelkada, I added Phi3DecoderLayer to the no_split_module_classes list, but I'm still getting essentially the same error, apparently at a different spot:

flash-attention package not found, consider installing for better performance: /home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/flash_attn_2_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl3cow11cow_deleterEPv.
Current flash-attenton does not support window_size. Either upgrade or use attn_implementation='eager'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You are not running the flash-attention implementation, expect numerical differences.

Traceback (most recent call last):
  File "test.py", line 55, in <module>
    outputs = model.generate(batch['input_ids'][0], max_new_tokens=50)
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/transformers/generation/utils.py", line 1758, in generate
    result = self._sample(
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/transformers/generation/utils.py", line 2397, in _sample
    outputs = self(
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/disk1/sasha/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-medium-4k-instruct/d194e4e74ffad5a5e193e26af25bcfc80c7f1ffc/modeling_phi3.py", line 1286, in forward
    outputs = self.model(
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/disk1/sasha/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-medium-4k-instruct/d194e4e74ffad5a5e193e26af25bcfc80c7f1ffc/modeling_phi3.py", line 1164, in forward
    layer_outputs = decoder_layer(
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/disk1/sasha/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-medium-4k-instruct/d194e4e74ffad5a5e193e26af25bcfc80c7f1ffc/modeling_phi3.py", line 885, in forward
    attn_outputs, self_attn_weights, present_key_value = self.self_attn(
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/disk1/sasha/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-medium-4k-instruct/d194e4e74ffad5a5e193e26af25bcfc80c7f1ffc/modeling_phi3.py", line 383, in forward
    key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
  File "/home/sasha/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/transformers/cache_utils.py", line 155, in update
    self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_cat)
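
Editorial note, not from the thread: this second failure happens inside the KV cache (cache_utils.py), i.e. the cached key/value tensors and the incoming key_states ended up on different devices. A commonly suggested setup that avoids mixing accelerator.prepare() with an already-dispatched model is to let from_pretrained do the sharding and move inputs to the model's first device. A minimal sketch, assuming the GPUs have enough memory for device_map="auto":

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

checkpoint = "microsoft/Phi-3-medium-4k-instruct"

# Shard the model across available GPUs directly; do not pass the
# dispatched model through accelerator.prepare() afterwards.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
    attn_implementation="eager",  # matches the flash-attention warning above
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# model.device is the device of the first shard (typically cuda:0),
# which is where generate() expects the input ids.
inputs = tokenizer("Once upon a time...", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))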