huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

RuntimeError: Expected a 'mps:0' generator device but found 'cpu' #31897

Open katherine-atwell opened 3 months ago

katherine-atwell commented 3 months ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

While trying to run Trainer.train() on a Mac device, I run into the following error:

trainer.train()
  File "/Users/kateatwell/miniconda3/envs/huggingface-trainer/lib/python3.11/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/kateatwell/miniconda3/envs/huggingface-trainer/lib/python3.11/site-packages/transformers/trainer.py", line 1836, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/Users/kateatwell/miniconda3/envs/huggingface-trainer/lib/python3.11/site-packages/accelerate/data_loader.py", line 454, in __iter__
    current_batch = next(dataloader_iter)
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kateatwell/miniconda3/envs/huggingface-trainer/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/Users/kateatwell/miniconda3/envs/huggingface-trainer/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    index = self._next_index()  # may raise StopIteration
            ^^^^^^^^^^^^^^^^^^
  File "/Users/kateatwell/miniconda3/envs/huggingface-trainer/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 621, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kateatwell/miniconda3/envs/huggingface-trainer/lib/python3.11/site-packages/torch/utils/data/sampler.py", line 287, in __iter__
    for idx in self.sampler:
  File "/Users/kateatwell/miniconda3/envs/huggingface-trainer/lib/python3.11/site-packages/torch/utils/data/sampler.py", line 167, in __iter__
    yield from torch.randperm(n, generator=generator).tolist()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kateatwell/miniconda3/envs/huggingface-trainer/lib/python3.11/site-packages/torch/utils/_device.py", line 78, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected a 'mps:0' generator device but found 'cpu'
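
The last frames show RandomSampler calling torch.randperm with a CPU generator, while the torch/utils/_device.py frame suggests an "mps" default-device context is active somewhere. Under that assumption, the mismatch can be reproduced in isolation; a minimal sketch (the set_default_device call is my guess at the active context, not something visible in the trace):

import torch

# Assumption: something has made "mps" the default device; the
# torch/utils/_device.py frame in the traceback points at such a context.
torch.set_default_device("mps")

# RandomSampler falls back to a generator like this one when none is supplied.
generator = torch.Generator(device="cpu")

# randperm is dispatched to the default (mps) device but receives a CPU
# generator, raising:
# RuntimeError: Expected a 'mps:0' generator device but found 'cpu'
torch.randperm(8, generator=generator)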

This error is caused by the following code:

    from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

    model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=num_labels)
    model.to("mps")
    training_args = TrainingArguments(output_dir=output_dir, use_mps_device=True, remove_unused_columns=False)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_data,
        eval_dataset=val_data,
        compute_metrics=compute_metrics,
    )
    trainer.train()

train_data and val_data are instances of the following custom dataset:

import torch
from torch.utils.data import Dataset

class TweetDataset(Dataset):
    def __init__(self, data, tokenizer, max_length):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length
        input_ids = []
        labels = []
        encodings = []
        for i, row in data.iterrows():
            text = row["text"]
            label = row["is_misinfo"]
            encoding = tokenizer(text, padding="max_length", truncation=True)
            input_id = encoding["input_ids"]
            input_ids.append(torch.tensor(input_id))
            labels.append(label)
            encodings.append(encoding)
        self.encodings = encodings
        self.input_ids = torch.cat(input_ids).to("mps")
        self.labels = labels

    def __getitem__(self, idx):
        # Every item is materialized directly on the MPS device
        item = {key: torch.tensor(val, device="mps") for key, val in self.encodings[idx].items()}
        item["labels"] = torch.tensor(self.labels[idx], device="mps")
        return item

    def __len__(self):
        return len(self.data)

Expected behavior

The sampler's generator should be initialized on MPS rather than on CPU, so that training does not throw this error.

amyeroberts commented 2 months ago

cc @muellerzr @SunMarc

aroun-coumar commented 1 month ago

Hey @katherine-atwell, is the generator device set to mps?
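
For reference, a generator can be created directly on the MPS device and inspected; a small sketch, assuming a PyTorch build with MPS support:

import torch

# Create a generator on MPS and confirm where it lives
generator = torch.Generator(device="mps")
print(generator.device)  # mps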

MekkCyber commented 4 weeks ago

Hey @katherine-atwell, can you try the following code to see if it works:

import torch
from torch.utils.data import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

# Load the IMDb dataset
imdb = load_dataset("imdb")

class TweetDataset(Dataset):
    def __init__(self, data, tokenizer, max_length):
        self._data = data
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.encodings = tokenizer(
            [row["text"] for row in data],
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt"
        )
        self.labels = torch.tensor([row["label"] for row in data])

    def __getitem__(self, idx):
        # Return the dict of tokenized tensors and add the label
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self._data)

base_model = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

# Create training and validation datasets
train_data = TweetDataset(imdb["train"], tokenizer, 1024)
val_data = TweetDataset(imdb["test"], tokenizer, 1024)

num_labels = 2
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=num_labels)
model.config.pad_token_id = model.config.eos_token_id
model.to("mps")

# Define training arguments
training_args = TrainingArguments(
    remove_unused_columns=False,
    output_dir="./"
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)

# Start training
trainer.train()
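
A note on the design: this version keeps the dataset on CPU (__getitem__ returns plain tensors and never calls .to("mps")) and lets the Trainer move each batch to the model's device. Keeping the dataset and dataloader machinery on CPU is the usual recommendation and sidesteps device mismatches in the sampler.
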
github-actions[bot] commented 3 days ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.