huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Trainer.train -> Expected a 'cuda' device type for generator but found 'cpu' #31833

Closed diego-coba closed 2 months ago

diego-coba commented 3 months ago

System Info

Transformers 4.41.2 PyTorch 2.3.1+cu121 Python 3.12.3 Ubuntu 24.04

GPU: NVIDIA GeForce GTX 1650

Who can help?

No response

Information

Tasks

Reproduction

%pip install --quiet torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
%pip install --quiet -U datasets
%pip install --quiet torchdata
%pip install --quiet setuptools
%pip install --quiet transformers
%pip install --quiet evaluate
%pip install --quiet rouge_score
%pip install --quiet loralib
%pip install --quiet peft
%pip install --quiet ipywidgets

import torch

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)} is available and will be used.")
else:
    print("CUDA is not available. CPU will be used.")

dash_line = '-'.join('' for x in range(100))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, TaskType
import time
import evaluate
import pandas as pd
import numpy as np
with torch.device(device):
    huggingface_dataset_name = "knkarthick/dialogsum"
    dataset = load_dataset(huggingface_dataset_name)
    model_name='google/flan-t5-base'
    original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize_function(example):
        start_prompt = 'Summarize the following conversation.\n\n'
        end_prompt = '\n\nSummary: '
        prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
        example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
        example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

        return example

    tokenized_datasets = dataset.map(tokenize_function, batched=True)
    tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

    lora_config = LoraConfig(
        r=16, # Rank
        lora_alpha=32,
        target_modules=["q", "v"],
        lora_dropout=0.05,
        bias="none",
        task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
    )

    peft_model = get_peft_model(original_model, lora_config)

    output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

    peft_training_args = TrainingArguments(
        output_dir=output_dir,
        auto_find_batch_size=True,
        learning_rate=1e-3, # Higher learning rate than full fine-tuning.
        num_train_epochs=10,
        logging_steps=1,
        max_steps=1    
    )

    peft_trainer = Trainer(
        model=peft_model,
        args=peft_training_args,
        train_dataset=tokenized_datasets["train"],
    )

    trainer_args = {
        "resume_from_checkpoint":None,
        "trial":None,
        "ignore_keys_for_eval":None
    }

    peft_trainer.train(
        **trainer_args
    )

    peft_model_path="./peft-dialogue-summary-checkpoint-local"

    peft_trainer.model.save_pretrained(peft_model_path)
    tokenizer.save_pretrained(peft_model_path)

The code shown above throws: "Expected a 'cuda' device type for generator but found 'cpu'"

Stack trace:
RuntimeError                              Traceback (most recent call last)
Cell In[5], line 54
     42 peft_trainer = Trainer(
     43     model=peft_model,
     44     args=peft_training_args,
     45     train_dataset=tokenized_datasets["train"],
     46 )
     48 trainer_args = {
     49     "resume_from_checkpoint":None,
     50     "trial":None,
     51     "ignore_keys_for_eval":None
     52 }
---> 54 peft_trainer.train(
     55     **trainer_args
     56 )
     58 peft_model_path="./peft-dialogue-summary-checkpoint-local"
     60 peft_trainer.model.save_pretrained(peft_model_path)

File ~/Documentos/Python/.venv/lib/python3.12/site-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1883         hf_hub_utils.enable_progress_bars()
   1884 else:
-> 1885     return inner_training_loop(
   1886         args=args,
   1887         resume_from_checkpoint=resume_from_checkpoint,
...
File ~/Documentos/Python/.venv/lib/python3.12/site-packages/torch/utils/_device.py:78
     76 if func in _device_constructors() and kwargs.get('device') is None:
     77     kwargs['device'] = self.device
---> 78 return func(*args, **kwargs)

RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
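
Reading the last frames, it looks like a torch.Generator() created on the CPU (apparently by the Trainer's random sampler) is handed to a factory call such as torch.randperm, while the torch.device("cuda") context (via torch/utils/_device.py) injects device='cuda' into that call, so the generator and the requested device no longer match. A minimal sketch, assuming that reading of the trace is correct, that reproduces the same error outside the Trainer:

import torch

g = torch.Generator()  # CPU generator, like the one the sampler builds

with torch.device("cuda"):
    # the device context adds device='cuda' to the randperm call, which then
    # fails with: Expected a 'cuda' device type for generator but found 'cpu'
    torch.randperm(10, generator=g)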

Expected behavior

The error should not be thrown, since the entire script runs inside "with torch.device(device):" with device set to 'cuda'.

amyeroberts commented 3 months ago

cc @muellerzr @SunMarc

muellerzr commented 3 months ago

Why are we doing everything under with device()? Does it work if you remove this?

diego-coba commented 3 months ago

Thanks for looking at my issue.

Q: Why? A: When working with the large variant of the model for inference, PyTorch wasn't using the GPU, so I had to move the model manually with .to('cuda'). To avoid moving everything (tokenizer, dataset, model) by hand, I started using the "with torch.device(...)" syntax. Now I'm trying to train with PEFT/LoRA, and since my GPU has only 4 GB of VRAM I switched to the base variant this time, keeping the explicit device specification, but the error shown above occurs.

Q: Does it work if I remove it? A: It does. In fact, even when I set the device to CPU, PyTorch seems to ignore it and, with the base variant of the model, automatically uses the GPU; nvidia-smi shows about 3.8 GB of VRAM in use while the script runs.

So I don't know why PyTorch sometimes uses the GPU automatically and sometimes doesn't, but when I try to force it onto the GPU for PEFT/LoRA training, the error above occurs.

For now I'm just relying on the automatic device detection. But I still think there's something not working properly somewhere.
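
In case it helps others, this is roughly what the working version looks like without the device context (just a sketch of the relevant lines, reusing the names from the script above): drop the "with torch.device(device):" block entirely and let the Trainer pick the device itself, since it moves both the model and the batches to CUDA when it is available.

# no `with torch.device(device):` block; the Trainer's TrainingArguments
# detect CUDA on their own and move the model and the batches there
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
peft_model = get_peft_model(original_model, lora_config)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)
peft_trainer.train()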

Thanks again @muellerzr

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.