huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

transformers.Trainer fails to create the optimizer for optim adamw_torch_fused when launched with deepspeed. #31867

Open princethewinner opened 2 months ago

princethewinner commented 2 months ago

System Info

Who can help?

@muellerzr

The issue arises when the script is launched with deepspeed. It seems the model has not yet been moved to the GPU when create_optimizer is called, so creating the fused optimizer fails.
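For context, here is a minimal sketch (outside Trainer and DeepSpeed) of the failure mode I am describing, assuming a PyTorch build in which the fused AdamW implementation only accepts CUDA parameters; the tiny Linear model and learning rate are just placeholders:

import torch

# Parameters are still on CPU at this point, mirroring the state of self.model
# inside create_optimizer when the script is launched through deepspeed.
model = torch.nn.Linear(4, 2)

try:
    # Roughly what Trainer builds for optim="adamw_torch_fused"
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, fused=True)
except RuntimeError as err:
    # On PyTorch builds where fused=True requires CUDA tensors, construction fails here
    print(f"fused AdamW rejected CPU parameters: {err}")

if torch.cuda.is_available():
    # Moving the model to the GPU first lets the fused optimizer be created
    model.cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, fused=True)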

Launch command

deepspeed --num_gpus=2 trainer_adamw_fused_test.py 

Output: (screenshot of the error raised during optimizer creation)

However, setting deepspeed_dict = None (i.e. passing deepspeed=None to TrainingArguments) with the same launch command does not cause any error, and training continues as usual. So I am guessing the failure is caused by conflicting DeepSpeed settings or by incorrect parsing of the DeepSpeed config.
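As a possible workaround sketch (unverified), the optimizer could be declared in the DeepSpeed config itself instead of through optim, so that DeepSpeed rather than Trainer instantiates it; the variable name and the "auto" values below are assumptions based on the Transformers DeepSpeed integration docs:

from transformers import TrainingArguments

# Same config as in the repro below, plus an explicit DeepSpeed-managed optimizer
deepspeed_dict_workaround = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": False},
    "zero_optimization": {"stage": 1},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto"},
    },
}

# No optim="adamw_torch_fused" here, so Trainer should defer optimizer creation to DeepSpeed
training_args = TrainingArguments(
    output_dir="test_trainer",
    eval_strategy="epoch",
    deepspeed=deepspeed_dict_workaround,
)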

Information

Tasks

Reproduction

from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
import numpy as np
import evaluate

from loguru import logger

class CustomTrainer(Trainer):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def create_optimizer(self):
        # Log the device of every model parameter right before the optimizer is
        # built; when launched with deepspeed they still show up as "cpu" here
        # (see the output above).
        logger.debug("Named parameters [{}]", [b.device.type for a, b in self.model.named_parameters()])
        return super().create_optimizer()

dataset = load_dataset("yelp_review_full")
dataset["train"][100]  # inspect a single example (not required for the repro)

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)

# Minimal DeepSpeed config: ZeRO stage 1, fp16 disabled, batch size and gradient
# accumulation taken from TrainingArguments via "auto".
deepspeed_dict = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 1
    }
}

# optim="adamw_torch_fused" combined with the DeepSpeed config above triggers the failure
training_args = TrainingArguments(
    output_dir="test_trainer",
    eval_strategy="epoch",
    optim="adamw_torch_fused",
    deepspeed=deepspeed_dict,
)

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Expected behavior

Training should complete.

amyeroberts commented 1 month ago

cc @muellerzr @SunMarc