huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Impossible to train a model using both bf16 mixed precision training and torch compile, RuntimeError: expected mat1 and mat2 to have the same dtype #34470

Open RonanFR opened 1 week ago

RonanFR commented 1 week ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

import torch
from transformers import pipeline
from transformers import TrainingArguments, Trainer
from datasets import load_dataset

# Load classification pipeline from pretrained model
pipe = pipeline(
    "text-classification",
    model="Qwen/Qwen2.5-0.5B" ,
    model_kwargs={
        "num_labels": 5,
    },
    device_map="cuda"
)
print({p.data.dtype for p in pipe.model.parameters()})  # sanity check: weights load in float32 by default

# Load + format dataset
dataset = load_dataset("yelp_review_full")["train"].select(range(100))
def tokenize_function(examples):
    return pipe.tokenizer(
        examples["text"], 
        max_length=124, 
        padding="max_length", 
        truncation=True
    )
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Train
training_args = TrainingArguments(
    per_device_train_batch_size=8,
    num_train_epochs=1,
    torch_compile=True,
    bf16=True,  # use bfloat16 mixed precision training
    output_dir="/tmp/tests/test_1",
)
trainer = Trainer(
    model=pipe.model,
    train_dataset=tokenized_datasets,
    args=training_args,
    tokenizer=pipe.tokenizer,
)
trainer.train()  # fails with RuntimeError: expected mat1 and mat2 to have the same dtype

Expected behavior

Rocketknight1 commented 1 week ago

Hi @RonanFR, in general pipelines are inference-only, so loading the model with a pipeline and then training it is a bit odd! Can you see if you still get the issue when you initialize the model with AutoModelForSequenceClassification and AutoTokenizer instead? If you can give us some clean code without pipeline that reproduces the issue, we can investigate further.
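
Something like this minimal sketch would rule the pipeline out (reusing the Qwen checkpoint from your repro):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Build the classifier and tokenizer directly rather than through pipeline()
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-0.5B", num_labels=5, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")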

RonanFR commented 1 week ago

Thanks for your reply @Rocketknight1! Indeed, I tested your suggestion and it works perfectly fine when training the last layer only (see message below).

RonanFR commented 1 week ago

@Rocketknight1 actually I replied a bit too quickly in my last message. There is no issue when only the last layer score.weight is set to trainable (i.e., when it is the only parameter for which requires_grad is True). But if other layers are trained as well, the same RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16 occurs.
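
For context, the error itself is just PyTorch refusing a mixed-dtype matmul, as this standalone snippet (independent of transformers) shows:

import torch

a = torch.randn(2, 4)                        # float32
b = torch.randn(4, 3, dtype=torch.bfloat16)  # bfloat16
torch.mm(a, b)  # RuntimeError: mat1 and mat2 must have the same dtype (exact wording varies by device/version)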

Minimum reproducible example:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import TrainingArguments
from transformers import Trainer
from datasets import load_dataset

# Load classification model from a pretrained checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    "TinyLlama/TinyLlama_v1.1",
    num_labels=5,
    device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama_v1.1")
# TinyLlama has no pad token: reuse EOS as padding and resize the embeddings to match
tokenizer.add_special_tokens({"pad_token": tokenizer.eos_token})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

# Freeze everything except the score head and the q_proj layers
for n, p in model.named_parameters():
    if ("score" not in n) and ("q_proj" not in n):
        p.requires_grad = False

# Load + format dataset
dataset = load_dataset("yelp_review_full")["train"].select(range(100))
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        max_length=20,
        padding="max_length",
        truncation=True
    )
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Train 
training_args = TrainingArguments(
    per_device_train_batch_size=2**4,
    num_train_epochs=1,
    torch_compile=True, 
    bf16=True,
    logging_strategy="steps",
    logging_steps=1,
    output_dir="/tmp/test1",
    use_cpu=False
)
trainer = Trainer(
    model=model,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets,
    args=training_args,
    tokenizer=tokenizer,
)
trainer.train()

In the code above I am unfreezing only the last score layer and the q_proj layers, but the same problem occurs when selecting the v_proj layers, for instance. The code only runs without errors when the score layer alone is trainable.

I also tried with PEFT (instead of manually setting requires_grad to True on entire layers and False on others), and the same problem occurs; a sketch of that variant is below.
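
For reference, a minimal sketch of the PEFT variant, assuming peft's LoraConfig and get_peft_model; the LoRA hyperparameters here are illustrative placeholders, not the exact values from my run:

from peft import LoraConfig, TaskType, get_peft_model

# Wrap the same TinyLlama classifier in a LoRA adapter targeting q_proj.
# r / lora_alpha below are placeholder values.
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    target_modules=["q_proj"],
    r=8,
    lora_alpha=16,
    lora_dropout=0.0,
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# Training this wrapped model with the same Trainer setup hits the same RuntimeError.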

Rocketknight1 commented 6 days ago

Yes, I can reproduce the issue, but only by going back to 4.45. It's unfortunately a little awkward to test on the latest version: there's another issue affecting Llama model training, so I can't fully reproduce the problem on main. See https://github.com/huggingface/transformers/pull/34442