Impossible to train a model using both bf16 mixed precision training and torch compile, RuntimeError: expected mat1 and mat2 to have the same dtype

RonanFR commented 1 week ago

System Info

transformers version: 4.45.2
datasets version: 3.0.1
Platform: Linux-5.15.0-1070-aws-x86_64-with-glibc2.35
Python version: 3.10.12
Huggingface_hub version: 0.26.1
Safetensors version: 0.4.5
Accelerate version: 1.0.1
Accelerate config: not found
PyTorch version (GPU?): 2.5.0+cu118 (True)
Tensorflow version (GPU?): 2.14.1 (True)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?: no
Using GPU in script?: yes
GPU type: NVIDIA A10G

Who can help?

No response

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction

import torch
from transformers import pipeline
from transformers import TrainingArguments, Trainer
from datasets import load_dataset

# Load classification pipeline from pretrained model
pipe = pipeline(
    "text-classification",
    model="Qwen/Qwen2.5-0.5B" ,
    model_kwargs={
        "num_labels": 5,
    },
    device_map="cuda"
)
print({p.data.dtype for p in pipe.model.parameters()})

# Load + format dataset
dataset = load_dataset("yelp_review_full")["train"].select(range(100))
def tokenize_function(examples):
    return pipe.tokenizer(
        examples["text"], 
        max_length=124, 
        padding="max_length", 
        truncation=True
    )
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Train 
training_args = TrainingArguments(
    per_device_train_batch_size=8,
    num_train_epochs=1,
    torch_compile=True, 
    bf16=True,  # use bfloat16 mixed precision training
    output_dir="/tmp/tests/test_1",
)

Expected behavior

The code attached raises RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16. When disabling torch compilation or using float32 (or doing both), everything works fine.
The problem does not seem to occur when pytorch is downgraded to version 2.4.1. I am not fully sure though because in this case another error occur: RuntimeError: invalid dtype for bias when use compile + autocast · Issue #124901 · pytorch/pytorch · GitHub 1 (at the end of the issue they mention that the problem is fixed with pytorch 2.5.0, but then the issue above occurs, I am stuck in a circular loop :sweat_smile: )
The same problem seems to occur with float16 instead of bfloat16 (but not for tensorfloat32 apparently).
The same code works perfectly well with "facebook/bart-large" instead of "Qwen/Qwen2.5-0.5B". But other models like "TinyLlama/TinyLlama_v1.1" suffer from the same issue as "Qwen/Qwen2.5-0.5B".

Rocketknight1 commented 1 week ago

Hi @RonanFR, in general pipelines are inference-only, so loading the model with a pipeline and then training it is a bit odd! Can you see if you still get the issue with you initialize the model with AutoModelForSequenceClassification and AutoTokenizer instead? If you can give us some clean code without pipeline that reproduces the issue, we can investigate further.

RonanFR commented 1 week ago

Thanks for your reply @Rocketknight1 ! Indeed, I tested your suggestion and it works ~~perfectly fine~~ when training the last layer only (see message below).

RonanFR commented 1 week ago

@Rocketknight1 actually I went a bit fast before writing the last message. There is no issue when only the last layer score.weight is set to trainable (i.e., for which requires_grad is set to True). But if we train other layers then the same RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16 seems to occur.

Minimum reproducible example:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import TrainingArguments
from transformers import Trainer
from datasets import load_dataset

# Load classification pipeline from pretrained model
model = AutoModelForSequenceClassification.from_pretrained(
    "TinyLlama/TinyLlama_v1.1",
    num_labels=5,
    device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama_v1.1")
tokenizer.add_special_tokens({"pad_token": tokenizer.eos_token})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

for n, p in model.named_parameters():
    if ("score" not in n) and ("q_proj" not in n):
        p.requires_grad = False

# Load + format dataset
dataset = load_dataset("yelp_review_full")["train"].select(range(100))
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        max_length=20,
        padding="max_length",
        truncation=True
    )
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Train 
training_args = TrainingArguments(
    per_device_train_batch_size=2**4,
    num_train_epochs=1,
    torch_compile=True, 
    bf16=True,
    logging_strategy="steps",
    logging_steps=1,
    output_dir="/tmp/test1",
    use_cpu=False
)
trainer = Trainer(
    model=model,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets,
    args=training_args,
    tokenizer=tokenizer,
)
trainer.train()

I am selecting only the last score layer and q_proj layers in the above code, but the same problem occurs if selecting v_proj layers for instance. Only when just the score layer is trainable is the code working without errors.

I also tried with PEFT (instead of manually setting requires_grad to True on entire layers and False on others), and the same problem occurs.

Rocketknight1 commented 6 days ago

Yes, I can reproduce the issue, but only by going back to 4.45. It's unfortunately a little awkward on the latest version - there's another issue affecting Llama model training, so I can't fully reproduce the problem on main: https://github.com/huggingface/transformers/pull/34442

huggingface / transformers