huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[Bug - I think] `data_seed` in `TrainingArguments` is unused #31818

Open RylanSchaeffer opened 2 months ago

RylanSchaeffer commented 2 months ago

System Info

Who can help?

@muellerzr @SunMarc

Information

Tasks

Reproduction

  1. Run this simple script https://github.com/huggingface/trl/blob/main/examples/scripts/reward_modeling.py, sweeping `data_seed`, e.g. `--data_seed=0` for one run and `--data_seed=1` for another.

Expected behavior

I would expect data_seed to make a difference. However, it does not:

[image: runs with different data_seed values produce identical results]

I opened an issue with TRL, but after looking at transformers more closely, I can't actually find where `data_seed` is used. Could someone please check whether `data_seed` is actually used?
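For what it's worth, one quick way to check this empirically is to build the Trainer's train dataloader under two different `data_seed` values and compare the first batch it yields. This is a minimal sketch of that check, not code from the report above; the dummy dataset and the `first_batch_ids` helper are illustrative assumptions:

# Minimal sketch (assumption, not from the issue): compare the first training batch
# produced under two different data_seed values. If data_seed reaches the sampler,
# the two batches should generally contain different examples.
import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tiny dummy dataset so differences in shuffling order are easy to spot.
texts = [f"example sentence number {i}" for i in range(64)]
dataset = Dataset.from_dict({"text": texts, "label": [i % 2 for i in range(64)]})
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True), batched=True)

def first_batch_ids(data_seed):
    # Hypothetical helper: model seed fixed, only data_seed varies.
    args = TrainingArguments("tmp-data-seed-check", per_device_train_batch_size=8,
                             seed=0, data_seed=data_seed, report_to="none")
    trainer = Trainer(model=model, args=args, train_dataset=dataset, tokenizer=tokenizer)
    batch = next(iter(trainer.get_train_dataloader()))
    return batch["input_ids"][:, :5]  # the first few tokens identify which rows were sampled

# True here means both data_seed values yielded the same first batch (i.e. data_seed had no effect).
print(torch.equal(first_batch_ids(0), first_batch_ids(1)))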

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

RylanSchaeffer commented 1 month ago

Bump!

amyeroberts commented 1 month ago

Gentle ping @SunMarc @muellerzr

qgallouedec commented 1 month ago

After further investigation, the problem seems to go beyond just `data_seed`; it affects `seed` as well:

It seems like the Trainer is not deterministic.

Running this twice:

from datasets import load_dataset
from transformers import AutoTokenizer, TrainingArguments, AutoModelForSequenceClassification, Trainer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

dataset = load_dataset("glue", "mrpc", split="train")

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

dataset = dataset.map(tokenize_function, batched=True)

# Both seeds are fixed, so two runs of this script should produce identical logs.
training_args = TrainingArguments("test-trainer", max_steps=200, logging_steps=10, seed=123, data_seed=456)
trainer = Trainer(
    model,
    training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()
[screenshot: the two runs produce different training logs]

System info

RylanSchaeffer commented 1 month ago

I manually use transformers' `set_seed()` function:

from transformers import set_seed

set_seed(seed=wandb_config["seed"], deterministic=True)

RylanSchaeffer commented 1 month ago

Previously, I recall being able to use this ^ to achieve determinism, but perhaps I am misremembering, or perhaps it only works in certain cases.
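For reference, this is roughly the set of knobs I believe `set_seed(..., deterministic=True)` is meant to cover (an assumption on my part, not a quote of the transformers source), written out as plain PyTorch/NumPy calls:

# Rough sketch of the usual determinism knobs (assumption; not the transformers source).
import os
import random

import numpy as np
import torch

seed = 123
random.seed(seed)                 # Python RNG
np.random.seed(seed)              # NumPy RNG
torch.manual_seed(seed)           # CPU RNG
torch.cuda.manual_seed_all(seed)  # all CUDA devices

# Force deterministic kernels; some CUDA ops also require this env var.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Even with all of this in place, the data order would still not depend on `data_seed` if that argument never reaches the sampler, which is the original question here.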

qgallouedec commented 1 month ago

You're right @RylanSchaeffer. I think it would make sense to add `set_seed` to the script in TRL. In any case, your initial question remains:

from datasets import load_dataset
from transformers import AutoTokenizer, TrainingArguments, AutoModelForSequenceClassification, Trainer, set_seed

set_seed(123)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

dataset = load_dataset("glue", "mrpc", split="train")

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

dataset = dataset.map(tokenize_function, batched=True)

# With set_seed fixing global randomness above, seed/data_seed are varied here to test whether they have any effect.
training_args = TrainingArguments("test-trainer", max_steps=200, logging_steps=10, seed=124, data_seed=789)
trainer = Trainer(
    model,
    training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()
[screenshot: training logs from the run above]

RylanSchaeffer commented 1 week ago

I received an automated GitHub email:

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Yes, please.

SunMarc commented 6 days ago

@MekkCyber is looking into the issue! Thanks for your patience.