RylanSchaeffer opened this issue 2 months ago
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Bump!
Gentle ping @SunMarc @muellerzr
After further investigation, the problem seems to go beyond just `data_seed` and affects `seed` as well: the Trainer does not appear to be deterministic at all. Running this script twice produces different results:
```python
from datasets import load_dataset
from transformers import AutoTokenizer, TrainingArguments, AutoModelForSequenceClassification, Trainer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

dataset = load_dataset("glue", "mrpc", split="train")

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

dataset = dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments("test-trainer", max_steps=200, logging_steps=10, seed=123, data_seed=456)
trainer = Trainer(
    model,
    training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```
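To make "not deterministic" concrete, one quick check (my suggestion, not part of the original report) is to print the logged loss trajectory after each run and compare; `trainer.state.log_history` holds the logged metrics:

```python
# Compare this list across the two runs; on a non-deterministic setup
# the loss values diverge after the first few logging steps.
losses = [log["loss"] for log in trainer.state.log_history if "loss" in log]
print(losses)
```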
`transformers` version: 4.45.0.dev0 (https://github.com/huggingface/transformers/commit/52cb4034ada381fe1ffe8d428a1076e5411a8026)

I manually use `transformers`' `set_seed()` function:
```python
from transformers import set_seed

set_seed(seed=wandb_config["seed"], deterministic=True)
```
Previously, I recall being able to use this ^ to achieve determinism, but perhaps I am misremembering, or perhaps it only works in certain cases.
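For context, `set_seed(..., deterministic=True)` alone may not be enough on CUDA. Here is a rough sketch of the settings usually involved in bitwise-reproducible runs (the helper name is mine; `transformers` also ships `enable_full_determinism`, which bundles similar settings):

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:  # hypothetical helper, for illustration
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Required for deterministic cuBLAS kernels on CUDA >= 10.2.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```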
You're right @RylanSchaeffer. I think it would make sense to add `set_seed` to our scripts in TRL.
Anyway, your initial question remains:
```python
from datasets import load_dataset
from transformers import AutoTokenizer, TrainingArguments, AutoModelForSequenceClassification, Trainer, set_seed

set_seed(123)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

dataset = load_dataset("glue", "mrpc", split="train")

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

dataset = dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments("test-trainer", max_steps=200, logging_steps=10, seed=124, data_seed=789)
trainer = Trainer(
    model,
    training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```
I received an automated GitHub email:

> This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Yes, please.
@MekkCyber is looking into the issue! Thanks for your patience.
System Info
`transformers` version: 4.42.3

Who can help?
@muellerzr @SunMarc
Information
Tasks

An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Run training twice, varying only `data_seed`, e.g. `--data_seed=0` for one run and `--data_seed=1` for another run.
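The same reproduction spelled out as a self-contained script (a sketch mirroring the snippets earlier in this thread; the output directory names are mine):

```python
# Two runs identical except for data_seed.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("glue", "mrpc", split="train")
dataset = dataset.map(
    lambda ex: tokenizer(ex["sentence1"], ex["sentence2"], truncation=True), batched=True
)

for data_seed in (0, 1):
    # Reload the model so each run starts from the same initialization.
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
    args = TrainingArguments(
        f"test-trainer-ds{data_seed}",
        max_steps=200,
        logging_steps=10,
        seed=123,             # held fixed
        data_seed=data_seed,  # the only varied argument
    )
    Trainer(model, args, train_dataset=dataset, tokenizer=tokenizer).train()
```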
Expected behavior

I would expect `data_seed` to make a difference. However, it does not. I opened an issue with TRL, but after looking at `transformers` more closely, I can't actually find where `data_seed` is used. Could someone please check whether `data_seed` is actually used?
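For reference, the behavior documented for `data_seed` would amount to plumbing like the following; this is an illustrative sketch of the expected wiring, not the actual `Trainer` source:

```python
import torch
from torch.utils.data import DataLoader, RandomSampler

def build_train_dataloader(dataset, args, batch_size):  # hypothetical helper
    # Expected behavior: data_seed drives the data-order randomness,
    # falling back to the model seed when it is unset.
    generator = torch.Generator()
    generator.manual_seed(args.data_seed if args.data_seed is not None else args.seed)
    sampler = RandomSampler(dataset, generator=generator)
    return DataLoader(dataset, sampler=sampler, batch_size=batch_size)
```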