huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[Bug - I think] `data_seed` in `TrainingArguments` is unused #31818

Open RylanSchaeffer opened 2 months ago

RylanSchaeffer commented 2 months ago

System Info

Who can help?

@muellerzr @SunMarc

Information

Tasks

Reproduction

  1. Run this simple script https://github.com/huggingface/trl/blob/main/examples/scripts/reward_modeling.py, sweeping `data_seed`, e.g. `--data_seed=0` for one run and `--data_seed=1` for another.

Expected behavior

I would expect data_seed to make a difference. However, it does not:

[image: runs with different data_seed values produce identical results]

I opened an issue with TRL, but after looking at transformers more closely, I can't actually find where `data_seed` is used. Could someone please check whether `data_seed` is actually used?
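For what it's worth, one quick way to check this empirically is to build the Trainer's train dataloader under two different `data_seed` values and compare the first batch it yields. This is a minimal sketch of that check, not code from the report above; the dummy dataset and the `first_batch_ids` helper are illustrative assumptions:

# Minimal sketch (assumption, not from the issue): compare the first training batch
# produced under two different data_seed values. If data_seed reaches the sampler,
# the two batches should generally contain different examples.
import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tiny dummy dataset so differences in shuffling order are easy to spot.
texts = [f"example sentence number {i}" for i in range(64)]
dataset = Dataset.from_dict({"text": texts, "label": [i % 2 for i in range(64)]})
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True), batched=True)

def first_batch_ids(data_seed):
    # Hypothetical helper: model seed fixed, only data_seed varies.
    args = TrainingArguments("tmp-data-seed-check", per_device_train_batch_size=8,
                             seed=0, data_seed=data_seed, report_to="none")
    trainer = Trainer(model=model, args=args, train_dataset=dataset, tokenizer=tokenizer)
    batch = next(iter(trainer.get_train_dataloader()))
    return batch["input_ids"][:, :5]  # the first few tokens identify which rows were sampled

# True here means both data_seed values yielded the same first batch (i.e. data_seed had no effect).
print(torch.equal(first_batch_ids(0), first_batch_ids(1)))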

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

RylanSchaeffer commented 1 month ago

Bump!

amyeroberts commented 1 month ago

Gentle ping @SunMarc @muellerzr

qgallouedec commented 1 month ago

After further investigation, the problem seems to go beyond just `data_seed`; it affects `seed` as well:

It seems like the Trainer is not deterministic.

Running this twice:

from datasets import load_dataset
from transformers import AutoTokenizer, TrainingArguments, AutoModelForSequenceClassification, Trainer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

dataset = load_dataset("glue", "mrpc", split="train")

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

dataset = dataset.map(tokenize_function, batched=True)

# Both seeds are fixed, so two runs of this script should produce identical logs.
training_args = TrainingArguments("test-trainer", max_steps=200, logging_steps=10, seed=123, data_seed=456)
trainer = Trainer(
    model,
    training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()
[screenshot: the two runs produce different training logs]

System info

RylanSchaeffer commented 1 month ago

I manually use transformers' `set_seed()` function:

from transformers import set_seed

set_seed(seed=wandb_config["seed"], deterministic=True)

RylanSchaeffer commented 1 month ago

Previously, I recall being able to use this ^ to achieve determinism, but perhaps I am misremembering, or perhaps it only works in certain cases.
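For reference, this is roughly the set of knobs I believe `set_seed(..., deterministic=True)` is meant to cover (an assumption on my part, not a quote of the transformers source), written out as plain PyTorch/NumPy calls:

# Rough sketch of the usual determinism knobs (assumption; not the transformers source).
import os
import random

import numpy as np
import torch

seed = 123
random.seed(seed)                 # Python RNG
np.random.seed(seed)              # NumPy RNG
torch.manual_seed(seed)           # CPU RNG
torch.cuda.manual_seed_all(seed)  # all CUDA devices

# Force deterministic kernels; some CUDA ops also require this env var.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Even with all of this in place, the data order would still not depend on `data_seed` if that argument never reaches the sampler, which is the original question here.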

qgallouedec commented 1 month ago

You're right @RylanSchaeffer. I think it would make sense to add `set_seed` to the script in TRL. In any case, your initial question remains:

from datasets import load_dataset
from transformers import AutoTokenizer, TrainingArguments, AutoModelForSequenceClassification, Trainer, set_seed

set_seed(123)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

dataset = load_dataset("glue", "mrpc", split="train")

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

dataset = dataset.map(tokenize_function, batched=True)

# With set_seed fixing global randomness above, seed/data_seed are varied here to test whether they have any effect.
training_args = TrainingArguments("test-trainer", max_steps=200, logging_steps=10, seed=124, data_seed=789)
trainer = Trainer(
    model,
    training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()
[screenshot: training logs from the run above]

RylanSchaeffer commented 1 week ago

I received an automated GitHub email:

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Yes, please.

SunMarc commented 6 days ago

@MekkCyber is looking into the issue! Thanks for your patience.