Closed: tothemoon96 closed this issue 5 months ago.
Hello, could you please trim it down to a minimal reproducer example instead of the whole codebase? Also, please explain the changes that you are making in train_rm_bug.py vs. train_rm.py.
Thanks for your response. The differences between train_rm_bug.py and train_rm.py are as follows (both construction styles are sketched below):

1. In train_rm_bug.py, I inherit from TrainingArguments and create the instance with HfArgumentParser.
2. In train_rm.py, I create the instance of TrainingArguments directly via its __init__ method.
3. train_rm_bug.py is launched with torchrun, while train_rm.py is launched with accelerate launch.
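For context, here is a minimal sketch of the two construction styles described above (the dataclass name, its extra field, and all argument values are hypothetical, not taken from the repo):

from dataclasses import dataclass, field
from transformers import HfArgumentParser, TrainingArguments

# Style used in train_rm_bug.py: subclass TrainingArguments and let
# HfArgumentParser build the instance from the command line (e.g. under torchrun).
@dataclass
class RMTrainingArguments(TrainingArguments):
    reward_model_name: str = field(default="bert-base-cased")  # hypothetical extra field

parser = HfArgumentParser(RMTrainingArguments)
(cli_args,) = parser.parse_args_into_dataclasses()

# Style used in train_rm.py: call TrainingArguments.__init__ directly
# (e.g. under accelerate launch).
direct_args = TrainingArguments(
    output_dir="rm_output",          # hypothetical
    per_device_train_batch_size=32,  # hypothetical
)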
By the way, I have removed the irrelevant files from the aforementioned GitHub repository; it now contains only minimal reproducers.

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I can confirm this issue still exists.
A related issue was reported by another user (in the wrong repo, though): https://github.com/bentoml/OpenLLM/issues/236
@aiden-leong please give us a minimal reproducer script of what you have going, like asked earlier (unless it's the same code as theirs).
import platform

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, BertForQuestionAnswering, Trainer, TrainingArguments

def run_model():
    device = torch.device("mps")  # NOTE: unused; Trainer selects the device itself
    print(platform.python_version())
    print(torch.cuda.is_available())

    model_checkpoint = "bert-base-cased"
    raw_datasets_squad = load_dataset("squad", split="train")
    squad = raw_datasets_squad.train_test_split(test_size=0.2)
    train_dataset_squad = squad["train"]
    validation_dataset_squad = squad["test"]

    tokenizer_squad = AutoTokenizer.from_pretrained(model_checkpoint)
    aiden_model_qa = BertForQuestionAnswering.from_pretrained(model_checkpoint)

    args_squad = TrainingArguments(
        "bert-finetuned-squad",
        evaluation_strategy="no",
        save_strategy="epoch",
        learning_rate=2e-5,
        num_train_epochs=3,
        per_device_train_batch_size=32,
        max_steps=1000,
        logging_steps=100,
        weight_decay=0.01,
        fp16=False,
        push_to_hub=False,
    )
    # The raw, untokenized dataset is passed straight to Trainer here;
    # the preprocessing step via dataset.map is missing (see discussion below).
    trainer = Trainer(
        model=aiden_model_qa,
        args=args_squad,
        train_dataset=train_dataset_squad,
        # eval_dataset=validation_dataset_squad,
        tokenizer=tokenizer_squad,
        # compute_metrics=compute_metrics,
    )
    trainer.train()

run_model()
Colab: https://colab.research.google.com/drive/1-HySqELhFI4IqaGk7Sv_PFRcmSoZJgTa?usp=sharing
It's pretty clear that the missing dataset.map call is the root cause of this issue, but maybe we can provide some hint for debugging?
ref: https://huggingface.co/docs/transformers/tasks/question_answering#preprocess
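For reference, here is a sketch of the missing preprocessing step, adapted from the linked question-answering docs (tokenizer_squad and train_dataset_squad refer to the reproducer above; max_length=384 is the value used in the docs, not something from this thread):

def preprocess_squad(examples):
    # Tokenize question/context pairs; keep offsets to map answers to token spans.
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer_squad(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )
    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions, end_positions = [], []
    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)
        # Locate the context portion of the tokenized sequence.
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1
        # Label (0, 0) if the answer is not fully inside the context window.
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise map the character span to start/end token positions.
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

train_dataset_squad = train_dataset_squad.map(
    preprocess_squad, batched=True, remove_columns=train_dataset_squad.column_names
)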
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
For me, the same error started occurring when I included model = torch.compile(model) before training. Apparently this somehow resets the number of rows seen by the query_table function in datasets/formatting/formatting.py to 0, unless remove_unused_columns=False is also passed via TrainingArguments; see this issue: https://github.com/huggingface/transformers/issues/27106
(The correct way to get torch.compile working in my case seems to be passing torch_compile=True to TrainingArguments, which does not have these weird side effects.)
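A minimal sketch of the two approaches described above (output_dir is illustrative):

from transformers import TrainingArguments

# Preferred: let Trainer compile the model internally.
args = TrainingArguments(
    output_dir="out",
    torch_compile=True,
)

# Workaround when wrapping manually with model = torch.compile(model):
args_manual = TrainingArguments(
    output_dir="out",
    remove_unused_columns=False,  # avoids the 0-row side effect described above
)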
Buggy output

System Info

Information

Tasks

no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)

Reproduction
I have packaged my environment in tothemoon/temp:20230917. After entering the Docker environment, please clone https://github.com/tothemoon96/rlhf.git
Expected behavior
The normal run of train_rm.py, as commented in script/rm_test.sh, should be identical to train_rm_bug.py, without exceptions.