cc @muellerzr but we will need a full reproducer to be able to help.
Thank you. My code is as follows:
def train_model(model, train_dataset, eval_dataset, epochs=5, batch_size=1):
    training_args = TrainingArguments(
        output_dir="outputs/",
        overwrite_output_dir=True,
        num_train_epochs=epochs,
        max_steps=100000,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        eval_accumulation_steps=8,
        save_strategy="steps",
        save_steps=500,
        evaluation_strategy="steps",
        eval_steps=100,
        logging_steps=20,
        logging_dir="logs",
        learning_rate=8e-5,
        gradient_accumulation_steps=8,
        fp16=True,
        do_train=True,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    train_result = trainer.train()
    trainer.save_model()
    trainer.log_metrics("train", train_result.metrics)
    metrics = trainer.evaluate()
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)
def tokenize_function(examples):
    max_len = max_txt_len + 128
    output = model.tokenizer(examples["text"], truncation=True, max_length=max_len, padding=False)
    output["labels"] = [list(e) for e in output["input_ids"]]
    return output
def main():
    train_file = "train.jsonl"
    eval_file = "valid.jsonl"
    dataset = load_dataset("json", data_files={"train": train_file, "eval": eval_file}, streaming=True)
    dataset = dataset.with_format("torch")
    train_dataset = dataset["train"]
    eval_dataset = dataset["eval"]
    train_dataset = train_dataset.map(tokenize_function, batched=True)
    eval_dataset = eval_dataset.map(tokenize_function, batched=True)
    train_model(model, train_dataset, eval_dataset)

main()
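For reference, the reproducer above appears to rely on imports along these lines (they are not shown in the issue); model and max_txt_len are defined elsewhere by the reporter:

from datasets import load_dataset
from transformers import Trainer, TrainingArguments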
How did this work in 4.29 if you are not providing a data_collator to the Trainer and not padding your texts?
Since per_device_train_batch_size=1 in my code, it runs properly in 4.29.2 and 4.30.2 even though I didn't pad and didn't provide a data_collator.
It only fails in 4.31.0.
BTW, in 4.31.0 it still fails even if I provide a data_collator. It only works if I pre-pad all the sequences to the same length in tokenize_function().
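For context, the pre-padding workaround described above amounts to padding every example to one fixed length inside tokenize_function. A minimal sketch, assuming model.tokenizer is a standard Hugging Face tokenizer that accepts padding="max_length" (the exact length choice is an assumption):

def tokenize_function(examples):
    max_len = max_txt_len + 128
    # Pad every sequence to the same fixed length so all per-process
    # batches have identical shapes, avoiding the concatenate error.
    output = model.tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_len,
        padding="max_length",
    )
    output["labels"] = [list(e) for e in output["input_ids"]]
    return output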
Ah yes, understood. This doesn't work anymore because Accelerate will by default use dispatch_batches=True for iterable datasets, which builds the batch on process 0 (with a batch size of 4 here since you have 4 processes) and then splits it to send it to each GPU.
@muellerzr what we need is to surface the option dispatch_batches=False here.
I think if you add the line trainer.accelerator.dispatch_batches=False it will work again @getao
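Concretely, a sketch of where that line would go in the reproducer's train_model (abbreviated; this just applies the suggestion above):

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
# Per the suggestion above: let each process build its own batches from the
# streaming dataset instead of process 0 building a combined batch and
# splitting it across GPUs.
trainer.accelerator.dispatch_batches = False
train_result = trainer.train()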
Oh, I see! Thank you very much for your help!
Thanks @getao! #25038 should solve this; once merged, just set args.dispatch_batches=False and your code should run just fine.
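Once that PR is in a release, the suggested usage would look roughly like this (assuming, as the comment implies, that dispatch_batches becomes settable on TrainingArguments):

training_args = TrainingArguments(
    output_dir="outputs/",
    per_device_train_batch_size=1,
    # ... remaining arguments as in the reproducer ...
)
# Keep per-process batching for the streaming (iterable) dataset.
training_args.dispatch_batches = False
trainer = Trainer(model=model, args=training_args,
                  train_dataset=train_dataset, eval_dataset=eval_dataset)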
System Info
transformers==4.31.0, pytorch==1.13.1
Who can help?
Hi @sgugger and @ArthurZucker
When I used transformers 4.29.2 and 4.30.2 with a streaming dataset and local batch size = 1, I didn't pad the text sequences and everything went well.
However, after I upgraded transformers to 4.31.0, my previous training pipeline fails. The error messages are:
File "myenv/lib/python3.8/site-packages/accelerate/data_loader.py", line 556, in iter next_batch, next_batch_info = self._fetch_batches(main_iterator) File "myenv/lib/python3.8/site-packages/accelerate/data_loader.py", line 520, in _fetch_batches batch = concatenate(batches, dim=0) File "myenv/lib/python3.8/site-packages/accelerate/utils/operations.py", line 441, in concatenate return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()}) File "myenv/lib/python3.8/site-packages/accelerate/utils/operations.py", line 441, in
return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
File "myenv/lib/python3.8/site-packages/accelerate/utils/operations.py", line 444, in concatenate
return torch.cat(data, dim=dim)
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 655 but got size 563 for tensor number 1 in the list.
I find that in the following function in data_loader.py (from accelerate), the variable "batches" contains examples with different lengths, which causes the error. For example, I trained my model on 4 GPUs with local batch size = 1. The list "batches" then has 4 elements (each a batch of 1 example), but these 4 elements may have different lengths, causing the above error when concatenating. However, since my local batch size is 1, there should be no need to make the samples the same length. I think this is a bug introduced in 4.31.0, because in previous transformers versions (e.g., 4.29.2 and 4.30.2) the training script ran smoothly without raising this error. I look forward to your comments and suggestions. Thank you.
def _fetch_batches(self, iterator):
    batches, batch = None, None
    # On process 0, we gather the batch to dispatch.
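To illustrate the failure mode with a standalone sketch (the tensor sizes below are made up to match the error message): process 0 collects one batch per process and concatenates them along the batch dimension, which fails when the un-padded sequence lengths differ.

import torch

# Four per-process batches of batch size 1 with different sequence lengths,
# as produced by the un-padded streaming dataset on 4 GPUs (sizes made up).
batches = [torch.zeros(1, n, dtype=torch.long) for n in (655, 563, 601, 598)]

# Mirrors the torch.cat call inside accelerate's concatenate(): all dimensions
# except dim 0 must match, so this raises
# "RuntimeError: Sizes of tensors must match except in dimension 0."
torch.cat(batches, dim=0)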
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
dataset = load_dataset("json", data_files={"train": train_file, "eval": eval_file}, streaming=True)
dataset = dataset.with_format("torch")
train_dataset = dataset["train"]
eval_dataset = dataset["eval"]
train_dataset = train_dataset.map(tokenize_function, batched=True)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)
train_model(model, train_dataset, eval_dataset)
Expected behavior
No error messages; training runs smoothly.