huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

dataloading bug after upgrading to 4.31.0 #24999

Closed · getao closed this issue 1 year ago

getao commented 1 year ago

System Info

transformers=4.31.0 pytorch=1.13.1

Who can help?

Hi @sgugger and @ArthurZucker

When I used transformers 4.29.2 and 4.30.2 with a streaming dataset and a local batch size of 1, I didn't pad the text sequences and everything went well.

However, after I upgraded transformers to 4.31.0, my previous training pipeline fails. The error messages are:

File "myenv/lib/python3.8/site-packages/accelerate/data_loader.py", line 556, in iter next_batch, next_batch_info = self._fetch_batches(main_iterator) File "myenv/lib/python3.8/site-packages/accelerate/data_loader.py", line 520, in _fetch_batches batch = concatenate(batches, dim=0) File "myenv/lib/python3.8/site-packages/accelerate/utils/operations.py", line 441, in concatenate return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()}) File "myenv/lib/python3.8/site-packages/accelerate/utils/operations.py", line 441, in return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()}) File "myenv/lib/python3.8/site-packages/accelerate/utils/operations.py", line 444, in concatenate return torch.cat(data, dim=dim) RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 655 but got size 563 for tensor number 1 in the list.

I find that in the following function in data_loader.py (from accelerate), the variable "batches" contains batches with different sequence lengths, which causes the error. For example, I train my model on 4 GPUs with a local batch size of 1. The list "batches" then has 4 elements (each a batch of 1 example), but these 4 elements may have different lengths, which triggers the error above when they are concatenated. Since my local batch size is 1, there should be no need to pad the samples to the same length. I think this is a bug introduced in 4.31.0, because on previous transformers versions (e.g., 4.29.2 and 4.30.2) the training script runs smoothly without raising this error. I look forward to your comments and suggestions. Thank you.

def _fetch_batches(self, iterator):
    batches, batch = None, None
    # On process 0, we gather the batch to dispatch.
    if self.state.process_index == 0:
        try:
            if self.split_batches:
                # One batch of the main iterator is dispatched and split.
                batch = next(iterator)
            else:
                # num_processes batches of the main iterator are concatenated then dispatched and split.
                # We add the batches one by one so we have the remainder available when drop_last=False.
                batches = []
                for _ in range(self.state.num_processes):
                    batches.append(next(iterator))
                batch = concatenate(batches, dim=0)
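
A minimal sketch of the failure outside the Trainer (the lengths 655 and 563 come from the error above; the other two are made up for illustration):

import torch

# Hypothetical unpadded batches of size 1, as produced by 4 processes with
# per-device batch size 1 and no padding: sequence lengths differ per example.
batches = [
    {"input_ids": torch.randint(0, 1000, (1, 655))},
    {"input_ids": torch.randint(0, 1000, (1, 563))},
    {"input_ids": torch.randint(0, 1000, (1, 600))},
    {"input_ids": torch.randint(0, 1000, (1, 700))},
]

# This mirrors what accelerate's concatenate() ends up doing in _fetch_batches:
# torch.cat along dim=0 requires every other dimension to match, so it raises
# "Sizes of tensors must match except in dimension 0".
torch.cat([b["input_ids"] for b in batches], dim=0)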

Reproduction

dataset = load_dataset("json", data_files={"train": train_file, "eval": eval_file}, streaming=True)
dataset = dataset.with_format("torch")
train_dataset = dataset["train"]
eval_dataset = dataset["eval"]

train_dataset = train_dataset.map(tokenize_function, batched=True)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)
train_model(model, train_dataset, eval_dataset)

Expected behavior

No error message; training proceeds smoothly.

sgugger commented 1 year ago

cc @muellerzr but we will need a full reproducer to be able to help.

getao commented 1 year ago

Thank you. My code is as follows:

def train_model(model, train_dataset, eval_dataset, epochs=5, batch_size=1):
    training_args = TrainingArguments(
        output_dir="outputs/",
        overwrite_output_dir=True,
        num_train_epochs=epochs,
        max_steps=100000,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        eval_accumulation_steps=8,
        save_strategy="steps",
        save_steps=500,
        evaluation_strategy="steps",
        eval_steps=100,
        logging_steps=20,
        logging_dir="logs",
        learning_rate=8e-5,
        gradient_accumulation_steps=8,
        fp16=True,
        do_train=True,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )

    train_result = trainer.train()
    trainer.save_model()
    trainer.log_metrics("train", train_result.metrics)
    metrics = trainer.evaluate()
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

def tokenize_function(examples):
    max_len = max_txt_len + 128
    output = model.tokenizer(examples["text"], truncation=True, max_length=max_len, padding=False)
    output["labels"] = [list(e) for e in output["input_ids"]]
    return output

def main():
    train_file = "train.jsonl"
    eval_file = "valid.jsonl"
    dataset = load_dataset("json", data_files={"train": train_file, "eval": eval_file}, streaming=True)
    dataset = dataset.with_format("torch")
    train_dataset = dataset["train"]
    eval_dataset = dataset["eval"]

    train_dataset = train_dataset.map(tokenize_function, batched=True)
    eval_dataset = eval_dataset.map(tokenize_function, batched=True)
    train_model(model, train_dataset, eval_dataset)

main()

sgugger commented 1 year ago

How did this work in 4.29 if you are not providing a data_collator to the Trainer and not padding your texts?

getao commented 1 year ago

> How did this work in 4.29 if you are not providing a data_collator to the Trainer and not padding your texts?

Since per_device_train_batch_size=1 in my code, it runs properly on 4.29.2 and 4.30.2 even though I don't pad and don't provide a data_collator.

It only fails on 4.31.0.

BTW, on 4.31.0 it still fails even if I provide a data_collator. It only works if I pre-pad all the sequences to the same length in tokenize_function().
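
For reference, the pre-padding workaround is roughly the following change to tokenize_function (a sketch; padding="max_length" pads every sequence to the same fixed length):

def tokenize_function(examples):
    max_len = max_txt_len + 128
    # Pre-pad every sequence to max_len so the batches gathered on process 0
    # can be concatenated without a size mismatch.
    output = model.tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_len,
        padding="max_length",
    )
    output["labels"] = [list(e) for e in output["input_ids"]]
    return output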

sgugger commented 1 year ago

Ah yes, understood. This doesn't work anymore because Accelerate will by default use dispatch_batches=True for iterable datasets, which builds the batch on process 0 (with a batch size of 4 here since you have 4 processes) and then splits it to send a slice to each GPU.

@muellerzr what we need is to surface the option dispatch_batches=False here.

I think if you add the line trainer.accelerator.dispatch_batches = False, it will work again @getao
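
For example (a sketch, placed right after the Trainer is constructed in train_model):

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
# Let each process pull its own (unpadded) batch from the streaming dataset
# instead of building one concatenated batch on process 0 and splitting it.
trainer.accelerator.dispatch_batches = False

train_result = trainer.train()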

getao commented 1 year ago

> Ah yes, understood. This doesn't work anymore because Accelerate will by default use dispatch_batches=True for iterable datasets, which builds the batch on process 0 (with a batch size of 4 here since you have 4 processes) and then splits it to send a slice to each GPU.
>
> @muellerzr what we need is to surface the option dispatch_batches=False here.
>
> I think if you add the line trainer.accelerator.dispatch_batches = False, it will work again @getao

Oh, I see! Thank you very much for your help!

muellerzr commented 1 year ago

Thanks @getao! #25038 should solve this; once it is merged, just set args.dispatch_batches=False and your code should run just fine.
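
Once merged, the change would look roughly like this (a sketch, assuming the PR exposes the flag as a TrainingArguments field named dispatch_batches):

training_args = TrainingArguments(
    output_dir="outputs/",
    per_device_train_batch_size=1,
    # Hand each process its own batch instead of concatenating
    # num_processes batches on process 0 and splitting the result.
    dispatch_batches=False,
    # ... other arguments as before ...
)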