huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Regression: TorchIterableDataset doesn't have __len__ #20089

Closed · maxkrieger closed 1 year ago

maxkrieger commented 1 year ago

Reproduction

Run the Trainer with a dataset loaded with streaming=True, which makes it iterable. To make the Trainer's train_dataset work with streaming, cast it with .with_format("torch") (as suggested in https://github.com/huggingface/datasets/issues/2583#issuecomment-874078780 and here).

A simple repro is below and in this Colab:

from datasets import load_dataset
from transformers import GPT2Tokenizer, GPT2Model
from transformers import Trainer, TrainingArguments
from transformers.data.data_collator import DataCollatorForLanguageModeling

model_name = "gpt2"
output_dir = "."

# streaming=True yields an iterable dataset, which has no __len__
dataset = load_dataset("rotten_tomatoes", split="train", streaming=True).shuffle(seed=42)

tokenizer = GPT2Tokenizer.from_pretrained(model_name)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
model = GPT2Model.from_pretrained(model_name)

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=5,  # note: no max_steps is set
)

trainer = Trainer(  # raises ValueError here, at construction
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset.with_format("torch"),
)

trainer.train()
trainer.save_model()

which yields

ValueError                                Traceback (most recent call last)
<ipython-input-8-de14de894c00> in <module>
      8     args=training_args,
      9     data_collator=data_collator,
---> 10     train_dataset=dataset.with_format("torch")
     11 )

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in __init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics)
    504 
    505         if train_dataset is not None and not has_length(train_dataset) and args.max_steps <= 0:
--> 506             raise ValueError("train_dataset does not implement __len__, max_steps has to be specified")
    507 
    508         if (

ValueError: train_dataset does not implement __len__, max_steps has to be specified

Expected behavior

TorchIterableDataset should implement __len__ but doesn't. It instead has a .dataset_size method.
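
For reference, the missing __len__ can be demonstrated in isolation (a minimal sketch, not part of the original report; assumes the datasets library is installed):

from datasets import load_dataset

ds = load_dataset("rotten_tomatoes", split="train", streaming=True).with_format("torch")
try:
    len(ds)
except TypeError as e:
    # Streaming datasets define no __len__, so transformers' has_length() returns False
    print(e)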

sgugger commented 1 year ago

Thanks for opening the issue. What exactly is the regression here? On which version of Transformers did it work and when did it stop working?

As the error clearly states (copying the full error message would be helpful by the way), you need to use max_steps in your training arguments instead of num_train_epochs since your dataset doesn't have a length.
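
For concreteness, the change sgugger is pointing at looks like this (a minimal sketch continuing from the repro above; the step count is an arbitrary example, not a value from this thread):

training_args = TrainingArguments(
    output_dir=output_dir,
    max_steps=10_000,  # illustrative value; required because the iterable dataset has no length
)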

maxkrieger commented 1 year ago

Hey @sgugger, sorry for the missing information. According to this forum post, the fix for this exact error is to cast the dataset to a torch dataset, as described above. However, the error persists. I haven't bisected to confirm whether it's really a regression, but it looks like one from the code online.

Bearnardd commented 1 year ago

Hi @maxkrieger - in the forum post you linked, the training_args already contain max_steps=1e6. For your sample to work correctly, you need to both set the max_steps argument and format your dataset for PyTorch.

Bearnardd commented 1 year ago

Additionally, if I am not mistaken, once you specify the max_steps argument you can drop num_train_epochs, since it will be overridden anyway.
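
Putting the two comments together, a working variant of the original snippet would look roughly like this (a sketch continuing from the repro above; the step count is illustrative, and tokenizing the dataset and giving GPT-2 a pad token are separate concerns this thread doesn't cover):

# Only the TrainingArguments change; model, tokenizer, collator, and dataset
# are the same as in the repro above.
training_args = TrainingArguments(
    output_dir=output_dir,
    max_steps=1_000_000,  # illustrative, mirroring the 1e6 in the linked forum post
    # num_train_epochs is dropped: max_steps would override it anyway
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset.with_format("torch"),  # iterable dataset, formatted for PyTorch
)
# Trainer now constructs without the __len__ ValueError.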

maxkrieger commented 1 year ago

Aahh 🤦 somehow missed that parameter while reading the snippets, @Bearnardd. Apologies for the lack of diligence; this is resolved.