huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Regression: TorchIterableDataset doesn't have __len__ #20089

Closed · maxkrieger closed 1 year ago

maxkrieger commented 1 year ago

Reproduction

Run the Trainer with a dataset loaded with streaming=True, which makes it iterable. To make the Trainer's train_dataset work with streaming, cast it with .with_format("torch") (as suggested in https://github.com/huggingface/datasets/issues/2583#issuecomment-874078780 and here).

A simple repro is below and in this Colab:

from datasets import load_dataset
from transformers import GPT2Tokenizer, GPT2Model
from transformers import Trainer, TrainingArguments
from transformers.data.data_collator import DataCollatorForLanguageModeling

model_name = "gpt2"
output_dir = "."

# streaming=True yields an iterable dataset, which has no __len__
dataset = load_dataset("rotten_tomatoes", split="train", streaming=True).shuffle(seed=42)

tokenizer = GPT2Tokenizer.from_pretrained(model_name)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
model = GPT2Model.from_pretrained(model_name)

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=5,  # note: no max_steps is set
)

trainer = Trainer(  # raises ValueError here, at construction
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset.with_format("torch"),
)

trainer.train()
trainer.save_model()

which yields

ValueError                                Traceback (most recent call last)
<ipython-input-8-de14de894c00> in <module>
      8     args=training_args,
      9     data_collator=data_collator,
---> 10     train_dataset=dataset.with_format("torch")
     11 )

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in __init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics)
    504 
    505         if train_dataset is not None and not has_length(train_dataset) and args.max_steps <= 0:
--> 506             raise ValueError("train_dataset does not implement __len__, max_steps has to be specified")
    507 
    508         if (

ValueError: train_dataset does not implement __len__, max_steps has to be specified

Expected behavior

TorchIterableDataset should implement __len__ but doesn't. It instead has a .dataset_size method.
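
For reference, the missing __len__ can be demonstrated in isolation (a minimal sketch, not part of the original report; assumes the datasets library is installed):

from datasets import load_dataset

ds = load_dataset("rotten_tomatoes", split="train", streaming=True).with_format("torch")
try:
    len(ds)
except TypeError as e:
    # Streaming datasets define no __len__, so transformers' has_length() returns False
    print(e)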

sgugger commented 1 year ago

Thanks for opening the issue. What exactly is the regression here? On which version of Transformers did it work and when did it stop working?

As the error clearly states (copying the full error message would be helpful by the way), you need to use max_steps in your training arguments instead of num_train_epochs since your dataset doesn't have a length.
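
For concreteness, the change sgugger is pointing at looks like this (a minimal sketch continuing from the repro above; the step count is an arbitrary example, not a value from this thread):

training_args = TrainingArguments(
    output_dir=output_dir,
    max_steps=10_000,  # illustrative value; required because the iterable dataset has no length
)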

maxkrieger commented 1 year ago

Hey @sgugger, sorry for the missing information. According to this forum post, the fix for this exact error is to cast the dataset to a torch dataset, as described above. However, the error persists. I haven't bisected to confirm whether it's really a regression, but it looks like one from the code online.

Bearnardd commented 1 year ago

Hi @maxkrieger - in the forum post you linked, the training_args already contain max_steps=1e6. For your sample to work correctly, you need to both set the max_steps argument and format your dataset for PyTorch.

Bearnardd commented 1 year ago

Additionally, if I am not mistaken, once you specify the max_steps argument you can drop num_train_epochs, since it will be overridden anyway.
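
Putting the two comments together, a working variant of the original snippet would look roughly like this (a sketch continuing from the repro above; the step count is illustrative, and tokenizing the dataset and giving GPT-2 a pad token are separate concerns this thread doesn't cover):

# Only the TrainingArguments change; model, tokenizer, collator, and dataset
# are the same as in the repro above.
training_args = TrainingArguments(
    output_dir=output_dir,
    max_steps=1_000_000,  # illustrative, mirroring the 1e6 in the linked forum post
    # num_train_epochs is dropped: max_steps would override it anyway
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset.with_format("torch"),  # iterable dataset, formatted for PyTorch
)
# Trainer now constructs without the __len__ ValueError.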

maxkrieger commented 1 year ago

Aahh 🤦 somehow missed that parameter while reading the snippets, @Bearnardd. Apologies for the lack of diligence; this is resolved.