huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

ValueError during training with streaming dataset. #33674

Closed mxjmtxrm closed 2 weeks ago

mxjmtxrm commented 1 month ago

Who can help?

@muellerzr @SunMarc

Reproduction

I want to train with a streaming dataset because my dataset is very large. The code looks like this:

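from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer

# model, tokenizer, and training_args are created elsewhere (not shown).
dataset = load_dataset('data_path', streaming=True)
train_dataset = dataset['train']       # split names are assumed; the original snippet did not show them
valid_dataset = dataset['validation']
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=valid_dataset if training_args.do_eval else None,
    data_collator=data_collator,
)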

I got the following error:

File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 611, in __init__
    raise ValueError(
ValueError: The train_dataset does not implement __len__, max_steps has to be specified. The number of steps needs to be known in advance for the learning rate scheduler.

How can I solve this problem? Or is there another way to train with large datasets?

Expected behavior

-

LysandreJik commented 1 month ago

Hello! Have you followed this example to use with iterable datasets? https://huggingface.co/docs/datasets/v2.14.5/en/stream#stream-in-a-training-loop

cc @lhoestq

lhoestq commented 1 month ago

As the error suggests, you should specify max_steps. The learning rate scheduler needs to know how many steps your training will run, and since the size of a streaming dataset often can't be known in advance, you have to set max_steps manually.
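For illustration, here is a minimal sketch of one way to pick max_steps, assuming you have at least a rough estimate of how many examples the stream contains. The helper names estimate_max_steps and build_training_args and the example numbers are hypothetical and not part of the library. TrainingArguments and its max_steps, per_device_train_batch_size, and gradient_accumulation_steps parameters come from transformers.

```python
def estimate_max_steps(num_examples, per_device_batch_size,
                       gradient_accumulation_steps=1, num_devices=1,
                       num_epochs=1):
    """Optimizer steps needed to see `num_examples` examples `num_epochs` times.

    Hypothetical helper: `num_examples` is your own estimate, since a
    streaming dataset cannot report its length.
    """
    examples_per_step = (per_device_batch_size
                         * gradient_accumulation_steps
                         * num_devices)
    total_examples = num_examples * num_epochs
    # Integer ceiling division, so a partial last batch still counts as a step.
    return (total_examples + examples_per_step - 1) // examples_per_step


def build_training_args(output_dir, max_steps, per_device_batch_size,
                        gradient_accumulation_steps=1):
    """Hypothetical wrapper that passes max_steps to TrainingArguments."""
    # Imported lazily so the arithmetic above works without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(
        output_dir=output_dir,
        max_steps=max_steps,  # required when train_dataset has no __len__
        per_device_train_batch_size=per_device_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
    )


# Assumed numbers: about 1,000,000 examples, batch size 8, 4x accumulation, 1 device.
max_steps = estimate_max_steps(1_000_000, per_device_batch_size=8,
                               gradient_accumulation_steps=4)
print(max_steps)  # 31250
```

The resulting value would then go into the TrainingArguments that you pass to Trainer, for example via build_training_args("out", max_steps, 8, 4). If you cannot estimate the dataset size at all, any positive max_steps that fits your compute budget should satisfy the check. It then simply defines how long training runs.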

sine2pi commented 1 month ago

Since I stripped hf code of all the valueerrors my skin has cleared up and my sleep has improved.

LysandreJik commented 1 month ago

If you remove that ValueError @sine2pi you'll get a much more arcane error message :) What's the issue with this error? It seems pretty indicative of the problem to me, and it tells you how to solve it.
