huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

How to train on shards of bookcorpus + wikipedia + openwebtext on 1 TB disk. #9986

Closed: gaceladri closed this issue 3 years ago

gaceladri commented 3 years ago

🚀 Feature request

Hello, I am trying to pretrain a custom model from scratch on bookcorpus + wikipedia + openwebtext, but I only have a 1 TB disk. I tried merging 20% of each corpus, training on that shard, and then reloading training on the next 20% of each, but I am having issues with the learning rate scheduler. If I hardcode max_steps to the step count of the full dataset (100% of everything concatenated), the trainer makes several passes over the current 20%, which is the same as setting 5 epochs. To do this properly I would have to touch a lot of pieces, like the pure-PyTorch LambdaLR scheduler, to restore the epoch, the current step, and all the other state. It's a bit of a pain!
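Roughly, the setup looks like this (just a sketch, not my exact script: the corpus slices, the BERT stand-in for my custom model, and the step counts are placeholders):

```python
# Sketch of the shard-by-shard setup described above (corpora, model config
# and step counts are placeholders).
from datasets import concatenate_datasets, load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Only ~20% of each corpus fits on the 1 TB disk at once.
bookcorpus = load_dataset("bookcorpus", split="train[:20%]")
wiki = load_dataset("wikipedia", "20200501.en", split="train[:20%]")
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])
openwebtext = load_dataset("openwebtext", split="train[:20%]")
shard = concatenate_datasets([bookcorpus, wiki, openwebtext])

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

shard = shard.map(tokenize, batched=True, remove_columns=["text"])

model = BertForMaskedLM(BertConfig())  # stand-in for the custom linear-attention model

# max_steps is hardcoded to the step count of the FULL dataset, so the LR
# schedule is laid out for 100% of the data; as a side effect the Trainer
# makes several passes over the 20% shard currently on disk.
args = TrainingArguments(
    output_dir="out",
    max_steps=1_000_000,  # placeholder: total steps for 100% of the data
    per_device_train_batch_size=32,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=shard,
    data_collator=DataCollatorForLanguageModeling(tokenizer),
)
# When the next 20% is swapped in, training resumes from the last checkpoint,
# which is where restoring the LambdaLR scheduler state gets painful.
trainer.train(resume_from_checkpoint=True)
```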

Any suggestion?

Motivation

I want to train a linear attention model with some modifications from scratch.

Your contribution

An approach for training medium-sized models on big datasets with regular hardware.

gaceladri commented 3 years ago

Training with constant_warmup would be an option, since it does not decay the learning rate with respect to the dataset size. But I am a bit afraid of ending up with a poorly trained model after 72 hours of training.
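Concretely, something like this is what I mean (a sketch; the step counts are placeholders, and I'm assuming the `lr_scheduler_type` option in `TrainingArguments`):

```python
# Sketch: warmup followed by a flat learning rate, so the schedule does not
# depend on knowing the total dataset size in advance (numbers are placeholders).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    max_steps=1_000_000,                       # placeholder; only bounds training length
    per_device_train_batch_size=32,
    lr_scheduler_type="constant_with_warmup",  # warm up, then keep the LR constant (no decay)
    warmup_steps=10_000,                       # placeholder
)
```

The same schedule could also be built by hand with `get_constant_schedule_with_warmup` and passed to the Trainer via its `optimizers` argument.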

gaceladri commented 3 years ago

Closing this since the new dataset.set_transform() enables lazy loading. Thanks!
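For anyone landing here, the pattern is roughly this (a sketch; the tokenizer, column name, and max_length are placeholders):

```python
# Sketch: on-the-fly tokenization with datasets' set_transform, so the
# tokenized corpus never has to be written to disk.
from datasets import load_dataset
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
dataset = load_dataset("openwebtext", split="train")

def encode(batch):
    # Applied lazily to each batch when examples are accessed.
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset.set_transform(encode)
print(dataset[0])  # tokenized at access time, not ahead of time
```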