Closed gaceladri closed 3 years ago
Training with a constant-warmup schedule would be an option, since it does not decay the learning rate with respect to dataset size. But I am a bit afraid of ending up with a poorly trained model after 72 hours of training.
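For context, the reason a constant-with-warmup schedule sidesteps the sharding problem is that its multiplier never references the total number of training steps, so you don't need the full dataset size up front. A plain-Python sketch of the multiplier that `transformers`' `get_constant_schedule_with_warmup` computes (reproduced here without torch so the math is visible):

```python
def constant_with_warmup(step: int, warmup_steps: int) -> float:
    """LR multiplier: linear warmup from 0 to 1, then constant at 1.

    Note that max_steps never appears, so training in 20% shards
    needs no knowledge of the full concatenated dataset size.
    """
    return min(1.0, step / max(1, warmup_steps))

# During warmup the multiplier ramps linearly...
print(constant_with_warmup(50, 100))      # halfway through warmup
# ...and afterwards it stays at 1.0 forever, regardless of total steps.
print(constant_with_warmup(10_000, 100))
```

The trade-off, as noted above, is that the learning rate never decays, which can leave the final model under-trained compared to a decayed schedule.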
Closed since the new dataset.set_transform() supports lazy loading. Thanks!
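For reference, the mechanism that closed this issue: `datasets.Dataset.set_transform(fn)` applies the function on the fly when rows are accessed, instead of materializing a tokenized copy on disk, so the full corpus never needs to be pre-tokenized. Below is a toy stand-in (not the real library code) that shows the lazy pattern without requiring the `datasets` package; `fake_tokenize` is a placeholder for a real tokenizer call:

```python
class LazyDataset:
    """Toy sketch of the set_transform() idea: raw data is stored as-is,
    and the transform runs per item only at access time."""

    def __init__(self, texts):
        self.texts = texts        # raw, untokenized data
        self.transform = None

    def set_transform(self, fn):
        self.transform = fn       # nothing is computed here

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, i):
        item = {"text": self.texts[i]}
        # applied lazily, per access, like datasets' set_transform
        return self.transform(item) if self.transform else item


def fake_tokenize(item):
    # placeholder for tokenizer(item["text"], truncation=True, ...)
    return {"input_ids": [ord(c) % 100 for c in item["text"]]}


ds = LazyDataset(["hello", "world"])
ds.set_transform(fake_tokenize)
batch = ds[0]   # tokenization happens only now
```

With the real library the equivalent call is `dataset.set_transform(tokenize_fn)` on a `datasets.Dataset`, which keeps only the raw text on disk.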
🚀 Feature request
Hello, I am trying to pretrain a custom model from scratch on BookCorpus + Wikipedia + OpenWebText, but I only have a 1 TB disk. I tried to merge 20% of each dataset and then resume training on another 20% of each, but I am having issues with the learning rate scheduler. If I hardcode max_steps to the total size of the dataset (100% of everything concatenated), the trainer does several passes over the 20% chunk, the same as setting 5 epochs. And I have to deal with lots of details, like LambdaLR (which is pure PyTorch), to set the epoch, the current step and all the states. It's a bit of a pain!
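To make the scheduling problem concrete, here is a plain-Python sketch of the linear warmup-then-decay multiplier that the Trainer builds via LambdaLR. Hardcoding max_steps to the full concatenated corpus (rather than each 20% shard) is what keeps the decay global, but then each shard must resume at the correct global step for the multiplier to line up. The step counts below are illustrative, not from the issue:

```python
def linear_lr_multiplier(step: int, warmup_steps: int, max_steps: int) -> float:
    """Linear warmup to 1.0, then linear decay to 0.0 at max_steps,
    mirroring the LambdaLR-based linear schedule in transformers."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))


# Illustrative numbers: full corpus = 1,000,000 optimizer steps,
# trained as five 200,000-step shards.
MAX_STEPS, WARMUP = 1_000_000, 10_000

# If shard 3 resumes at global step 400,000, the multiplier matches an
# uninterrupted run; if it restarts from step 0, the schedule re-warms
# and the decay is wrong. That global step (plus optimizer/scheduler
# state_dicts) is exactly the state that has to be carried across shards.
resumed_correctly = linear_lr_multiplier(400_000, WARMUP, MAX_STEPS)
restarted_at_zero = linear_lr_multiplier(0, WARMUP, MAX_STEPS)
```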
Any suggestions?
Motivation
I want to train a linear-attention model with some modifications from scratch.
Your contribution
An idea for how to train medium-sized models on big datasets with regular hardware.