What are my options if I want to train for more than 1 epoch? Can I specify something in the yaml, or is the only option to create a bigger dataset with repeated data?
Hello,
Yes, there are some options available.
1. Reduce the `sequence_length` and `micro_batch_size`, e.g. `tokens = TokensArgs(sequence_length=256, train_steps=15, micro_batch_size=2, batch_accumulation_per_replica=1)`. Also, ensure that `max_position_embeddings` equals `sequence_length`.
2. You can also reduce the `dp` (data parallelism) value, e.g. `parallelism = ParallelismArgs(dp=2, pp=2, tp=2, pp_engine="1f1b", tp_mode="REDUCE_SCATTER", tp_linear_async_communication=True)`.
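If you are editing the YAML config rather than the Python args, the corresponding sections would look roughly like this (a sketch based on nanotron's example configs; the exact key names and nesting, such as `model.model_config.max_position_embeddings`, may differ between versions):

```yaml
tokens:
  sequence_length: 256
  train_steps: 15
  micro_batch_size: 2
  batch_accumulation_per_replica: 1

parallelism:
  dp: 2
  pp: 2
  tp: 2
  pp_engine: 1f1b
  tp_mode: REDUCE_SCATTER
  tp_linear_async_communication: true

model:
  model_config:
    max_position_embeddings: 256  # keep this equal to sequence_length
```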
Please feel free to let me know if you have any other questions.
Your proposed solution increases the number of training steps required to consume 1 epoch of data by reducing the global batch size (via a lower `micro_batch_size` and `dp` setting) and by reducing the sequence length. However, it still does not allow training on more than 1 epoch of the data.
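For concreteness, the bookkeeping behind that observation, using the illustrative values suggested above:

```
tokens_per_step = dp * micro_batch_size * batch_accumulation_per_replica * sequence_length
                = 2  * 2                * 1                              * 256
                = 1024
steps_per_epoch ≈ dataset_size_in_tokens / tokens_per_step
```

Shrinking any of those factors only stretches one epoch over more steps; it does not make the data loader start a second pass.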
I misunderstood your question. Could you try repeating the `DatasetStageArgs` under `data_stages`?
Thanks, that does indeed seem to work! It seems like a fairly recent, undocumented addition to the library? I had to upgrade my configs after upgrading the nanotron library to the most recent version.
For anyone wondering, here's the pattern for how you specify multiple data stages in yaml:
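A minimal sketch of that pattern (the nesting and key names follow nanotron's example configs, e.g. `hf_dataset_or_datasets` under a `data.dataset` block, and may differ between versions; the dataset name and step numbers are placeholders):

```yaml
data_stages:
  - name: First pass over the data
    start_training_step: 1
    data:
      dataset:
        hf_dataset_or_datasets: my_dataset        # placeholder dataset name
        hf_dataset_splits: train
        text_column_name: text
      num_loading_workers: 1
      seed: 42
  - name: Second pass over the data
    start_training_step: 1001                     # step at which this stage takes over
    data:
      dataset:
        hf_dataset_or_datasets: my_dataset        # same dataset repeated, per the suggestion above
        hf_dataset_splits: train
        text_column_name: text
      num_loading_workers: 1
      seed: 42
```

Each list entry corresponds to one `DatasetStageArgs`, so repeating the same dataset in a later stage is what effectively gives you another epoch over it.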
Thanks for your comment, we will further improve the documentation!