What are my options if I want to train for more than 1 epoch? Can I specify something in the yaml, or is the only option to create a bigger dataset with repeated data?
Hello,
Yes, there are some options available.
1. Reduce the `sequence_length` and `micro_batch_size`, e.g. `tokens = TokensArgs(sequence_length=256, train_steps=15, micro_batch_size=2, batch_accumulation_per_replica=1)`. Also, ensure that `max_position_embeddings` equals `sequence_length`.
2. You can also reduce the `dp` (data parallelism) value, e.g. `parallelism = ParallelismArgs(dp=2, pp=2, tp=2, pp_engine="1f1b", tp_mode="REDUCE_SCATTER", tp_linear_async_communication=True)`.
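If you are editing the YAML config rather than the Python args, the corresponding sections would look roughly like this (a sketch based on nanotron's example configs; the exact key names and nesting, such as `model.model_config.max_position_embeddings`, may differ between versions):

```yaml
tokens:
  sequence_length: 256
  train_steps: 15
  micro_batch_size: 2
  batch_accumulation_per_replica: 1

parallelism:
  dp: 2
  pp: 2
  tp: 2
  pp_engine: 1f1b
  tp_mode: REDUCE_SCATTER
  tp_linear_async_communication: true

model:
  model_config:
    max_position_embeddings: 256  # keep this equal to sequence_length
```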
Please feel free to let me know if you have any other questions.
Your proposed solution increases the number of training steps required to consume 1 epoch of data by reducing the global batch size (via a lower `micro_batch_size` and `dp` setting) and by reducing the sequence length. However, it still does not allow training on more than 1 epoch of the data.
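For concreteness, the bookkeeping behind that observation, using the illustrative values suggested above:

```
tokens_per_step = dp * micro_batch_size * batch_accumulation_per_replica * sequence_length
                = 2  * 2                * 1                              * 256
                = 1024
steps_per_epoch ≈ dataset_size_in_tokens / tokens_per_step
```

Shrinking any of those factors only stretches one epoch over more steps; it does not make the data loader start a second pass.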
I misunderstood your question. Could you try repeating the `DatasetStageArgs` under `data_stages`?
Thanks, that does indeed seem to work! It seems like a fairly recent, undocumented addition to the library? I had to upgrade my configs after upgrading the nanotron library to the most recent version.
For anyone wondering, here's the pattern for how you specify multiple data stages in yaml:
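A minimal sketch of that pattern (the nesting and key names follow nanotron's example configs, e.g. `hf_dataset_or_datasets` under a `data.dataset` block, and may differ between versions; the dataset name and step numbers are placeholders):

```yaml
data_stages:
  - name: First pass over the data
    start_training_step: 1
    data:
      dataset:
        hf_dataset_or_datasets: my_dataset        # placeholder dataset name
        hf_dataset_splits: train
        text_column_name: text
      num_loading_workers: 1
      seed: 42
  - name: Second pass over the data
    start_training_step: 1001                     # step at which this stage takes over
    data:
      dataset:
        hf_dataset_or_datasets: my_dataset        # same dataset repeated, per the suggestion above
        hf_dataset_splits: train
        text_column_name: text
      num_loading_workers: 1
      seed: 42
```

Each list entry corresponds to one `DatasetStageArgs`, so repeating the same dataset in a later stage is what effectively gives you another epoch over it.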
Thanks for your comment, we will further improve the documentation!