sangmichaelxie / doremi

Pytorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
https://arxiv.org/abs/2305.10429
MIT License
286 stars 32 forks

Training time for baseline model and proxy model #17

Closed · yuzc19 closed 9 months ago

yuzc19 commented 9 months ago

Hi! Thanks for your excellent work. Could you tell me the wall-clock training time when using 8 GPUs in your provided scripts so I can approximate the resources needed? Thank you!

sangmichaelxie commented 9 months ago

For a 120M model on 8 A100s with 200k training steps on The Pile, it takes 36h to train the reference model (step 1) and 44h to run DRO (step 2). If you want to save time, you could run DoReMi for fewer training steps, or run the DRO step for fewer steps (e.g., 50k) and then extrapolate the average domain weight curves forward with a power law.
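The power-law extrapolation mentioned above could be sketched as follows. This is a hypothetical illustration, not code from the DoReMi repo: it fits w_d(t) ≈ a · t^b per domain to the averaged domain-weight curve from a short DRO run (via a linear fit in log-log space), evaluates the fit at the full step count, and renormalizes. All function names, domain names, and numbers are made up for the example.

```python
import numpy as np

def extrapolate_domain_weights(steps, weight_curves, target_step):
    """Fit a power law w(t) = a * t^b to each domain's averaged
    weight curve and extrapolate to target_step.

    steps: (T,) array of recorded training steps (all > 0)
    weight_curves: dict of domain name -> (T,) averaged weight curve
    target_step: step count to extrapolate to
    """
    raw = {}
    for domain, w in weight_curves.items():
        # Linear fit in log-log space: log w = b * log t + log a
        b, log_a = np.polyfit(np.log(steps), np.log(w), 1)
        raw[domain] = np.exp(log_a) * target_step ** b
    # Renormalize so the extrapolated weights form a distribution
    total = sum(raw.values())
    return {d: v / total for d, v in raw.items()}

# Illustrative data from a hypothetical short (50k-step) DRO run
steps = np.array([10_000, 20_000, 30_000, 40_000, 50_000])
curves = {
    "pile_cc": np.array([0.30, 0.27, 0.255, 0.245, 0.24]),   # drifting down
    "github":  np.array([0.20, 0.23, 0.245, 0.255, 0.26]),   # drifting up
}
weights = extrapolate_domain_weights(steps, curves, target_step=200_000)
```

In practice you would fit all Pile domains jointly (so the renormalization covers the full mixture) and sanity-check the fit quality per domain before trusting the extrapolated weights.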