sangmichaelxie / doremi

PyTorch implementation of DoReMi, a method for optimizing the data mixture weights of language modeling datasets
https://arxiv.org/abs/2305.10429
MIT License

How many rounds are needed for the domain weights to converge on The Pile? #15

Open ouyangliqi opened 11 months ago

ouyangliqi commented 11 months ago

Thanks for your awesome work! I noticed that the README points to an optimized-weights config, configs/pile_doremi_r1_120M_ref:pile_baseline_50kvocab_nopack_120M.json. Can we consider these domain weights to be the result of the first round of DoReMi?

Comparing against the results shown in the paper, these optimized weights are far from the ones reported there. For example, the domain weight of Pile-CC is 0.13788709 here, while the paper reports 0.6057. If 0.13788709 is the result of the first round, then the increase in the Pile-CC domain weight over the baseline after one round is about 0.028861896. At that rate, we can estimate that it would take approximately 21 rounds to converge to 0.6057.
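For concreteness, here is the arithmetic behind that estimate as a small Python snippet. The constant-increase-per-round assumption is just the linear extrapolation described above, not anything guaranteed by DoReMi, and all numbers come from this thread rather than from the codebase:

```python
# Linear extrapolation of the Pile-CC weight, assuming a constant increase per round.
round1_weight = 0.13788709        # Pile-CC weight in configs/pile_doremi_r1_120M_ref:... (round 1)
paper_weight = 0.6057             # Pile-CC weight reported in the DoReMi paper
per_round_increase = 0.028861896  # round-1 increase over the baseline weight

estimated_rounds = paper_weight / per_round_increase
print(f"~{estimated_rounds:.0f} rounds under a constant-increase assumption")  # ~21
```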

P.S. Thanks for your reply in this issue: https://github.com/sangmichaelxie/doremi/issues/11. I also want to ask: how many rounds are needed for the domain weights to converge on RedPajama?

Thanks.

sangmichaelxie commented 10 months ago

Yes, you can consider it to be the result of the first round, for a 50k vocab size (the GPT-NeoX tokenizer) and a 120M proxy model. The script for running it is scripts/run_pile.sh. The results in the paper are for a 256k vocab size (a Google-internal tokenizer) and a 280M proxy model; the dynamics turn out to be different, but that is also the result of 1 round of DoReMi. The 50k/120M results are more similar to the paper's 1B proxy model results.
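As a side note for anyone chaining rounds: in iterated DoReMi, the optimized weights from one round become the reference mixture used to train the next round's reference model. A minimal sketch of that hand-off, assuming the config is a flat JSON mapping of domain names to weights (the round-2 output filename is hypothetical):

```python
import json

# Load the round-1 optimized weights published in the repo.
with open("configs/pile_doremi_r1_120M_ref:pile_baseline_50kvocab_nopack_120M.json") as f:
    round1_weights = json.load(f)  # assumed layout: {domain_name: weight}

# Renormalize defensively (rounding can leave the sum slightly off 1.0) and save
# as the reference mixture for training the round-2 reference model.
total = sum(round1_weights.values())
round2_reference = {domain: weight / total for domain, weight in round1_weights.items()}

with open("configs/pile_doremi_r2_reference.json", "w") as f:  # hypothetical filename
    json.dump(round2_reference, f, indent=2)
```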

We only tried 2 rounds, starting from uniform domain weights, on RedPajama. This paper (https://arxiv.org/abs/2310.06694) uses a variant of DoReMi on RedPajama as well, with a similar resulting data balance (C4 gets the highest weight).
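For reference, the uniform starting point mentioned above would look like this for the seven standard RedPajama sources (a sketch only; the exact domain names and splits used in the repo may differ):

```python
# Uniform initial domain weights over the seven standard RedPajama sources.
redpajama_domains = [
    "common_crawl", "c4", "github", "books", "arxiv", "wikipedia", "stackexchange",
]
uniform_weights = {d: 1.0 / len(redpajama_domains) for d in redpajama_domains}
print(uniform_weights)  # each domain starts at ~0.1429 before round 1
```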