sangmichaelxie / doremi

PyTorch implementation of DoReMi, a method for optimizing the data mixture weights of language modeling datasets
https://arxiv.org/abs/2305.10429
MIT License

How many rounds are needed for the domain weights to converge on The Pile? #15

Open ouyangliqi opened 11 months ago

ouyangliqi commented 11 months ago

Thanks for your awesome work! I noticed that the README points to an optimized-weights config, configs/pile_doremi_r1_120M_ref:pile_baseline_50kvocab_nopack_120M.json. Can we consider these domain weights to be the result of the first round of DoReMi?

Comparing against the results shown in the paper, these optimized weights are far from the ones reported there. For example, the domain weight of Pile-CC is 0.13788709 here, while the paper reports 0.6057. If 0.13788709 is the result of the first round, then the increase in the Pile-CC domain weight over the baseline after one round is about 0.028861896. At that rate, we can estimate that it would take approximately 21 rounds to converge to 0.6057.
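For concreteness, here is the arithmetic behind that estimate as a small Python snippet. The constant-increase-per-round assumption is just the linear extrapolation described above, not anything guaranteed by DoReMi, and all numbers come from this thread rather than from the codebase:

```python
# Linear extrapolation of the Pile-CC weight, assuming a constant increase per round.
round1_weight = 0.13788709        # Pile-CC weight in configs/pile_doremi_r1_120M_ref:... (round 1)
paper_weight = 0.6057             # Pile-CC weight reported in the DoReMi paper
per_round_increase = 0.028861896  # round-1 increase over the baseline weight

estimated_rounds = paper_weight / per_round_increase
print(f"~{estimated_rounds:.0f} rounds under a constant-increase assumption")  # ~21
```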

P.S. Thanks for your reply in this issue: https://github.com/sangmichaelxie/doremi/issues/11. I also want to ask: how many rounds are needed for the domain weights to converge on RedPajama?

Thanks.

sangmichaelxie commented 10 months ago

Yes, you can consider it to be the result of the first round, for a 50k vocab size (the GPT-NeoX tokenizer) and a 120M proxy model. The script for running it is scripts/run_pile.sh. The results in the paper are for a 256k vocab size (a Google-internal tokenizer) and a 280M proxy model; the dynamics turn out to be different, but that is also the result of 1 round of DoReMi. The 50k/120M results are more similar to the paper's 1B proxy model results.
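As a side note for anyone chaining rounds: in iterated DoReMi, the optimized weights from one round become the reference mixture used to train the next round's reference model. A minimal sketch of that hand-off, assuming the config is a flat JSON mapping of domain names to weights (the round-2 output filename is hypothetical):

```python
import json

# Load the round-1 optimized weights published in the repo.
with open("configs/pile_doremi_r1_120M_ref:pile_baseline_50kvocab_nopack_120M.json") as f:
    round1_weights = json.load(f)  # assumed layout: {domain_name: weight}

# Renormalize defensively (rounding can leave the sum slightly off 1.0) and save
# as the reference mixture for training the round-2 reference model.
total = sum(round1_weights.values())
round2_reference = {domain: weight / total for domain, weight in round1_weights.items()}

with open("configs/pile_doremi_r2_reference.json", "w") as f:  # hypothetical filename
    json.dump(round2_reference, f, indent=2)
```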

We only tried 2 rounds, starting from uniform domain weights, on RedPajama. This paper (https://arxiv.org/abs/2310.06694) uses a variant of DoReMi on RedPajama as well, with a similar resulting data balance (C4 gets the highest weight).
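For reference, the uniform starting point mentioned above would look like this for the seven standard RedPajama sources (a sketch only; the exact domain names and splits used in the repo may differ):

```python
# Uniform initial domain weights over the seven standard RedPajama sources.
redpajama_domains = [
    "common_crawl", "c4", "github", "books", "arxiv", "wikipedia", "stackexchange",
]
uniform_weights = {d: 1.0 / len(redpajama_domains) for d in redpajama_domains}
print(uniform_weights)  # each domain starts at ~0.1429 before round 1
```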