sangmichaelxie / doremi

PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
https://arxiv.org/abs/2305.10429
MIT License
286 stars 32 forks

Should reference model initialize weights uniformly? #11

Closed: ouyangliqi closed this issue 11 months ago

ouyangliqi commented 11 months ago

Thanks for your awesome work! I noticed that the domain weights of the baseline model for RedPajama are not initialized uniformly. Does this mean that the reference model can be initialized with any domain weights?

https://github.com/sangmichaelxie/doremi/blob/main/configs/rp_baseline.json

sangmichaelxie commented 11 months ago

Yes, the reference model can be trained with any set of domain weights (it's a hyperparameter). A reasonable choice could be to use domain weights that are computed according to the size of the domains (which is done here for RedPajama). If you start from uniform weights, we find that we often need 2 rounds of iterated DoReMi because the reference model trained on uniform weights isn't very good.
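For concreteness, here is a minimal Python sketch (my own illustration, not code from this repo) of computing size-proportional reference domain weights. The token counts are approximate published figures for RedPajama-1T, in billions, and are only illustrative:

```python
# Minimal sketch (not from this repo): size-proportional reference domain weights.
# Token counts are approximate RedPajama-1T figures, in billions; illustrative only.

def size_proportional_weights(token_counts):
    """Normalize per-domain token counts into a probability distribution."""
    total = sum(token_counts.values())
    return {domain: count / total for domain, count in token_counts.items()}

rp_token_counts_billions = {
    "common_crawl": 878, "c4": 175, "github": 59, "book": 26,
    "arxiv": 28, "wikipedia": 24, "stackexchange": 20,
}
print(size_proportional_weights(rp_token_counts_billions))
# e.g. common_crawl ~0.726, c4 ~0.145, ...
```

Size-proportional weights correspond to sampling tokens roughly uniformly from the concatenated corpus, so the resulting reference model should be about as good as one trained the standard way, which is presumably why it makes a better starting point than uniform domain weights.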

ouyangliqi commented 11 months ago

Thank you so much for your reply.

ouyangliqi commented 11 months ago

Thank you for your prompt reply! I have one more question. I'm using my own dataset, which is also RedPajama but organized differently. Even after the third round of iterated DoReMi, I've noticed that the domain weights still haven't converged. What might affect the number of rounds I need?

P.S. I'm using the same training settings as in run_rp_doremi280M.sh, except that I had to lower per_device_train_batch_size to 20 due to hardware limitations. Here are my results for each round:

| Domain | Round 1 | Round 2 | Round 3 |
| --- | --- | --- | --- |
| common_crawl | 0.661 | 0.580 | 0.592 |
| c4 | 0.187 | 0.162 | 0.186 |
| github | 0.024 | 0.052 | 0.040 |
| wikipedia | 0.062 | 0.088 | 0.086 |
| book | 0.037 | 0.036 | 0.042 |
| arxiv | 0.013 | 0.040 | 0.025 |
| stackexchange | 0.017 | 0.043 | 0.028 |

sangmichaelxie commented 11 months ago

In my experience running RedPajama (albeit with an older version of the code), the weight for common_crawl should typically go down while c4 goes up (even if common_crawl starts out very high). When I've run iterated DoReMi from uniform weights (for 2 rounds), c4 becomes the dominant domain. You might find that you need more rounds, particularly if the initial reference domain weights are far from the converged values.
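As one concrete way to operationalize "converged", here is a small sketch (my own illustration, not part of the repo) that flags convergence when no domain weight moves by more than a tolerance between consecutive rounds. The numbers are the Round 2 and Round 3 weights from the table above:

```python
# Hedged sketch (not from this repo): a simple convergence check for
# iterated DoReMi, using the L-infinity distance between successive rounds.

def weights_converged(prev, curr, tol=0.01):
    """True if no domain weight moved by more than `tol` between rounds."""
    return all(abs(curr[d] - prev[d]) <= tol for d in curr)

round2 = {"common_crawl": 0.580, "c4": 0.162, "github": 0.052,
          "wikipedia": 0.088, "book": 0.036, "arxiv": 0.040,
          "stackexchange": 0.043}
round3 = {"common_crawl": 0.592, "c4": 0.186, "github": 0.040,
          "wikipedia": 0.086, "book": 0.042, "arxiv": 0.025,
          "stackexchange": 0.028}

# False: common_crawl, c4, and arxiv each moved by more than 0.01.
print(weights_converged(round2, round3))
```

The tolerance of 0.01 is an arbitrary choice here; a looser tolerance would declare these runs converged after Round 3.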
