sangmichaelxie / doremi

PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
https://arxiv.org/abs/2305.10429
MIT License

Question about optimized weights in the paper #18

Closed yuzc19 closed 8 months ago

yuzc19 commented 8 months ago

Hi! I tried to directly train the main model using the optimized weights from the paper (pile_doremi_280M_256kvocab_paper.json), and I got significantly worse results than the baseline model (for example, at the 20k checkpoint, this model only reaches 4.8 accuracy while the baseline reaches 5.6). Is this because these weights were found with the 280M proxy model? Shouldn't they be generalizable across models?

BTW, I just noticed that pile_doremi_280M_256kvocab_paper.json and pile_doremi_r1_120M_ref:pile_baseline_50kvocab_nopack_120M.json have totally different trends in their weights. Does this suggest that there may be many optimal weight combinations for the whole dataset? Thank you so much.
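
For reference, a minimal sketch for comparing the two weight files (assumptions, not verified against the repo: each JSON is either a flat {domain: weight} dict or nests the weights under a "train_domain_weights" key, and the configs/ path prefix is illustrative):

```python
import json

def load_domain_weights(path):
    # Assumption: the config is a flat {domain: weight} dict, or nests the
    # weights under "train_domain_weights"; fall back to the flat layout.
    with open(path) as f:
        cfg = json.load(f)
    return cfg.get("train_domain_weights", cfg)

def top_domains(path, k=5):
    # Return the k most heavily weighted domains for a quick trend comparison.
    weights = load_domain_weights(path)
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]

for path in [
    "configs/pile_doremi_280M_256kvocab_paper.json",
    "configs/pile_doremi_r1_120M_ref:pile_baseline_50kvocab_nopack_120M.json",
]:
    print(path)
    for domain, weight in top_domains(path):
        print(f"  {domain}: {weight:.4f}")
```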

sangmichaelxie commented 8 months ago

This is mainly because the models in the paper used a Google-internal tokenizer with a 256k vocab size, and the weights don't seem to transfer as well across tokenizers (changing the tokenizer changes the data significantly).

I think we should expect some frontier of optimal data mixtures. One example: the two kinds of data mixtures found in the paper (from different proxy model sizes) tend to upweight either Pile-CC or OpenWebText2 the most, suggesting that those two domains are somewhat interchangeable.

yuzc19 commented 8 months ago

Thank you for the insightful response. I totally agree that the tokenizer makes a difference.

Another issue I observed: judging from Figure 7, the 760M model does not seem to work well as the proxy model (DoReMi is below the baseline in this setting). So how can we identify a relatively good proxy model (this ability doesn't seem to simply track model size) besides trial and error?

sangmichaelxie commented 8 months ago

That's a good question - for now we just recommend using a proxy model in the 120M to 280M range. The reason seems to be that the DRO training isn't as effective for larger proxy models, and one hypothesis is that this is because DoReMi rescales the loss of examples from different domains (which is different from actually sampling a domain more or less often). There's a variant of DoReMi in the Sheared LLaMA paper (https://arxiv.org/abs/2310.06694) that resamples domains instead of rescaling the loss, and that seems to work well for larger models.
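
To illustrate the distinction, here is a rough sketch (not the actual code in either repo; the tensor names and helpers are made up for illustration):

```python
import torch

def rescaled_loss(per_example_loss, domain_ids, domain_weights):
    # DoReMi-style: the data is sampled as usual, but each example's loss is
    # scaled by its domain's weight before averaging.
    w = domain_weights[domain_ids]                      # (batch,) weight per example
    return (w * per_example_loss).sum() / w.sum()

def sample_domain_ids(domain_weights, batch_size, generator=None):
    # Resampling variant: draw each example's domain in proportion to the
    # weights, then train with a plain (unweighted) loss on that batch.
    return torch.multinomial(domain_weights, batch_size,
                             replacement=True, generator=generator)
```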

yuzc19 commented 8 months ago

Thanks for your insightful response! Problem solved!