sangmichaelxie / doremi

PyTorch implementation of DoReMi, a method for optimizing data mixture weights for language modeling datasets
https://arxiv.org/abs/2305.10429
MIT License
286 stars 32 forks

Questions about directly applying the weights from the paper or the repo to train the main model #23

Open clarkkent0618 opened 7 months ago

clarkkent0618 commented 7 months ago

Thanks for your solid work first! I am wondering whether the optimized domain weights mainly depend on the tokenizer. If I use the same tokenizer and the domain weights released in this repo to train a main model, but with some differences in other training configs, such as training steps, learning rate, and global batch size, can this work? Or must the training procedure be exactly the same as for the proxy and reference models? @sangmichaelxie

sangmichaelxie commented 7 months ago

Good question, the evidence so far suggests that the tokenizer is the biggest thing that changes results (since it changes the data itself). I expect some degradation if the hyperparameters of the main model and proxy model are more different, but I expect it to be fairly robust to these hyperparameter changes. For example, because the average domain weights during proxy training are pretty stable after a certain number of steps, I would expect the average to be similar if you trained the proxy model for longer. For most open models with tokenizers similar to GPT2/GPT-NeoX, I'd suggest the weights in the repo https://github.com/sangmichaelxie/doremi/blob/main/configs/pile_doremi_r1_120M_ref%3Apile_baseline_50kvocab_nopack_120M.json
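For reference, here is a minimal sketch of how the released weights could be consumed when training a separate main model. It assumes the linked JSON simply maps domain names to weights; the local filename and batch size below are placeholders, and the per-domain data iterators are omitted.

```python
import json

import torch

# Hypothetical local copy of the released domain-weight config
# (the real file lives under configs/ in the repo).
WEIGHTS_PATH = "pile_doremi_weights.json"

# Assumption: the JSON maps domain name -> weight (roughly summing to 1).
with open(WEIGHTS_PATH) as f:
    domain_weights = json.load(f)

domains = list(domain_weights.keys())
weights = torch.tensor([domain_weights[d] for d in domains], dtype=torch.float)
weights = weights / weights.sum()  # renormalize in case of rounding

# Sample which domain each example in a batch comes from, then pull examples
# from per-domain iterators (not shown) to assemble the mixed training batch.
batch_size = 512
domain_ids = torch.multinomial(weights, num_samples=batch_size, replacement=True)
batch_domains = [domains[i] for i in domain_ids.tolist()]
```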

clarkkent0618 commented 7 months ago

Thanks for your reply! I will try some different configs myself and attempt to verify the robustness of the weights.