sangmichaelxie / doremi

PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
https://arxiv.org/abs/2305.10429
MIT License
286 stars 32 forks

Question about model initialization #30

Open MAxx8371 opened 3 months ago

MAxx8371 commented 3 months ago

Do the reference model, proxy model, and main model have to be initialized with the same method? When continuing pretraining of Llama 2 with DoReMi, the weights of the main model are initialized from the Meta checkpoint. For the reference and proxy models, however, no such checkpoints exist, so these models are initialized with other methods (e.g., Xavier initialization). In this scenario, will the domain weights from the proxy model still improve the performance of the main model?
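
For context, here is a minimal sketch of the domain-weight update as described in the DoReMi paper (arXiv:2305.10429), written in plain Python. It is not the repository's code: the function name `update_domain_weights`, the `step_size` and `smoothing` values, and the example losses are all illustrative assumptions.

```python
import math


def update_domain_weights(weights, proxy_losses, ref_losses,
                          step_size=1.0, smoothing=1e-3):
    """One illustrative domain-weight update (hypothetical helper).

    weights:      current domain weights, summing to 1
    proxy_losses: per-domain loss of the proxy model
    ref_losses:   per-domain loss of the reference model
    """
    # Excess loss per domain: proxy loss minus reference loss, clipped at zero.
    excess = [max(p - r, 0.0) for p, r in zip(proxy_losses, ref_losses)]

    # Exponentiated-gradient step: domains with larger excess loss gain weight.
    unnormalized = [w * math.exp(step_size * e) for w, e in zip(weights, excess)]
    total = sum(unnormalized)
    normalized = [u / total for u in unnormalized]

    # Mix with the uniform distribution so no domain weight collapses to zero.
    num_domains = len(weights)
    return [(1.0 - smoothing) * n + smoothing / num_domains for n in normalized]


# Made-up losses for two domains: the proxy lags the reference on domain 0 only.
weights = update_domain_weights(
    [0.5, 0.5], proxy_losses=[3.0, 2.0], ref_losses=[2.5, 2.0]
)
```

In this sketch, only the proxy and reference losses enter the update, which is why the question focuses on how those two models are initialized relative to the main model.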