sangmichaelxie / doremi

PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets.
https://arxiv.org/abs/2305.10429
MIT License

Question about 8B model architecture #28

Closed Qinghao-Hu closed 1 month ago

Qinghao-Hu commented 5 months ago

Thanks for your interesting work!

Could you please provide the configuration of the 8B model?

How do you generate the proxy model? Is it derived from the 8B model, or is it just a smaller model with an arbitrary configuration (e.g., number of layers)?

sangmichaelxie commented 1 month ago

Please see Table 9 here: https://arxiv.org/pdf/2305.10429

The proxy model is just a smaller model (fewer layers, heads, and hidden dims); it isn't derived from the 8B model.
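
To illustrate the relationship, a proxy config is simply a scaled-down version of the main model's config in the same architecture family. The numbers and the `make_proxy_config` helper below are hypothetical placeholders, not the values used in the paper (those are in Table 9):

```python
# Hypothetical transformer config for illustration only; the actual
# layer/head/hidden sizes for the 8B and proxy models are in Table 9
# of the DoReMi paper (https://arxiv.org/pdf/2305.10429).
main_config = {"num_layers": 32, "num_heads": 32, "hidden_dim": 4096}

def make_proxy_config(cfg, scale=4):
    # The proxy keeps the same architecture family but shrinks depth
    # and width; it is trained from scratch, not cut down from the
    # large model's weights.
    return {k: v // scale for k, v in cfg.items()}

proxy_config = make_proxy_config(main_config)
print(proxy_config)  # {'num_layers': 8, 'num_heads': 8, 'hidden_dim': 1024}
```

The proxy only needs to be large enough that its per-domain losses are informative for reweighting; the final model is then trained on the optimized mixture.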