sangmichaelxie / doremi

PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets.
https://arxiv.org/abs/2305.10429
MIT License

Question about 8B model architecture #28

Closed Qinghao-Hu closed 1 month ago

Qinghao-Hu commented 5 months ago

Thanks for your interesting work!

Could you please provide the configuration of the 8B model?

How do you generate the proxy model? Is it derived from the 8B model, or is it just a smaller model with an arbitrary configuration (e.g., number of layers)?

sangmichaelxie commented 1 month ago

Please see Table 9 here: https://arxiv.org/pdf/2305.10429

The proxy model is just a smaller model (fewer layers, heads, and hidden dims); it isn't derived from the 8B model.
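
To illustrate the relationship, a proxy config is simply a scaled-down version of the main model's config in the same architecture family. The numbers and the `make_proxy_config` helper below are hypothetical placeholders, not the values used in the paper (those are in Table 9):

```python
# Hypothetical transformer config for illustration only; the actual
# layer/head/hidden sizes for the 8B and proxy models are in Table 9
# of the DoReMi paper (https://arxiv.org/pdf/2305.10429).
main_config = {"num_layers": 32, "num_heads": 32, "hidden_dim": 4096}

def make_proxy_config(cfg, scale=4):
    # The proxy keeps the same architecture family but shrinks depth
    # and width; it is trained from scratch, not cut down from the
    # large model's weights.
    return {k: v // scale for k, v in cfg.items()}

proxy_config = make_proxy_config(main_config)
print(proxy_config)  # {'num_layers': 8, 'num_heads': 8, 'hidden_dim': 1024}
```

The proxy only needs to be large enough that its per-domain losses are informative for reweighting; the final model is then trained on the optimized mixture.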