sangmichaelxie / doremi

Pytorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
https://arxiv.org/abs/2305.10429
MIT License
286 stars 32 forks

Edge Case Discussion #21

Closed thangld201 closed 7 months ago

thangld201 commented 8 months ago

Thanks for the wonderful work! I have a question about Equation 1 in the paper. If the proxy model's parameters become (or are initialized) the same as the reference model's, then training would converge and the learned domain weights might not be meaningful (since they no longer matter, as the excess loss is 0). Can you clarify @sangmichaelxie ?

sangmichaelxie commented 8 months ago

The reference model doesn't affect the proxy model's objective (its loss enters only as a constant), so the question is how this situation affects the domain weights \alpha. In our implementation, we update \alpha as \alpha_new \propto \alpha \exp(\eta \max(excess loss, 0)), so when the excess loss is 0 we simply stop updating \alpha (and continue training the proxy model with this \alpha). We output the time-averaged \alpha, so the average slowly moves toward this final value. Intuitively, this value of \alpha allowed us to quickly reduce the excess loss, so we should just continue with it for the rest of the remaining compute budget.
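
For concreteness, here is a minimal sketch of the update rule described above, not the repo's actual code; the function and variable names (`update_domain_weights`, `proxy_loss`, `ref_loss`, `eta`) are illustrative:

```python
import torch

def update_domain_weights(alpha, proxy_loss, ref_loss, eta=1.0):
    # Clipped per-domain excess loss: zero wherever the proxy already matches the reference.
    excess = torch.clamp(proxy_loss - ref_loss, min=0.0)
    # alpha_new ∝ alpha * exp(eta * excess); excess == 0 leaves alpha unchanged.
    log_alpha = torch.log(alpha) + eta * excess
    return torch.softmax(log_alpha, dim=0)

# Toy usage: when the proxy matches the reference, excess loss is 0 on every
# domain, so alpha stops moving and the time-averaged alpha drifts toward it.
alpha = torch.tensor([0.5, 0.3, 0.2])
running_sum, steps = torch.zeros_like(alpha), 0
for _ in range(100):
    proxy_loss = ref_loss = torch.zeros(3)  # proxy == reference => zero excess loss
    alpha = update_domain_weights(alpha, proxy_loss, ref_loss)
    running_sum, steps = running_sum + alpha, steps + 1
print(running_sum / steps)  # time-averaged alpha, converging to the fixed alpha
```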