sangmichaelxie / doremi

PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
https://arxiv.org/abs/2305.10429
MIT License

How do you get the model to be good at code if it downsamples code? #13

Open teknium1 opened 11 months ago

teknium1 commented 11 months ago

The question is in the title.

sangmichaelxie commented 11 months ago

The algorithm tends to downsample code because code typically has lower log perplexity overall (many tokens are predictable due to syntax, etc.), so its excess losses tend to be smaller; GitHub data can also vary widely in quality. Code ability can be learned from many sources, including the web and StackExchange.

However, if you have a prior that some code domain is highly important, you can set the reference domain weight for code higher. This decreases the reference model's perplexity on code examples, which increases the excess loss on code and therefore the weight DoReMi assigns to it.
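To see why a low reference loss on code leads to downsampling, here is a minimal sketch of DoReMi's domain-weight update (Algorithm 1 in the paper): the per-domain excess loss is clipped at zero, so a domain the reference model already predicts well contributes almost nothing to the multiplicative update. All tensor values and hyperparameters below are illustrative, not taken from this repo's code.

```python
import torch

def update_domain_weights(alpha, proxy_loss, ref_loss, eta=1.0, smoothing=1e-3):
    """One exponentiated-gradient step on the domain weights.

    alpha:      current domain weights, shape (num_domains,), sums to 1
    proxy_loss: per-domain average loss of the proxy model
    ref_loss:   per-domain average loss of the reference model
    """
    # Excess loss: how much worse the proxy is than the reference,
    # clipped at zero so "already easy" domains contribute nothing.
    excess = torch.clamp(proxy_loss - ref_loss, min=0.0)

    # Multiplicative update in log space, then renormalize:
    # softmax(log alpha + eta * excess) == alpha * exp(eta * excess) / Z.
    alpha = torch.softmax(torch.log(alpha) + eta * excess, dim=0)

    # Smooth with the uniform distribution for stability.
    uniform = torch.full_like(alpha, 1.0 / alpha.numel())
    return (1 - smoothing) * alpha + smoothing * uniform

# Example: a "code" domain whose reference loss is already low
# (predictable syntax) gets a small excess loss, so its weight
# shrinks relative to the other domains.
alpha = torch.tensor([1 / 3, 1 / 3, 1 / 3])   # [web, books, code]
proxy_loss = torch.tensor([3.1, 3.4, 1.6])
ref_loss = torch.tensor([2.8, 3.0, 1.5])      # code is already near-solved
print(update_domain_weights(alpha, proxy_loss, ref_loss))
```

In this sketch, raising the reference weight for code would mean training the reference model on more code, driving `ref_loss` for the code domain down and `excess` for code up, so the update grows the code weight instead of shrinking it.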