sangmichaelxie / doremi

PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
https://arxiv.org/abs/2305.10429
MIT License

How do you get the model to be good at code if it downsamples code? #13

Open teknium1 opened 11 months ago

teknium1 commented 11 months ago

The question is in the title.

sangmichaelxie commented 11 months ago

The algorithm tends to downsample code because code typically has lower log perplexity overall (many tokens are predictable due to syntax, etc.), so its excess losses tend to be smaller; GitHub data can also vary widely in quality. Code ability can be learned from many sources, including the web and StackExchange.

However, if you have a prior that some code domain is highly important, you can set the reference domain weight for code higher. This decreases the reference model's perplexity on code examples, which increases the excess loss on code and therefore the weight DoReMi assigns to it.
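To see why a low reference loss on code leads to downsampling, here is a minimal sketch of DoReMi's domain-weight update (Algorithm 1 in the paper): the per-domain excess loss is clipped at zero, so a domain the reference model already predicts well contributes almost nothing to the multiplicative update. All tensor values and hyperparameters below are illustrative, not taken from this repo's code.

```python
import torch

def update_domain_weights(alpha, proxy_loss, ref_loss, eta=1.0, smoothing=1e-3):
    """One exponentiated-gradient step on the domain weights.

    alpha:      current domain weights, shape (num_domains,), sums to 1
    proxy_loss: per-domain average loss of the proxy model
    ref_loss:   per-domain average loss of the reference model
    """
    # Excess loss: how much worse the proxy is than the reference,
    # clipped at zero so "already easy" domains contribute nothing.
    excess = torch.clamp(proxy_loss - ref_loss, min=0.0)

    # Multiplicative update in log space, then renormalize:
    # softmax(log alpha + eta * excess) == alpha * exp(eta * excess) / Z.
    alpha = torch.softmax(torch.log(alpha) + eta * excess, dim=0)

    # Smooth with the uniform distribution for stability.
    uniform = torch.full_like(alpha, 1.0 / alpha.numel())
    return (1 - smoothing) * alpha + smoothing * uniform

# Example: a "code" domain whose reference loss is already low
# (predictable syntax) gets a small excess loss, so its weight
# shrinks relative to the other domains.
alpha = torch.tensor([1 / 3, 1 / 3, 1 / 3])   # [web, books, code]
proxy_loss = torch.tensor([3.1, 3.4, 1.6])
ref_loss = torch.tensor([2.8, 3.0, 1.5])      # code is already near-solved
print(update_domain_weights(alpha, proxy_loss, ref_loss))
```

In this sketch, raising the reference weight for code would mean training the reference model on more code, driving `ref_loss` for the code domain down and `excess` for code up, so the update grows the code weight instead of shrinking it.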