linux-leo opened 3 months ago
Can you post some eval results against FineWeb-Edu?
I haven't run experiments and have never trained a model with this codebase myself, but I will do so if I ever get around to it.
Note that the large majority of SmolLM's training data is fineweb-edu, augmented only with synthetic data from cosmopedia-v2 and coding data from python-edu. Since both of these sources are small compared to the fineweb-edu portion, in my opinion they should have almost no negative impact on any benchmark compared to pure fineweb-edu models, while possibly achieving higher scores on more academic questions and reasoning tasks.
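To put rough numbers on that proportion argument: the smollm-corpus dataset card lists token counts of roughly 220B for fineweb-edu-dedup, 28B for cosmopedia-v2, and 4B for python-edu (figures quoted from memory, so treat them as approximate). A quick sketch of the resulting mixture shares:

```python
# Approximate token counts (in billions) as I recall them from the
# smollm-corpus dataset card; they may be slightly off.
token_counts = {
    "fineweb-edu-dedup": 220,
    "cosmopedia-v2": 28,
    "python-edu": 4,
}

total = sum(token_counts.values())
shares = {name: count / total for name, count in token_counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With those numbers the mix comes out to about 87.3% fineweb-edu-dedup, 11.1% cosmopedia-v2, and 1.6% python-edu, which is why I'd expect the augmentation to barely move general benchmarks.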
This is not a one-to-one comparison, but it is from the official blog post announcing SmolLM (note the comparison to Karpathy's GPT):
https://huggingface.co/blog/smollm
Note: I don't know which checkpoint they are comparing against, but assuming it is the longest-trained one, SmolLM was still trained on more than twice the amount of tokens. Still, I don't think that by itself explains some of the improvements, especially when taking model saturation into account.
From my understanding we are always trying to use the best available dataset, which is why I'm suggesting the one from the new Hugging Face SmolLM: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus