karpathy / llm.c

LLM training in simple, raw C/CUDA

Suggestion: Use smollm corpus #695

Open linux-leo opened 3 months ago

linux-leo commented 3 months ago

From my understanding, we're always trying to use the best dataset available, so I'm suggesting the corpus from the new Hugging Face SmolLM: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
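If anyone wants to try it out, here's a rough data-prep sketch in the style of dev/data/fineweb.py. It's untested: the subset name "fineweb-edu-dedup" and the reuse of the write_datafile helper from dev/data/data_common.py are my assumptions and would need to be checked against the repo and the dataset card.

```python
# sketch: tokenize a smollm-corpus subset into .bin shards for llm.c,
# loosely modeled on dev/data/fineweb.py (untested, assumptions noted)
import numpy as np
import tiktoken
from datasets import load_dataset
from data_common import write_datafile  # assumed helper from llm.c's dev/data/

enc = tiktoken.get_encoding("gpt2")
eot = enc._special_tokens['<|endoftext|>']  # document delimiter token

def tokenize(doc):
    # prepend the end-of-text token so documents are delimited, as fineweb.py does
    tokens = [eot] + enc.encode_ordinary(doc["text"])
    return np.array(tokens, dtype=np.uint16)  # GPT-2 vocab fits in uint16

# stream the main subset so nothing has to fit in RAM up front
# (subset name is an assumption; check the dataset card)
ds = load_dataset("HuggingFaceTB/smollm-corpus", "fineweb-edu-dedup",
                  split="train", streaming=True)

shard_size = 10**8  # ~100M tokens per shard, same order as fineweb.py
buf, count, shard_index = [], 0, 0
for doc in ds:
    toks = tokenize(doc)
    buf.append(toks)
    count += len(toks)
    if count >= shard_size:
        # fineweb.py splits shards at exact boundaries; this sketch keeps it simple
        write_datafile(f"smollm_train_{shard_index:06d}.bin", np.concatenate(buf))
        buf, count, shard_index = [], 0, shard_index + 1

if buf:  # flush whatever is left into a final shard
    write_datafile(f"smollm_train_{shard_index:06d}.bin", np.concatenate(buf))
```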

gordicaleksa commented 3 months ago

Can you post some (eval) results against FineWeb-Edu?

linux-leo commented 3 months ago

I haven't run any experiments and have never trained a model with this codebase myself, but I will if I ever get around to it.

Note that the large majority of the SmolLM corpus is FineWeb-Edu, augmented only with synthetic data from Cosmopedia v2 and coding data from Python-Edu. Since both of those sources are small compared to the FineWeb-Edu portion, in my opinion they should have almost no negative impact on any benchmarks compared to pure FineWeb-Edu models, and might achieve higher scores on more academic questions and reasoning tasks.
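If someone wants to roughly reproduce that mixture, one way is to stream the subsets and interleave them proportionally. Again, just an untested sketch: the subset names and the 0.9/0.1 weighting are my assumptions from the dataset card, and python-edu is left out because, as far as I know, it ships file references rather than raw text and would need separate handling.

```python
# sketch: stream and proportionally interleave two smollm-corpus subsets
# (subset names, column names, and mixing weights are assumptions to verify)
from datasets import load_dataset, interleave_datasets

fineweb_edu = load_dataset("HuggingFaceTB/smollm-corpus", "fineweb-edu-dedup",
                           split="train", streaming=True)
cosmopedia = load_dataset("HuggingFaceTB/smollm-corpus", "cosmopedia-v2",
                          split="train", streaming=True)

# weight roughly by relative size so the synthetic data stays a small fraction;
# 0.9/0.1 is a placeholder, adjust to the actual token counts on the dataset card
mixed = interleave_datasets(
    [fineweb_edu, cosmopedia],
    probabilities=[0.9, 0.1],
    seed=42,
)

# quick sanity check on a few documents from the mixed stream
for i, doc in enumerate(mixed):
    print(doc["text"][:80].replace("\n", " "))  # "text" column assumed for both subsets
    if i >= 3:
        break
```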

linux-leo commented 2 months ago

This is not a one-to-one comparison, but it is from the official blog post announcing SmolLM (note the comparison to Karpathy's GPT):

[image: benchmark comparison table from the SmolLM blog post]

https://huggingface.co/blog/smollm

Note: I don't know which checkpoint they are comparing against, but even assuming it's the longest-trained one, SmolLM was still trained on more than twice as many tokens. Still, I don't think that alone explains some of the improvements, especially when taking model saturation into account.