openlm-research / open_llama

OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset
Apache License 2.0
7.29k stars 372 forks

Have you seen the new SlimPajama dataset? #45

Open xzuyn opened 1 year ago

xzuyn commented 1 year ago

Cerebras Blog Post

Hugging Face Dataset

It's RedPajama, but deduplicated down to 627B tokens using MinHashLSH. It may be worth using instead, since training on it should be more compute-efficient.
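For anyone unfamiliar with the technique, here is a rough, stdlib-only sketch of how MinHash LSH deduplication works in general. It is not the SlimPajama pipeline: the function names (`shingles`, `minhash`, `dedup`), the character 5-gram shingles, and the values of `NUM_PERM`, `BANDS`, and `THRESHOLD` are all illustrative assumptions.

```python
import hashlib
from collections import defaultdict

# Illustrative settings (assumptions, not SlimPajama's actual configuration).
NUM_PERM = 64              # number of hash functions per signature
BANDS = 16                 # LSH bands
ROWS = NUM_PERM // BANDS   # rows per band (4)
THRESHOLD = 0.8            # estimated Jaccard similarity treated as "duplicate"


def shingles(text, n=5):
    """Return the set of character n-grams of whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def _hash(seed, token):
    """Seeded 64-bit hash; each seed stands in for one random permutation."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text):
    """MinHash signature: the minimum hash of the shingle set under each seed."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM))


def dedup(docs):
    """Return indices of kept docs, dropping near-duplicates of earlier ones."""
    buckets = defaultdict(list)   # (band index, band slice) -> kept doc indices
    signatures = {}               # kept doc index -> signature
    kept = []
    for idx, doc in enumerate(docs):
        sig = minhash(doc)
        keys = [(b, sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]
        # Candidates are kept docs sharing at least one band with this doc.
        candidates = {c for k in keys for c in buckets.get(k, [])}
        # Confirm each candidate with the signature-based Jaccard estimate.
        is_dup = any(
            sum(a == b for a, b in zip(sig, signatures[c])) / NUM_PERM >= THRESHOLD
            for c in candidates
        )
        if is_dup:
            continue
        signatures[idx] = sig
        for k in keys:
            buckets[k].append(idx)
        kept.append(idx)
    return kept


docs = [
    "alpha beta gamma delta epsilon",
    "alpha beta gamma delta epsilon",   # exact duplicate of the first doc
    "0123456789 zyxwvutsrq",            # shares no 5-grams with the others
]
print(dedup(docs))
```

The banding step is what makes this cheap at scale: only documents that land in the same bucket for at least one band are ever compared, so most pairs are never examined. A real pipeline would more likely use an existing library and word-level n-grams rather than this toy version.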
