jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.

How do I use these data sets to train new models? #139

Closed: win10ogod closed this issue 7 months ago

win10ogod commented 8 months ago

How do I use these datasets to train new models?
https://huggingface.co/datasets/Skywork/SkyPile-150B
https://huggingface.co/datasets/EleutherAI/proof-pile-2

win10ogod commented 8 months ago

@jzhang38 Can you provide a script? I'm a little confused about how to modify the script.

ChaosCodes commented 7 months ago

Hi, we are working on these two datasets and will release the scripts when we finish.
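
Until those scripts are released, here is a minimal sketch of the kind of preparation step involved: streaming one of the datasets from the Hugging Face Hub, tokenizing it with a Llama tokenizer, and packing the token ids into fixed-length 2048-token blocks. The tokenizer path, the `text` field name, and the `.npy` output are assumptions made for illustration; TinyLlama's actual pretraining pipeline (based on lit-gpt) expects its own packed binary format, so the output side would need to be adapted to the repo's data preparation scripts.

```python
# Minimal sketch, not the official TinyLlama preparation script.
# Streams a dataset, tokenizes it, and packs token ids into 2048-token blocks.
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer

BLOCK_SIZE = 2048                     # TinyLlama's pretraining context length
OUT_PATH = "skypile_packed.npy"       # hypothetical output file for this sketch

# Stream the dataset so the full corpus is never loaded into memory at once.
ds = load_dataset("Skywork/SkyPile-150B", split="train", streaming=True)

# Any Llama-compatible tokenizer works here; replace with the one used for training.
tokenizer = AutoTokenizer.from_pretrained("path/to/llama-tokenizer")

buffer, blocks = [], []
for i, example in enumerate(ds):
    text = example["text"]            # assumes the raw text is stored under "text"
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    buffer.extend(ids + [tokenizer.eos_token_id])  # separate documents with EOS
    # Emit as many full BLOCK_SIZE-token blocks as the buffer currently holds.
    while len(buffer) >= BLOCK_SIZE:
        blocks.append(buffer[:BLOCK_SIZE])
        buffer = buffer[BLOCK_SIZE:]
    if i >= 1000:                     # small demo cap; drop this for a real run
        break

np.save(OUT_PATH, np.asarray(blocks, dtype=np.uint16))
print(f"wrote {len(blocks)} blocks of {BLOCK_SIZE} tokens to {OUT_PATH}")
```

The same loop applies to EleutherAI/proof-pile-2 by swapping the dataset name; only the packing/output step needs to change to match whatever format the released TinyLlama scripts consume.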