huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

"datatrove" is missing from the examples folder #175

Closed RonanKMcGovern closed 1 month ago

martinjaggi commented 2 months ago

yes, this should be clarified better.

for datatrove, you can use the new nanosets to load large pretraining datasets, and tokenize them with datatrove

justHungryMan commented 2 months ago

Hi, could you provide an example of using datatrove with nanosets?

martinjaggi commented 2 months ago

@TJ-Solergibert

TJ-Solergibert commented 2 months ago

Hi @justHungryMan & @RonanKMcGovern! In #189 I changed the supported tokenizing mechanism from the Nanoset tokenizer tool that I developed (which is currently in main) to one that uses datatrove. I recommend looking at #189 for the full set of changes, but here is a short summary:

  • Nanosets now support documents tokenized with DocumentTokenizer from datatrove, through DatatroveFolderDataset.
    • Added datatrove[io,processing] dependency to nanosets flavour
    • Refactored tools/preprocess_data.py to tokenize documents w/ datatrove
    • Updated docs/nanoset.md
  • There is one slight change to the config file: dataset_path --> dataset_folder
  • 🚨Last commit installs datatrove from source🚨 (from the project folder run pip install -e '.[nanosets]')
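
For the config rename mentioned above (dataset_path --> dataset_folder), here is a minimal sketch of how an existing config could be migrated once it has been loaded into a Python dict. The nesting of the example config (data_stages, data, dataset) and the helper name rename_key are assumptions made for illustration; only the two key names come from the summary above, so check docs/nanoset.md for the actual layout.

```python
from typing import Any


def rename_key(node: Any, old: str = "dataset_path", new: str = "dataset_folder") -> Any:
    """Return a copy of a nested config with every `old` key renamed to `new`."""
    if isinstance(node, dict):
        # Rename the key if it matches, and recurse into the value either way.
        return {(new if k == old else k): rename_key(v, old, new) for k, v in node.items()}
    if isinstance(node, list):
        return [rename_key(item, old, new) for item in node]
    # Scalars (strings, numbers, None) are returned unchanged.
    return node


# Hypothetical config fragment: the nesting and the path are made up,
# only the dataset_path key is taken from the thread.
old_config = {
    "data_stages": [
        {"name": "stage-1", "data": {"dataset": {"dataset_path": "datasets/tokenized"}}}
    ]
}

new_config = rename_key(old_config)
print(new_config)
```

The helper builds a new structure rather than editing in place, so the original config stays available for comparison.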

#189 should get merged soon, but in the meantime you can check swiss-ai/nanotron, where we have already merged the same PR (swiss-ai/nanotron#6).

martinjaggi commented 1 month ago

It was merged