martinjaggi commented 2 months ago

Yes, this should be documented more clearly.

For datatrove, you can use the new Nanosets to load large pretraining datasets, and tokenize them with datatrove.

justHungryMan commented 2 months ago

Hi, could you provide an example of using datatrove with nanosets?

martinjaggi commented 2 months ago

@TJ-Solergibert

TJ-Solergibert commented 2 months ago

Hi @justHungryMan & @RonanKMcGovern! In #189 I changed the supported tokenization mechanism from the Nanoset tokenizer tool that I developed (which is currently in main) to one based on datatrove. I recommend looking at #189 for the full set of changes, but here is a short summary:

Nanosets now support documents tokenized with DocumentTokenizer from datatrove, read through DatatroveFolderDataset.

Added the datatrove[io,processing] dependency to the nanosets flavour

Refactored tools/preprocess_data.py to tokenize documents with datatrove

Updated docs/nanoset.md

There is one slight change to the config file: dataset_path --> dataset_folder

🚨Last commit installs datatrove from source🚨 (from the project folder run pip install -e '.[nanosets]')
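
For the config rename mentioned above (dataset_path --> dataset_folder), here is a minimal sketch of how an existing config could be migrated once it has been parsed into a Python dict. The helper name rename_dataset_key, the nesting of the example config, and the dataset location are all hypothetical and only for illustration; check docs/nanoset.md in #189 for the actual config layout.

```python
from typing import Any

OLD_KEY = "dataset_path"    # key name used before #189
NEW_KEY = "dataset_folder"  # key name used after #189


def rename_dataset_key(node: Any) -> Any:
    """Return a copy of a parsed config with OLD_KEY renamed to NEW_KEY.

    Hypothetical helper, not part of nanotron. It walks nested dicts and
    lists, so it does not depend on where the key sits in the config.
    """
    if isinstance(node, dict):
        return {
            (NEW_KEY if key == OLD_KEY else key): rename_dataset_key(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [rename_dataset_key(item) for item in node]
    return node


# Assumed example structure; the real config may be laid out differently.
old_config = {
    "data_stages": [
        {
            "name": "stage-1",
            "data": {"dataset": {"dataset_path": "datasets/my-tokenized-corpus"}},
        }
    ]
}

new_config = rename_dataset_key(old_config)
```

The helper builds a new structure instead of editing in place, so the original config stays available for comparison. For a single config file, renaming the key by hand is just as good.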


martinjaggi commented 1 month ago

It was merged

huggingface / nanotron

"datatrove" is missing from the examples folder #175

#189 should get merged soon, but in the meantime you can check swiss-ai/nanotron, where we have merged the same PR as swiss-ai/nanotron#6.