Closed: RonanKMcGovern closed this issue 1 month ago
Hi, could you provide an example of using datatrove with nanosets?
@TJ-Solergibert
Hi @justHungryMan & @RonanKMcGovern! In #189 I changed the supported tokenizing mechanism from the Nanoset tokenizer tool that I developed (still in `main`) to the one based on `datatrove`. I recommend checking #189 for the full list of changes, but here is a short summary:
- Nanosets now support documents tokenized with `DocumentTokenizer` from `datatrove`, loaded through `DatatroveFolderDataset`.
- Added the `datatrove[io,processing]` dependency to the `nanosets` flavour.
- Refactored `tools/preprocess_data.py` to tokenize documents with `datatrove`.
- Updated `docs/nanoset.md`.
- There is one slight change to the config file: `dataset_path` --> `dataset_folder`.
- 🚨 The last commit installs `datatrove` from source 🚨 (from the project folder, run `pip install -e '.[nanosets]'`).
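For the `dataset_path` --> `dataset_folder` rename, the YAML update might look like the fragment below. The field nesting shown here is an assumption for illustration; check `docs/nanoset.md` for the exact schema.

```yaml
# Before (hypothetical nesting; see docs/nanoset.md for the exact schema):
# data:
#   dataset:
#     dataset_path: datasets/my_dataset

# After: point to the folder produced by the datatrove tokenization step
data:
  dataset:
    dataset_folder: datasets/my_tokenized_dataset
```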
It was merged.
Yes, this should be clarified better.
For datatrove: you can indeed use the new Nanosets to load large pretraining datasets and tokenize them using datatrove.
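To make the loading side concrete, here is a minimal, library-independent sketch of the idea behind a tokenized-folder dataset: token ids are packed into flat binary files (2 bytes per token when the vocabulary fits in uint16), and the loader slices the stream into fixed-length samples of `seq_len + 1` tokens so training can shift inputs and targets by one. This is an illustration of the concept only, not datatrove's actual implementation; in practice you would use `DocumentTokenizer` to produce the files and `DatatroveFolderDataset` to read them.

```python
import os
import tempfile

import numpy as np

def write_token_file(path: str, tokens) -> None:
    """Pack token ids into a flat binary file, 2 bytes (uint16) per token.
    (Sketch of the packed-token layout; not datatrove's real writer.)"""
    np.asarray(tokens, dtype=np.uint16).tofile(path)

def read_samples(path: str, seq_len: int):
    """Slice the flat token stream into samples of seq_len + 1 tokens;
    the extra token lets training shift inputs/targets by one position."""
    tokens = np.fromfile(path, dtype=np.uint16)
    n_samples = len(tokens) // (seq_len + 1)
    return [
        tokens[i * (seq_len + 1):(i + 1) * (seq_len + 1)]
        for i in range(n_samples)
    ]

# Tiny demo: 10 tokens with seq_len=4 -> two samples of 5 tokens each.
shard = os.path.join(tempfile.mkdtemp(), "shard_000.ds")
write_token_file(shard, range(10))
samples = read_samples(shard, seq_len=4)
```

Because samples are just offsets into flat binary files, a loader built this way can stream very large pretraining corpora without holding them in memory.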