huu4ontocord / rio

Text pre-processing for NLP datasets
Apache License 2.0

Add parallel processing of datasets #5

Open huu4ontocord opened 2 years ago

huu4ontocord commented 2 years ago

If we assume memory and disk are limited compared to the size of some datasets (the English split is around 1 TB), then we want to read in a few shards at a time and process inter-shard batches in parallel, up to the maximum number of CPUs (or, if we have GPUs, up to the maximum number of GPUs). We should read, process, and write in parallel. We can create many .jsonl files named by shard number, and cat them all together into shards.
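A minimal sketch of the idea using Python's `multiprocessing` — read a batch of shards, fan them out to one worker per CPU, and have each worker write its own `.jsonl` file named by shard number. The function names (`process_shard`, `clean_text`) and the per-record `"text"` field are assumptions for illustration, not part of the repo's API; a real pre-processing step would replace `clean_text`:

```python
import json
import multiprocessing as mp
from pathlib import Path


def clean_text(text):
    # Placeholder pre-processing step; the real pipeline would go here.
    return " ".join(text.split())


def process_shard(args):
    """Process one shard's lines and write them to shard_<N>.jsonl."""
    shard_id, lines = args
    out_path = Path(f"shard_{shard_id:05d}.jsonl")
    with out_path.open("w", encoding="utf-8") as f:
        for line in lines:
            record = json.loads(line)
            record["text"] = clean_text(record["text"])
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return str(out_path)


def parallel_process(shards, num_workers=None):
    """Process shards in parallel, one worker per CPU by default.

    `shards` is an iterable of lists of JSON lines (a few shards read
    into memory at a time, per the issue description).
    """
    num_workers = num_workers or mp.cpu_count()
    with mp.Pool(num_workers) as pool:
        return pool.map(process_shard, enumerate(shards))
```

The resulting per-shard files can then be concatenated (e.g. `cat shard_*.jsonl > combined.jsonl`) into larger output shards, keeping each worker's writes independent so no locking is needed.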