huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

In-file parallelism #74

Open jordane95 opened 8 months ago

jordane95 commented 8 months ago

The current parallelization strategy assigns different files in a directory to different workers.

In many situations this leads to load imbalance, for example when the input files vary widely in size or the input is a single giant file.

Would it be possible to implement in-file parallelism, so that different lines of each file are assigned to different workers?
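To make the request concrete, here is a minimal sketch of one way in-file parallelism could work for line-oriented files: cut the file into byte ranges and let each worker read only the lines that start inside its range. This is plain standard-library Python, not datatrove's API; `byte_ranges` and `read_lines_in_range` are hypothetical names used only for illustration.

```python
import os


def byte_ranges(path, num_workers):
    """Split [0, file_size) into num_workers contiguous byte ranges.

    Hypothetical helper for illustration; not part of datatrove.
    """
    size = os.path.getsize(path)
    step = -(-size // num_workers)  # ceiling division
    return [
        (min(i * step, size), min((i + 1) * step, size))
        for i in range(num_workers)
    ]


def read_lines_in_range(path, start, end):
    """Yield the lines whose first byte falls inside [start, end)."""
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and read through the next newline. This
            # skips a line that began before `start` (it belongs to the
            # previous range) and lands on the first line starting at or
            # after `start`.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            yield line.rstrip(b"\n")
```

Each worker would call `read_lines_in_range` with its own `(start, end)` pair. A line that straddles a boundary is read in full by the worker whose range contains its first byte, so no line is dropped or duplicated. The sketch assumes an uncompressed file with `\n` line endings; it would not work as-is on gzip-compressed input, which cannot be seeked by byte offset in this way.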

shizhediao commented 1 week ago

Same question. I have a large data file, a single JSONL. Should I split it into smaller files to take advantage of multiple workers? Is there a straightforward way to do this?
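For reference, the manual splitting workaround could look roughly like the sketch below: stream the big file once and distribute its lines round-robin across a fixed number of shard files, which can then be processed with the existing per-file parallelism. It uses only the standard library; `split_jsonl` and the `shard_XXXXX.jsonl` naming are assumptions for illustration, not something datatrove provides.

```python
import os


def split_jsonl(src_path, out_dir, num_shards):
    """Distribute the lines of one JSONL file round-robin over num_shards files.

    Hypothetical helper for illustration; returns the shard paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = [
        os.path.join(out_dir, f"shard_{i:05d}.jsonl") for i in range(num_shards)
    ]
    outs = [open(p, "wb") for p in paths]
    try:
        with open(src_path, "rb") as src:
            for i, line in enumerate(src):
                # Make sure a final line without a trailing newline stays
                # a complete record in its shard.
                if not line.endswith(b"\n"):
                    line += b"\n"
                outs[i % num_shards].write(line)
    finally:
        for out in outs:
            out.close()
    return paths
```

Round-robin assignment keeps the shards close to equal in line count without knowing the total number of lines in advance, at the cost of not preserving the original line order within a single shard file sequence.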