The current parallel strategy assigns different files in a directory to different workers.
In many situations this can cause load imbalance, for example when the input files vary widely in size or when the input is a single giant file.
Would it be possible to implement in-file parallelism, so that for each file, different lines are assigned to different workers?
Same question. I have a large data file, which is a single JSONL file. Should I split it into smaller files to take advantage of multiple workers? Is there a straightforward way to do this?
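As a possible workaround until in-file parallelism exists, here is a minimal sketch of splitting a single JSONL file into several shards so that the per-file strategy described above has something to distribute. It uses only the Python standard library. The function name `split_jsonl`, the `.partNNNN.jsonl` naming scheme, and the round-robin assignment are my own assumptions for illustration and are not part of the tool.

```python
from pathlib import Path


def split_jsonl(src, out_dir, num_shards):
    """Split one JSONL file into num_shards files, round-robin by line.

    Hypothetical helper: the name and output naming scheme are illustrative.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # One output file per shard, e.g. data.part0000.jsonl, data.part0001.jsonl
    paths = [out_dir / f"{src.stem}.part{i:04d}.jsonl" for i in range(num_shards)]
    shards = [p.open("w", encoding="utf-8") for p in paths]
    try:
        with src.open("r", encoding="utf-8") as f:
            # Stream line by line so the whole file never sits in memory.
            for i, line in enumerate(f):
                # The last line of the source may lack a trailing newline.
                if not line.endswith("\n"):
                    line += "\n"
                shards[i % num_shards].write(line)
    finally:
        for s in shards:
            s.close()
    return paths
```

For example, `split_jsonl("data.jsonl", "shards", 8)` would write eight files into `shards/`, which could then be passed as the input directory. Round-robin assignment keeps the shards close to equal in line count, but records no longer appear in their original global order, so this only fits workloads where order across records does not matter.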