If memory and disk are limited relative to the size of some datasets (en alone is around 1 TB), then we want to read in only a few shards at a time and process batches across those shards in parallel, up to the number of available CPUs (or, if we have GPUs, up to the number of GPUs). Reading, processing, and writing should all overlap. Each worker can write a .jsonl file named by shard #, and we can cat those files together into the final shards.
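A minimal sketch of this pipeline, assuming each input shard is itself JSONL and that `process_record` stands in for the real per-record transform (both names are hypothetical). One worker runs per CPU, each worker holds only its own shard in memory, writes a per-shard .jsonl named by shard #, and the driver concatenates the parts in shard order at the end:

```python
import json
import multiprocessing as mp
from pathlib import Path

def process_record(record):
    # Hypothetical transform; replace with the real processing step.
    record["text"] = record.get("text", "").strip()
    return record

def process_shard(args):
    """Read one input shard, process its records, write one .jsonl named by shard #."""
    shard_id, in_path, out_dir = args
    out_path = Path(out_dir) / f"shard-{shard_id:05d}.jsonl"
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            rec = process_record(json.loads(line))
            fout.write(json.dumps(rec) + "\n")
    return out_path

def run(in_paths, out_dir, final_path):
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    jobs = [(i, p, out_dir) for i, p in enumerate(in_paths)]
    # One worker per CPU; only a handful of shards are resident at any time.
    with mp.Pool(processes=mp.cpu_count()) as pool:
        parts = pool.map(process_shard, jobs)
    # "cat" the per-shard outputs together, in shard order, into the final file.
    with open(final_path, "w") as fout:
        for part in sorted(parts):
            fout.write(Path(part).read_text())
```

For GPU work the same shape applies, but the pool size would be the number of GPUs rather than CPUs, with each worker pinned to one device.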