eu9ene opened 3 months ago
Another approach would be to process the corpus in chunks on one machine and, if the machine is preempted, resume from the last unprocessed chunk. This can work, but it takes longer than parallelizing across machines. One more thing to take into account: we need to calculate priors on a large part of the original parallel corpus before we can start processing anything in smaller chunks. I think a 100M sample might be sufficient.
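A minimal sketch of the resume-from-last-chunk idea, assuming a line-based corpus and a per-line processing function (all names here are hypothetical, not the project's actual API). Each chunk's output is written to a temp file and atomically renamed on completion, so a preempted run can skip chunks whose output files already exist:

```python
import os


def process_chunks(input_path, out_dir, chunk_size, process_fn):
    """Process a corpus in fixed-size line chunks, skipping chunks whose
    output already exists so a preempted run resumes where it left off."""
    os.makedirs(out_dir, exist_ok=True)
    chunk, idx = [], 0
    with open(input_path, encoding="utf-8") as f:
        for line in f:
            chunk.append(line)
            if len(chunk) == chunk_size:
                _process_if_missing(chunk, idx, out_dir, process_fn)
                chunk, idx = [], idx + 1
    if chunk:  # trailing partial chunk
        _process_if_missing(chunk, idx, out_dir, process_fn)


def _process_if_missing(chunk, idx, out_dir, process_fn):
    out_path = os.path.join(out_dir, f"chunk_{idx:05d}.done")
    if os.path.exists(out_path):
        return  # completed before preemption; skip on resume
    tmp_path = out_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as out:
        out.writelines(process_fn(line) for line in chunk)
    # Atomic rename marks the chunk complete only once fully written,
    # so a kill mid-chunk never leaves a corrupt "done" file behind.
    os.replace(tmp_path, out_path)
```

The atomic-rename step is what makes resumption safe: a chunk is either fully done or will be redone from scratch, never half-counted.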
Chunking on one machine has already been implemented in #763
It would be cheaper to split the work into chunks and run them on smaller preemptible machines rather than on one big standard instance. The downside is that this increases the complexity of the task graph and makes it harder to maintain.