mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Process alignments in chunks #739

Open eu9ene opened 3 months ago

eu9ene commented 3 months ago

It will be cheaper to split the alignment processing into chunks and run them on smaller preemptible machines rather than on one big standard instance. The downside is that this will increase the complexity of the graph and make it harder to maintain.
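For illustration, a minimal sketch of the splitting step, assuming the corpus is available as an iterable of lines. The name `split_into_chunks` and the `chunk_size` parameter are hypothetical and not taken from the pipeline; each yielded chunk could then be handed to a separate task.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def split_into_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `chunk_size` lines (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(lines)
    while True:
        # islice consumes the next `chunk_size` items without loading the whole corpus.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

Because the helper is a generator, only one chunk is held in memory at a time, which matters for corpora with many millions of lines.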

eu9ene commented 3 months ago

Another approach would be to process the chunks sequentially on one machine and, if it is preempted, continue from the last unprocessed chunk. This can work, but it takes longer than processing the chunks in parallel. One more thing to take into account: we need to calculate priors on a large part of the original parallel corpus before we can start processing anything in smaller chunks. I think a 100M sample might be sufficient.
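A rough sketch of the resume-after-preemption idea, assuming progress is recorded in a small JSON state file on storage that survives a restart. The names `process_resumable`, `process_chunk`, and `state_path` are hypothetical, and the prior-calculation step is not shown; this is not the implementation used in the pipeline.

```python
import json
import os
from typing import Callable, List, Sequence


def process_resumable(
    chunks: Sequence[str],
    process_chunk: Callable[[int, str], None],
    state_path: str,
) -> List[int]:
    """Process chunks in order and return the indices handled in this run.

    Progress is saved after every chunk, so a restarted run skips
    chunks that were already completed (hypothetical helper).
    """
    done = 0
    if os.path.exists(state_path):
        with open(state_path, encoding="utf-8") as f:
            done = json.load(f)["chunks_done"]

    processed_now = []
    for index in range(done, len(chunks)):
        process_chunk(index, chunks[index])
        # Write to a temporary file, then rename, so a preemption
        # during the write cannot leave a corrupt state file.
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump({"chunks_done": index + 1}, f)
        os.replace(tmp_path, state_path)
        processed_now.append(index)
    return processed_now
```

If the machine is preempted in the middle of a chunk, that chunk is simply reprocessed on the next run, so `process_chunk` should be safe to repeat.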

eu9ene commented 3 months ago

Chunking on one machine has already been implemented in #763.