mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0
143 stars 31 forks source link

Process alignments in chunks #739

Open eu9ene opened 1 month ago

eu9ene commented 1 month ago

It will be cheaper to split it into chunks and run on smaller preemptible machines rather than one big standard instance. The downside is that it will increase the complexity of the graph and will be harder to maintain.

eu9ene commented 1 month ago

Another approach would be to process it in chunks on one machine and if it's preempted, continue from the last unprocessed chunk. This approach can work but it takes longer to process compared to parallelization. One more thing to take into account is that we need to calculate priors on a large part of the original parallel corpus first before we can start processing anything in smaller chunks. I think a 100M sample might be sufficient.

eu9ene commented 1 month ago

Chunking on one machine has already been implemented in #763