mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Process alignments in chunks #763

Closed eu9ene closed 3 months ago

eu9ene commented 3 months ago

Processing alignments in 100M chunks helps prevent OOM and allows using a smaller machine. It also makes the step scale to higher-resource languages.

It completed successfully for the en-uk language pair that failed before: log.
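For readers who want the idea in code, here is a minimal sketch of chunked processing. It is not the code from this PR: `read_chunks`, `align_in_chunks`, and the `align_chunk` callback are hypothetical names, and the toy aligner in the demo only counts tokens. The point is that only one chunk of sentence pairs is held in memory at a time.

```python
import io
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TextIO, Tuple

Pair = Tuple[str, str]


def read_chunks(items: Iterable[Pair], chunk_size: int) -> Iterator[List[Pair]]:
    """Yield successive lists of at most `chunk_size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def align_in_chunks(
    src: TextIO,
    trg: TextIO,
    out: TextIO,
    align_chunk: Callable[[List[Pair]], List[str]],  # hypothetical aligner callback
    chunk_size: int = 100_000_000,
) -> int:
    """Align a parallel corpus chunk by chunk; return the number of chunks."""
    chunks_done = 0
    # zip() reads both files lazily, so memory use is bounded by chunk_size.
    for pairs in read_chunks(zip(src, trg), chunk_size):
        for line in align_chunk(pairs):
            out.write(line + "\n")
        chunks_done += 1
    return chunks_done


# Toy stand-in for a real aligner: emits "<src tokens>-<trg tokens>" per pair.
def toy_aligner(pairs: List[Pair]) -> List[str]:
    return [f"{len(s.split())}-{len(t.split())}" for s, t in pairs]


demo_out = io.StringIO()
demo_chunks = align_in_chunks(
    io.StringIO("a\nb c\nd\n"), io.StringIO("1\n2 3\n4\n"), demo_out, toy_aligner, chunk_size=2
)
```

With three sentence pairs and `chunk_size=2`, the demo processes two chunks and writes one output line per pair in the original order.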

If we don't want to split this into multiple Taskcluster tasks (to keep the graph simple), we could also implement a mechanism that resumes from the last unprocessed chunk after a pre-emption.
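One possible shape for such a resume mechanism is sketched below. This is an illustration under assumptions, not something implemented in this PR: the `state.json` file, the `part.N.txt` naming, and the function names are all hypothetical, and the sketch assumes the working directory survives the pre-emption, which would need separate handling on Taskcluster.

```python
import json
import os
import tempfile
from typing import Callable, List


def load_done(state_path: str) -> int:
    """Return how many chunks were fully processed in earlier runs."""
    if not os.path.exists(state_path):
        return 0
    with open(state_path) as f:
        return json.load(f)["chunks_done"]


def save_done(state_path: str, done: int) -> None:
    """Record progress atomically so a crash cannot leave a half-written file."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"chunks_done": done}, f)
    os.replace(tmp_path, state_path)


def process_with_resume(
    chunks: List[List[str]],
    work_dir: str,
    process: Callable[[List[str]], List[str]],  # hypothetical per-chunk worker
) -> List[int]:
    """Process chunks in order, skipping those finished before a restart.

    Returns the indices of the chunks processed in this run.
    """
    state_path = os.path.join(work_dir, "state.json")
    start = load_done(state_path)
    processed = []
    for i in range(start, len(chunks)):
        part_path = os.path.join(work_dir, f"part.{i}.txt")
        # A chunk interrupted midway is simply redone: its part file is rewritten.
        with open(part_path, "w") as f:
            f.writelines(line + "\n" for line in process(chunks[i]))
        save_done(state_path, i + 1)  # mark the chunk done only after its output is written
        processed.append(i)
    return processed
```

Progress is recorded only after a chunk's output is fully written, so the worst case after a pre-emption is repeating a single chunk rather than the whole corpus.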

Currently, I have to use standard instances instead of spot ones for these steps, because alignment can take several days on a larger student corpus (400M sentences):

  worker-classes:
    default: gcp-spot
    alignments-original: gcp-standard
    alignments-backtranslated: gcp-standard
    alignments-student: gcp-standard
    shortlist: gcp-standard

Closes #721