Closed: gregtatum closed this issue 3 months ago
I think proper tokenization can fix this. See #507.
Proper tokenization is now implemented and seems to be working for the current languages. We might split the corpus into chunks in the future, for example like in #715, but that isn't necessary as long as alignment works without it. Ideally, we would split it into multiple tasks and run them on preemptible instances, but this increases complexity. I'll add a task about this to the optimization meta issue.
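For reference, a minimal sketch of the chunking idea (a hypothetical helper, not the pipeline's actual implementation): split the parallel corpus into fixed-size blocks of sentence pairs so that each alignment task only ever holds a bounded slice of the corpus.

```python
import itertools
from pathlib import Path


def chunk_corpus(src_path: str, trg_path: str, out_dir: str, lines_per_chunk: int = 1_000_000):
    """Split a parallel corpus into aligned chunks of at most `lines_per_chunk` sentence pairs.

    Hypothetical helper for illustration; in the real pipeline this would
    likely be its own task producing separate artifacts per chunk.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(src_path, encoding="utf-8") as src, open(trg_path, encoding="utf-8") as trg:
        for i in itertools.count():
            src_lines = list(itertools.islice(src, lines_per_chunk))
            trg_lines = list(itertools.islice(trg, lines_per_chunk))
            if not src_lines:
                break
            # The two sides must stay line-aligned, chunk by chunk.
            assert len(src_lines) == len(trg_lines), "source/target line counts must match"
            (out / f"src.{i:04d}").write_text("".join(src_lines), encoding="utf-8")
            (out / f"trg.{i:04d}").write_text("".join(trg_lines), encoding="utf-8")
```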
I haven't looked into this too deeply, but we are failing with out-of-memory (OOM) errors when computing alignments with eflomal.
https://firefox-ci-tc.services.mozilla.com/tasks/WoiZo-oDQAuRuN_yTu2EKw
Perhaps there is a more efficient way to do this, or we need chunking. Right now we are just increasing the machine's memory size. There could also be a memory leak in the implementation. It's worth looking into, especially when we move on to training high-resource languages.
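One possible shape for bounding memory, assuming the chunked corpus layout sketched above: align each chunk separately and concatenate the per-chunk alignment files in order. Since word alignments are emitted one line per sentence pair, plain concatenation preserves the correspondence with the original corpus. The `eflomal-align` invocation below is a placeholder; the actual command name and flags depend on the installed eflomal version.

```python
import subprocess
from pathlib import Path


def align_chunks(chunk_dir: str, out_path: str):
    """Run the aligner on each chunk and concatenate the results in order.

    Sketch only: the eflomal call is a placeholder, and error handling /
    cleanup of intermediate files is omitted.
    """
    chunks = sorted(Path(chunk_dir).glob("src.*"))
    with open(out_path, "w", encoding="utf-8") as merged:
        for src in chunks:
            trg = src.with_name(src.name.replace("src.", "trg."))
            fwd = src.with_name(src.name.replace("src.", "aln."))
            # Placeholder invocation: check the installed eflomal CLI for the
            # real command name and flags before using this.
            subprocess.run(
                ["eflomal-align", "-s", str(src), "-t", str(trg), "-f", str(fwd)],
                check=True,
            )
            merged.write(fwd.read_text(encoding="utf-8"))
```

This would also make it easier to run the chunks as separate, preemptible tasks, since each chunk's memory footprint is fixed regardless of total corpus size.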