mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Examine strategies for more efficient alignments #663

Closed: gregtatum closed this issue 3 months ago

gregtatum commented 5 months ago

I haven't looked into this too deeply, but we are failing with OOM when computing alignments with eflomal.

https://firefox-ci-tc.services.mozilla.com/tasks/WoiZo-oDQAuRuN_yTu2EKw

Perhaps there is a more efficient way to do this, or we need chunking. Right now we are just increasing the machine memory size. There could also be a memory leak in the implementation. It might be worth looking into, especially when we start training high-resource languages.

[task 2024-06-02T17:19:20.158Z] /fetches/mono.en.zst : 25511 MB...     
[task 2024-06-02T17:19:20.158Z]                                                                                
[task 2024-06-02T17:19:20.158Z] /builds/worker/fetches/mono.en.zst: 26827774098 bytes 
[task 2024-06-02T17:19:21.064Z] [alignments] Using provided priors: /builds/worker/fetches/corpus.priors
[task 2024-06-02T17:19:21.064Z] [alignments] Calculating alignments...
[task 2024-06-02T18:15:25.545Z] [eflomal] Prepared 200000000 sentences for alignment
[task 2024-06-02T18:15:25.545Z] [eflomal] Reading lexical priors...
[task 2024-06-02T18:17:15.950Z] [eflomal] 15689941 (of 25390768) pairs of lexical priors used
[task 2024-06-02T18:18:28.093Z] /builds/worker/.local/lib/python3.10/site-packages/eflomal/bin/eflomal -m 3 -s /tmp/tmpoojka2as -t /tmp/tmpd_2c0qs7 -n 3 -N 0.2 -1 2 -2 1 -3 2 -f /builds/worker/artifacts/tmp/aln.fwd -r /builds/worker/artifacts/tmp/aln.rev -p /tmp/tmpgxzt5ktb
[task 2024-06-02T18:28:18.390Z] Read texts (200000000 sentences): 590.297 s
[task 2024-06-02T18:28:18.390Z] Vocabulary sizes are 21977072 (source), 14776735 (target)
[task 2024-06-02T18:29:23.088Z] Created alignment structures: 64.692 s
[task 2024-06-02T18:29:45.552Z] Created alignment structures: 87.154 s
[task 2024-06-02T18:30:12.480Z] Randomized alignment: 49.392 s
[task 2024-06-02T18:30:12.480Z] Aligning with model 1 (2 iterations)
[task 2024-06-02T18:30:32.341Z] Randomized alignment: 46.788 s
[task 2024-06-02T18:30:32.341Z] Aligning with model 1 (2 iterations)
[task 2024-06-02T18:38:40.182Z] Traceback (most recent call last):
[task 2024-06-02T18:38:40.254Z]   File "/builds/worker/checkouts/vcs/pipeline/alignments/align.py", line 227, in <module>
[task 2024-06-02T18:38:40.254Z]     main()
[task 2024-06-02T18:38:40.254Z]   File "/builds/worker/checkouts/vcs/pipeline/alignments/align.py", line 216, in main
[task 2024-06-02T18:38:40.254Z]     run(
[task 2024-06-02T18:38:40.254Z]   File "/builds/worker/checkouts/vcs/pipeline/alignments/align.py", line 53, in run
[task 2024-06-02T18:38:40.254Z]     fwd_path, rev_path = align(
[task 2024-06-02T18:38:40.254Z]   File "/builds/worker/checkouts/vcs/pipeline/alignments/align.py", line 97, in align
[task 2024-06-02T18:38:40.254Z]     aligner.align(
[task 2024-06-02T18:38:40.254Z]   File "/builds/worker/.local/lib/python3.10/site-packages/eflomal/__init__.py", line 72, in align
[task 2024-06-02T18:38:40.271Z]     align(srcf.name, trgf.name,
[task 2024-06-02T18:38:40.271Z]   File "python/eflomal/eflomal.pyx", line 161, in eflomal.cython.align
[task 2024-06-02T18:38:40.502Z]   File "/usr/lib/python3.10/subprocess.py", line 526, in run
[task 2024-06-02T18:38:40.575Z]     raise CalledProcessError(retcode, process.args,
[task 2024-06-02T18:38:40.576Z] subprocess.CalledProcessError: Command '['/builds/worker/.local/lib/python3.10/site-packages/eflomal/bin/eflomal', '-m', '3', '-s', '/tmp/tmpoojka2as', '-t', '/tmp/tmpd_2c0qs7', '-n', '3', '-N', '0.2', '-1', '2', '-2', '1', '-3', '2', '-f', '/builds/worker/artifacts/tmp/aln.fwd', '-r', '/builds/worker/artifacts/tmp/aln.rev', '-p', '/tmp/tmpgxzt5ktb']' died with <Signals.SIGKILL: 9>.
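
For context on the last line of the log: on Linux, a process killed by the kernel's out-of-memory killer receives SIGKILL, which Python's `subprocess` reports as a negative return code. The sketch below shows one way a wrapper could flag that case. It is not code from `align.py`; the helper name `was_sigkilled` is hypothetical, and SIGKILL alone does not prove an OOM kill, since any process with permission can send it. It assumes a POSIX system.

```python
import signal
import subprocess
import sys


def was_sigkilled(error: subprocess.CalledProcessError) -> bool:
    """Return True if the child process was terminated by SIGKILL.

    On POSIX, subprocess sets returncode to -N when the child dies
    from signal N. SIGKILL is what the Linux OOM killer sends, but
    other senders are possible, so treat this as a hint only.
    """
    return error.returncode == -signal.SIGKILL


def run_and_classify(command: list) -> str:
    """Run a command and return 'ok', 'sigkill', or 'failed'."""
    try:
        subprocess.run(command, check=True)
    except subprocess.CalledProcessError as error:
        return "sigkill" if was_sigkilled(error) else "failed"
    return "ok"


if __name__ == "__main__":
    # Demonstration only: a child process that sends SIGKILL to itself.
    self_kill = [
        sys.executable,
        "-c",
        "import os, signal; os.kill(os.getpid(), signal.SIGKILL)",
    ]
    print(run_and_classify(self_kill))
```

A wrapper like this could log a clearer message (for example, suggesting a larger worker or a smaller input) instead of only surfacing the raw `CalledProcessError`.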

eu9ene commented 4 months ago

I think proper tokenization can fix this. See #507

eu9ene commented 3 months ago

Proper tokenization is now implemented and seems to work for the current languages. We might split the corpus into chunks in the future, for example as in #715, but that isn't necessary as long as alignment works without it. Ideally, we would split it into multiple tasks and run them on preemptible instances, but this increases complexity. I'll add a task about this to the optimization meta issue.
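
To make the chunking idea concrete, here is a minimal sketch of aligning a parallel corpus in fixed-size chunks so that only one chunk is held in memory at a time. It is not the approach from #715 or the pipeline's actual code. The `align_chunk` callable is a hypothetical stand-in for whatever invokes the aligner on one chunk, and `chunk_size` is an arbitrary illustrative default. It also assumes that aligning chunks independently gives acceptable quality, which is more plausible when lexical priors are supplied, as in the log above, but would need to be verified.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]


def iter_chunks(
    src_lines: Iterable[str], trg_lines: Iterable[str], chunk_size: int
) -> Iterator[List[Pair]]:
    """Yield lists of at most chunk_size (source, target) sentence pairs."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # zip stops at the shorter input, so callers should make sure both
    # sides of the corpus have the same number of lines.
    pairs = zip(src_lines, trg_lines)
    while True:
        chunk = list(islice(pairs, chunk_size))
        if not chunk:
            return
        yield chunk


def align_in_chunks(
    src_lines: Iterable[str],
    trg_lines: Iterable[str],
    align_chunk: Callable[[List[Pair]], List[str]],
    chunk_size: int = 1_000_000,
) -> Iterator[str]:
    """Align the corpus chunk by chunk, yielding one alignment per pair.

    align_chunk is a hypothetical callback that receives one chunk of
    sentence pairs and returns one alignment string per pair, in order.
    """
    for chunk in iter_chunks(src_lines, trg_lines, chunk_size):
        alignments = align_chunk(chunk)
        if len(alignments) != len(chunk):
            raise RuntimeError("aligner returned a different number of lines")
        yield from alignments


if __name__ == "__main__":
    # Dummy aligner for illustration: links the first tokens only.
    def dummy_align(chunk: List[Pair]) -> List[str]:
        return ["0-0" for _ in chunk]

    source = ["hello world", "good morning", "thank you"]
    target = ["hallo welt", "guten morgen", "danke"]
    for line in align_in_chunks(source, target, dummy_align, chunk_size=2):
        print(line)
```

Because both inputs are consumed lazily, file handles can be passed in directly, and each chunk could in principle be dispatched as a separate task, at the cost of the extra complexity mentioned above.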