mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Examine strategies for more efficient alignments #663

Open gregtatum opened 3 weeks ago

gregtatum commented 3 weeks ago

I haven't looked into this too deeply, but we are failing with an OOM (out of memory) when computing alignments with eflomal.

https://firefox-ci-tc.services.mozilla.com/tasks/WoiZo-oDQAuRuN_yTu2EKw

Perhaps there is a more efficient way to do this, or we need to chunk the input. Right now we are just increasing the machine memory size. There could also be a memory leak in the implementation. It might be worth looking into, especially when we go to train high-resource languages. A sketch of one possible chunking approach follows the log below.

[task 2024-06-02T17:19:20.158Z] /fetches/mono.en.zst : 25511 MB...
[task 2024-06-02T17:19:20.158Z] /builds/worker/fetches/mono.en.zst: 26827774098 bytes
[task 2024-06-02T17:19:21.064Z] [alignments] Using provided priors: /builds/worker/fetches/corpus.priors
[task 2024-06-02T17:19:21.064Z] [alignments] Calculating alignments...
[task 2024-06-02T18:15:25.545Z] [eflomal] Prepared 200000000 sentences for alignment
[task 2024-06-02T18:15:25.545Z] [eflomal] Reading lexical priors...
[task 2024-06-02T18:17:15.950Z] [eflomal] 15689941 (of 25390768) pairs of lexical priors used
[task 2024-06-02T18:18:28.093Z] /builds/worker/.local/lib/python3.10/site-packages/eflomal/bin/eflomal -m 3 -s /tmp/tmpoojka2as -t /tmp/tmpd_2c0qs7 -n 3 -N 0.2 -1 2 -2 1 -3 2 -f /builds/worker/artifacts/tmp/aln.fwd -r /builds/worker/artifacts/tmp/aln.rev -p /tmp/tmpgxzt5ktb
[task 2024-06-02T18:28:18.390Z] Read texts (200000000 sentences): 590.297 s
[task 2024-06-02T18:28:18.390Z] Vocabulary sizes are 21977072 (source), 14776735 (target)
[task 2024-06-02T18:29:23.088Z] Created alignment structures: 64.692 s
[task 2024-06-02T18:29:45.552Z] Created alignment structures: 87.154 s
[task 2024-06-02T18:30:12.480Z] Randomized alignment: 49.392 s
[task 2024-06-02T18:30:12.480Z] Aligning with model 1 (2 iterations)
[task 2024-06-02T18:30:32.341Z] Randomized alignment: 46.788 s
[task 2024-06-02T18:30:32.341Z] Aligning with model 1 (2 iterations)
[task 2024-06-02T18:38:40.182Z] Traceback (most recent call last):
[task 2024-06-02T18:38:40.254Z]   File "/builds/worker/checkouts/vcs/pipeline/alignments/align.py", line 227, in <module>
[task 2024-06-02T18:38:40.254Z]     main()
[task 2024-06-02T18:38:40.254Z]   File "/builds/worker/checkouts/vcs/pipeline/alignments/align.py", line 216, in main
[task 2024-06-02T18:38:40.254Z]     run(
[task 2024-06-02T18:38:40.254Z]   File "/builds/worker/checkouts/vcs/pipeline/alignments/align.py", line 53, in run
[task 2024-06-02T18:38:40.254Z]     fwd_path, rev_path = align(
[task 2024-06-02T18:38:40.254Z]   File "/builds/worker/checkouts/vcs/pipeline/alignments/align.py", line 97, in align
[task 2024-06-02T18:38:40.254Z]     aligner.align(
[task 2024-06-02T18:38:40.254Z]   File "/builds/worker/.local/lib/python3.10/site-packages/eflomal/__init__.py", line 72, in align
[task 2024-06-02T18:38:40.271Z]     align(srcf.name, trgf.name,
[task 2024-06-02T18:38:40.271Z]   File "python/eflomal/eflomal.pyx", line 161, in eflomal.cython.align
[task 2024-06-02T18:38:40.502Z]   File "/usr/lib/python3.10/subprocess.py", line 526, in run
[task 2024-06-02T18:38:40.575Z]     raise CalledProcessError(retcode, process.args,
[task 2024-06-02T18:38:40.576Z] subprocess.CalledProcessError: Command '['/builds/worker/.local/lib/python3.10/site-packages/eflomal/bin/eflomal', '-m', '3', '-s', '/tmp/tmpoojka2as', '-t', '/tmp/tmpd_2c0qs7', '-n', '3', '-N', '0.2', '-1', '2', '-2', '1', '-3', '2', '-f', '/builds/worker/artifacts/tmp/aln.fwd', '-r', '/builds/worker/artifacts/tmp/aln.rev', '-p', '/tmp/tmpgxzt5ktb']' died with <Signals.SIGKILL: 9>.
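
One option is to chunk the corpus and align each chunk separately, reusing the priors file so the per-chunk models stay roughly consistent. A minimal sketch, assuming the `Aligner.align` call from the eflomal Python bindings seen in the traceback above; the chunk size, helper name, and StringIO plumbing are illustrative, not from the pipeline:

```python
import io
import itertools

from eflomal import Aligner  # pip install eflomal

CHUNK_LINES = 10_000_000  # hypothetical; pick so one chunk fits in memory


def align_in_chunks(src_path, trg_path, priors_path, fwd_path, rev_path):
    """Align a large parallel corpus chunk by chunk to bound peak memory.

    Each chunk is aligned independently; the shared priors keep per-chunk
    models roughly consistent, and the forward/reverse link files are
    concatenated in corpus order so downstream steps see one corpus.
    """
    aligner = Aligner()
    with open(src_path) as src, open(trg_path) as trg, \
            open(fwd_path, "w") as fwd_out, open(rev_path, "w") as rev_out:
        chunk = 0
        while True:
            src_lines = list(itertools.islice(src, CHUNK_LINES))
            trg_lines = list(itertools.islice(trg, CHUNK_LINES))
            if not src_lines:
                break
            assert len(src_lines) == len(trg_lines), "corpora out of sync"
            fwd_part = f"{fwd_path}.part{chunk}"
            rev_part = f"{rev_path}.part{chunk}"
            # Re-open the priors each iteration since align() consumes the
            # file object.
            with open(priors_path) as priors:
                aligner.align(
                    io.StringIO("".join(src_lines)),
                    io.StringIO("".join(trg_lines)),
                    links_filename_fwd=fwd_part,
                    links_filename_rev=rev_part,
                    priors_input=priors,
                )
            # Stitch the per-chunk link files back together in corpus order.
            for part, out in ((fwd_part, fwd_out), (rev_part, rev_out)):
                with open(part) as f:
                    out.writelines(f)
            chunk += 1
```

The trade-off is that alignments near chunk boundaries don't benefit from corpus-wide statistics; strong priors (already used in this task) should mitigate that.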
eu9ene commented 2 weeks ago

I think proper tokenization can fix this. See #507
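
For context, the log above shows vocabulary sizes of ~22M source and ~15M target types, which is what untokenized text tends to produce (every punctuation-glued surface form mints a new type) and which drives eflomal's memory use. A minimal sketch of tokenizing before alignment, using sacremoses as an illustrative tokenizer; whether #507 settles on sacremoses is an assumption:

```python
from sacremoses import MosesTokenizer  # pip install sacremoses

tokenizer = MosesTokenizer(lang="en")


def tokenize_corpus(in_path: str, out_path: str) -> None:
    """Whitespace-tokenize one side of the corpus, line by line."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            fout.write(tokenizer.tokenize(line.rstrip("\n"), return_str=True) + "\n")


# "Hello, world!" -> "Hello , world !": the comma and bang become shared
# types across the whole corpus instead of inflating the vocabulary.
```

The catch is that the resulting alignment indices refer to tokenized positions, so they would need to be remapped if downstream steps consume the untokenized text.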