mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

bicleaner-ai-classify intermittently fails to download fasttext model #631

Open eu9ene opened 5 months ago

eu9ene commented 5 months ago

https://firefox-ci-tc.services.mozilla.com/tasks/IVGFh-gRSaOmOGjkaJGqsw/runs/0/logs/public/logs/live.log

[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:07:38.234457: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:07:38.234747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14784 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:04.0, compute capability: 7.0
[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:07:38.858408: I external/local_tsl/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:07:42,211 - INFO - Arguments processed
[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:07:42,211 - INFO - Starting process
[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:08:56,024 - INFO - Finished
[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:08:56,025 - INFO - Total: 66366 rows
[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:08:56,025 - INFO - Elapsed time 73.81 s
[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:08:56,025 - INFO - Troughput: 899 rows/s
[task 2024-05-24T17:10:46.027Z] 2024-05-24 17:08:56,025 - INFO - Program finished
[fetches 2024-05-24T17:10:46.111Z] removing /home/ubuntu/tasks/task_171656999726273/fetches
[fetches 2024-05-24T17:10:47.257Z] finished
[taskcluster 2024-05-24T17:10:47.268Z]    Exit Code: 3
[taskcluster 2024-05-24T17:10:47.268Z]    User Time: 24m49.351441s
[taskcluster 2024-05-24T17:10:47.268Z]  Kernel Time: 2m25.053608s
[taskcluster 2024-05-24T17:10:47.268Z]    Wall Time: 10m49.577491702s
[taskcluster 2024-05-24T17:10:47.268Z]       Result: FAILED
[taskcluster 2024-05-24T17:10:47.268Z] === Task Finished ===
[taskcluster 2024-05-24T17:10:47.268Z] Task Duration: 10m49.579589245s
[taskcluster 2024-05-24T17:10:47.311Z] Uploading artifact public/build/XLEnt_v1_2.scored.zst from file /home/ubuntu/tasks/task_171656999726273/artifacts/XLEnt_v1_2.scored.zst with content encoding "identity", mime type "application/zstd" and expiry 2025-05-24T16:55:28.023Z
[taskcluster 2024-05-24T17:10:47.872Z] [mounts] Preserving cache: Moving "/home/ubuntu/tasks/task_171656999726273/checkouts" to "/home/ubuntu/caches/Ar7S0LJFR6yd4YAYtFVwCQ"
[taskcluster 2024-05-24T17:10:47.901Z] Uploading link artifact public/logs/live.log to artifact public/logs/live_backing.log with expiry 2025-05-24T16:55:28.023Z
[taskcluster:error] exit status 3

bhearsum commented 3 months ago

This task exited with status 3, which is simply re-reported at the end of the log.

I'm pretty sure the problem is this error, which occurs 3 times:

[task 2024-05-24T17:04:06.806Z] 2024-05-24 17:04:01,834 - WARNING - Downloading FastText model...
[task 2024-05-24T17:04:06.806Z] Traceback (most recent call last):
[task 2024-05-24T17:04:06.806Z]   File "/home/ubuntu/.local/bin/bicleaner-ai-classify", line 8, in <module>
[task 2024-05-24T17:04:06.806Z]     sys.exit(main())
[task 2024-05-24T17:04:06.806Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/bicleaner_ai/bicleaner_ai_classifier.py", line 119, in main
[task 2024-05-24T17:04:06.806Z]     perform_classification(args) # Main loop
[task 2024-05-24T17:04:06.806Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/bicleaner_ai/bicleaner_ai_classifier.py", line 108, in perform_classification
[task 2024-05-24T17:04:06.806Z]     nline = classify(args, args.input, args.output)
[task 2024-05-24T17:04:06.806Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/bicleaner_ai/classify.py", line 220, in classify
[task 2024-05-24T17:04:06.806Z]     hardrules = Hardrules(args)
[task 2024-05-24T17:04:06.806Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/hardrules/hardrules.py", line 115, in __init__
[task 2024-05-24T17:04:06.806Z]     self.fastspell_src = FastSpell(args.source_lang, mode="aggr")
[task 2024-05-24T17:04:06.806Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/fastspell/fastspell.py", line 76, in __init__
[task 2024-05-24T17:04:06.806Z]     self.download_fasttext()
[task 2024-05-24T17:04:06.806Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/fastspell/fastspell.py", line 93, in download_fasttext
[task 2024-05-24T17:04:06.806Z]     self.model = fasttext.load_model(ft_model_path)
[task 2024-05-24T17:04:06.806Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/fasttext/FastText.py", line 441, in load_model
[task 2024-05-24T17:04:06.806Z]     return _FastText(model_path=path)
[task 2024-05-24T17:04:06.806Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/fasttext/FastText.py", line 98, in __init__
[task 2024-05-24T17:04:06.806Z]     self.f.loadModel(model_path)
[task 2024-05-24T17:04:06.806Z] ValueError: /home/ubuntu/.local/lib/python3.10/site-packages/fastspell/lid.176.bin has wrong file format!

That script is run through GNU parallel, so three failed jobs would give exactly this exit status. Its man page describes the exit status as follows:

       1-100 Some  of  the jobs failed. The exit status gives the number of failed jobs. If Y% is used the
             exit status is the percentage of jobs that failed.

eu9ene commented 3 months ago

Sometimes the fastText model fails to download. I ran into this issue in OpusCleaner and fixed it by pre-downloading the model: https://github.com/mozilla/firefox-translations-training/blob/6b6b64999edee0e5bb5822bfb6d6e9d6a4e6c94f/pipeline/clean/opuscleaner/clean-corpus.sh#L34
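
A pre-download step could also validate the file before it is put in place, so that a bad download is retried instead of surfacing later as "has wrong file format!". The following is only a sketch under assumptions: the function names, the retry count, and the injectable `fetch` argument are hypothetical, and it is not what the linked script does. The magic number is the one fastText writes at the start of its binary models; checking it only covers the header, so a file truncated later would still pass.

```python
import os
import struct
import tempfile
import urllib.request

# Magic number at the start of fastText binary models
# (FASTTEXT_FILEFORMAT_MAGIC_INT32 in the fastText sources).
FASTTEXT_MAGIC = 793712314


def looks_like_fasttext_model(path):
    """Return True if the file starts with the fastText magic number."""
    with open(path, "rb") as f:
        header = f.read(4)
    return len(header) == 4 and struct.unpack("<i", header)[0] == FASTTEXT_MAGIC


def predownload_model(url, dest, attempts=3, fetch=urllib.request.urlretrieve):
    """Download to a temporary file, validate it, then move it into place."""
    dest_dir = os.path.dirname(os.path.abspath(dest))
    for _ in range(attempts):
        fd, tmp = tempfile.mkstemp(dir=dest_dir)
        os.close(fd)
        try:
            fetch(url, tmp)
            if looks_like_fasttext_model(tmp):
                # Readers never see a half-written file at `dest`.
                os.replace(tmp, dest)
                return dest
        except OSError:
            pass  # network or I/O error: fall through and retry
        finally:
            if os.path.exists(tmp):
                os.remove(tmp)
    raise RuntimeError(f"could not fetch a valid model from {url}")
```

Writing to a temporary file in the destination directory and renaming it keeps concurrent jobs from loading a partially written model, which may matter when several classify processes start at once.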

ZJaume commented 1 week ago

If it helps, you can run the fastspell-download command during installation; it will download the model into the Python path.
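
Since the download itself is what fails intermittently, an install step might wrap the command in a small retry loop. This is a sketch, not part of fastspell: `run_with_retries` and its parameters are hypothetical, and it assumes fastspell-download exits with a non-zero status when the download fails.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay_s=5.0):
    """Run cmd until it exits with 0; return the attempt that succeeded."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay_s)  # brief pause before the next try
    raise RuntimeError(f"{cmd!r} failed {attempts} times")


# Example for an install step (not executed here):
#   run_with_retries(["fastspell-download"])
```

Running this once while the environment is built means the parallel classify jobs find the model already in place and never trigger their own downloads.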