mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Many random errors on CPU generic workers #630

Open eu9ene opened 1 month ago

eu9ene commented 1 month ago

Example runs:

https://firefox-ci-tc.services.mozilla.com/tasks/groups/b8_PxNNiSKWyuugT0V8faA
https://firefox-ci-tc.services.mozilla.com/tasks/groups/BFgoyXuvRHe0kK4J9d-Lgw
https://firefox-ci-tc.services.mozilla.com/tasks/groups/LphXRhZoScGYSNAPKCiVfQ

See #629

It might be related to #549

bhearsum commented 5 days ago

The common thread here seems to be running out of some sort of resource. Examples include:

[task 2024-05-23T23:56:57.158Z] [34/12:fasttext_filter] OpenBLAS blas_thread_init: pthread_create failed for thread 22 of 32: Resource temporarily unavailable
[task 2024-05-24T04:15:45.087Z] [24/4:deescape-special-chars] Error: can't start new thread
[task 2024-05-24T01:12:29.457Z] [81/12:fasttext_filter] OpenBLAS blas_thread_init: pthread_create failed for thread 22 of 32: Resource temporarily unavailable
[task 2024-05-24T01:12:29.457Z] [81/12:fasttext_filter] OpenBLAS blas_thread_init: RLIMIT_NPROC 1031641 current, 1031641 max
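
To get a rough idea of which pipeline steps hit these thread-creation failures in a downloaded task log, something like the following sketch could help. The line format is inferred only from the excerpts above, the two failure markers are simply the strings seen in those excerpts, and `count_thread_failures` is a hypothetical helper name, not part of the pipeline.

```python
import re
from collections import Counter

# Assumed line shape, based on the excerpts above:
# [task <timestamp>] [<prefix>:<step>] <message>
LINE_RE = re.compile(r"^\[task [^\]]+\] \[[^\]:]*:(?P<step>[^\]]+)\] (?P<message>.*)$")

# Substrings taken from the example log lines; other failure messages may exist.
FAILURE_MARKERS = ("pthread_create failed", "can't start new thread")


def count_thread_failures(lines):
    """Count thread-creation failure lines per step name."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and any(marker in match["message"] for marker in FAILURE_MARKERS):
            counts[match["step"]] += 1
    return counts
```

Note that the RLIMIT_NPROC line is not counted as a failure here; it reports a per-user process limit of over a million, which suggests the limit being hit is somewhere else.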

I suspect podman is enforcing some resource limits. For example, by default there's a limit of 2048 pids in the container:

--pids-limit=limit

Tune the container’s pids limit. Set to -1 to have unlimited pids for the container. The default is 2048 on systems that support “pids” cgroup controller.

(From https://docs.podman.io/en/latest/markdown/podman-run.1.html.)

I'm not sure if we're doing this intentionally or not; I'll look into it.
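
One way to check whether a pids limit is actually in effect is to read the pids cgroup controller files from inside the task container. The sketch below assumes the usual Linux locations (`/sys/fs/cgroup` for cgroup v2, `/sys/fs/cgroup/pids` for cgroup v1); whether those paths are visible inside the worker's container is an assumption, and `parse_limit` and `read_pids_status` are hypothetical helper names.

```python
from pathlib import Path
from typing import Optional, Tuple

# Candidate directories for the pids controller files (cgroup v2 first, then v1).
# These are common defaults and are an assumption about the worker environment.
CANDIDATE_DIRS = (Path("/sys/fs/cgroup"), Path("/sys/fs/cgroup/pids"))


def parse_limit(text: str) -> Optional[int]:
    """Return the limit as an int, or None when the kernel reports "max" (no limit)."""
    value = text.strip()
    return None if value == "max" else int(value)


def read_pids_status(cgroup_dir: Path) -> Optional[Tuple[int, Optional[int]]]:
    """Return (current, limit) read from pids.current and pids.max, or None if absent."""
    current_file = cgroup_dir / "pids.current"
    max_file = cgroup_dir / "pids.max"
    if not (current_file.is_file() and max_file.is_file()):
        return None
    current = int(current_file.read_text().strip())
    return current, parse_limit(max_file.read_text())


if __name__ == "__main__":
    for directory in CANDIDATE_DIRS:
        status = read_pids_status(directory)
        if status is not None:
            current, limit = status
            shown = "unlimited" if limit is None else limit
            print(f"{directory}: {current} pids in use, limit {shown}")
            break
    else:
        print("no pids cgroup controller files found")
```

If the reported limit is 2048 and the count in use is close to it when a task fails, that would be consistent with the podman default described above.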

bhearsum commented 1 day ago

https://github.com/taskcluster/taskcluster/issues/7120 has been filed for this.

bhearsum commented 5 hours ago

taskcluster/taskcluster#7120 has been filed for this.

This is fixed. We need to wait for it to be released and pick up a new version of generic-worker for the CPU workers before we can call this ticket fixed. I may wait on updating that image until some of the other things blocking generic-worker for CPU workers are dealt with, in case we have other fixes that need to get into the image.