Closed aerickson closed 1 year ago
I've made generic-translations-gcp-googlecompute-2023-03-23t22-32-04z in the translations-sandbox project at 448a912.
Will do some testing.
The image starts and g-w tries to register in a pool (but my test pool doesn't exist yet). nvidia-smi isn't working. Missed the nvidia driver. Building a new image with nvidia drivers present.
Testing on nvidia T4 instance.
Got nvidia-smi working on a started instance.
Turned out to be a DKMS kernel module issue for the nvidia-driver (broken symlinks in kernel-headers). Building a new image...
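For anyone hitting the same DKMS failure, here is a minimal sketch of how dangling symlinks could be located before rebuilding the image. It is not the procedure used in this PR. The helper name `find_broken_symlinks` is hypothetical, and the `/usr/src` path in the usage comment is an assumption about where the kernel headers live on the image.

```python
import os


def find_broken_symlinks(root):
    """Yield paths under root that are symlinks whose target does not exist."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Symlinks to directories are listed in dirnames; dangling ones in filenames.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                yield path


# Example usage (the path is an assumption, adjust to the image's header location):
# for path in find_broken_symlinks("/usr/src"):
#     print(path, "->", os.readlink(path))
```

An empty result only rules out dangling links; it does not prove that the DKMS module built correctly.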
I've made a new image at 923e473 and nvidia-smi is happy on a T4 instance (and CUDA is installed). worker-runner exits after a while with:
Mar 24 23:39:28 instance-3 start-worker[515]: "message": "Worker pool translations/gpu does not exist\n\n---\n\n* method: registerWorker\n* errorCode: ResourceNotFound\n* statusCode: 404\n* time: 2023-03-24T23:39:29.012Z",
I think we're ready to make an image that ci-configuration will launch for testing.
generic-translations-gcp-googlecompute-2023-04-03t20-41-46z was built @ 57691a3
nvidia-smi still works. libcudnn* installed. singularity can start an image as ubuntu user.
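The manual checks above could be scripted as a small smoke test to run on a freshly booted instance. This is a rough sketch, not part of the image configuration in this PR. The names `run_check`, `CHECKS`, `run_all` and `main` are hypothetical. It assumes that `nvidia-smi` exits non-zero when the driver is not loaded and that cuDNN is registered with the dynamic linker cache. `singularity --version` only confirms the binary runs, which is weaker than starting an image as the ubuntu user.

```python
import subprocess
import sys


def run_check(argv, expect_in_output=None):
    """Pass if argv exits 0 and, optionally, its stdout contains a substring."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary, permission problem, or a hung command all count as failure.
        return False
    if result.returncode != 0:
        return False
    return expect_in_output is None or expect_in_output in result.stdout


# Hypothetical check list mirroring the manual verification in this thread.
CHECKS = {
    "nvidia driver": (["nvidia-smi"], None),
    "libcudnn": (["ldconfig", "-p"], "libcudnn"),
    "singularity": (["singularity", "--version"], None),
}


def run_all(checks=CHECKS):
    """Run every check and return a {name: passed} mapping."""
    return {name: run_check(argv, needle) for name, (argv, needle) in checks.items()}


def main():
    """Print one line per check; return 0 if all passed, 1 otherwise."""
    results = run_all()
    for name, ok in results.items():
        print(f"{'ok' if ok else 'FAIL'}: {name}", file=sys.stdout if ok else sys.stderr)
    return 0 if all(results.values()) else 1


# To use as a script: raise SystemExit(main())
```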
Built generic-translations-gcp-googlecompute-2023-04-27t21-49-37z at e282bac.
Built generic-translations-gcp-googlecompute-2023-05-02t22-17-24z at 66c09dc.
generic-translations-gcp-googlecompute-2023-05-02t23-49-50z built at db5dbfb.
generic-translations-gcp-googlecompute-2023-05-03t16-25-28z built at a77ab46.
We're getting green jobs with the latest image. Ready for review.
Create configuration for an Ubuntu 22.04 generic-worker image that installs CUDA and other tools for machine learning.
See https://mozilla-hub.atlassian.net/browse/RELOPS-500.
https://firefox-ci-tc.services.mozilla.com/worker-manager/translations-1%2Ft-linux-v100-gpu
Please squash merge.