mozilla-platform-ops / monopacker

builds taskcluster worker images for AWS and GCP using packer
Mozilla Public License 2.0

vm for translations project #106

Closed aerickson closed 1 year ago

aerickson commented 1 year ago

Create configuration for an Ubuntu 22.04 generic-worker image that installs CUDA and other tools for machine learning.

See https://mozilla-hub.atlassian.net/browse/RELOPS-500.

https://firefox-ci-tc.services.mozilla.com/worker-manager/translations-1%2Ft-linux-v100-gpu

Please squash merge.

aerickson commented 1 year ago

I've made generic-translations-gcp-googlecompute-2023-03-23t22-32-04z in the translations-sandbox project at 448a912.

Will do some testing.

aerickson commented 1 year ago

The image starts and generic-worker (g-w) tries to register in a pool (but my test pool doesn't exist yet). nvidia-smi isn't working because I missed the nvidia driver. Building a new image with the nvidia drivers present.

Testing on nvidia T4 instance.

aerickson commented 1 year ago

Got nvidia-smi working on a started instance.

Turned out to be a DKMS kernel module issue for the nvidia-driver (broken symlinks in kernel-headers). Building a new image...

aerickson commented 1 year ago

I've made a new image at 923e473 and nvidia-smi is happy on a T4 instance (and CUDA is installed). worker-runner exits after a while with:

Mar 24 23:39:28 instance-3 start-worker[515]:   "message": "Worker pool translations/gpu does not exist\n\n---\n\n* method:     registerWorker\n* errorCode:  ResourceNotFound\n* statusCode: 404\n* time:       2023-03-24T23:39:29.012Z",

I think we're ready to make an image that ci-configuration will launch for testing.
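
For anyone triaging similar start-worker failures, here is a minimal sketch of pulling the fields out of the "message" string quoted in this comment. The helper name parse_tc_error is hypothetical, and the layout it expects (a summary, a `---` separator, then `* key: value` lines) is inferred only from this one log line, so treat it as an assumption rather than a documented format.

```python
import re


def parse_tc_error(message: str) -> dict:
    """Split an error message like the one logged above into its parts.

    Assumes the layout seen in this thread: a summary line, a '---'
    separator, then '* key: value' lines. Hypothetical helper.
    """
    summary, _, details = message.partition("\n\n---\n\n")
    fields = {}
    for line in details.splitlines():
        match = re.match(r"\*\s+(\w+):\s+(.*)", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return {"summary": summary.strip(), **fields}


# The message string from the log line in this comment.
msg = (
    "Worker pool translations/gpu does not exist\n\n---\n\n"
    "* method:     registerWorker\n"
    "* errorCode:  ResourceNotFound\n"
    "* statusCode: 404\n"
    "* time:       2023-03-24T23:39:29.012Z"
)
info = parse_tc_error(msg)
print(info["errorCode"], info["statusCode"])
```

A ResourceNotFound from registerWorker here points at the missing pool rather than at the image itself, which matches the reading above.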

aerickson commented 1 year ago

generic-translations-gcp-googlecompute-2023-04-03t20-41-46z was built @ 57691a3

nvidia-smi still works. libcudnn* installed. singularity can start an image as ubuntu user.
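
A rough sketch of how the manual checks above could be partly scripted on a started instance. It only verifies that the executables are on PATH; it does not confirm that the driver loads or that a container starts. The tool list and the function name missing_tools are assumptions for illustration and are not part of monopacker.

```python
import shutil

# Assumption: executables checked by hand in this thread.
REQUIRED_TOOLS = ["nvidia-smi", "singularity"]


def missing_tools(tools):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
print("missing tools:", ", ".join(missing) if missing else "none")
```

Running nvidia-smi itself on a GPU instance is still needed to catch driver problems like the earlier DKMS one.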

aerickson commented 1 year ago

Built generic-translations-gcp-googlecompute-2023-04-27t21-49-37z at e282bac.

aerickson commented 1 year ago

Built generic-translations-gcp-googlecompute-2023-05-02t22-17-24z at 66c09dc.

Changes: https://github.com/aerickson/monopacker/compare/e282bac875228a17b1ea04101332931d2b95db9f..aerickson:monopacker:66c09dc85f42cc0d26d4d74bcc496cc4b50dea74

aerickson commented 1 year ago

generic-translations-gcp-googlecompute-2023-05-02t23-49-50z built at db5dbfb.

aerickson commented 1 year ago

generic-translations-gcp-googlecompute-2023-05-03t16-25-28z built at a77ab46.
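
Since several images were built in quick succession, a small sketch for picking the newest one by the timestamp embedded in its name may help. It assumes the suffix seen in this thread (YYYY-MM-DDtHH-MM-SSz) and that the trailing z means UTC; image_build_time is a hypothetical helper.

```python
import re
from datetime import datetime, timezone


def image_build_time(name: str) -> datetime:
    """Parse the build timestamp from an image name's suffix.

    Assumes the 'YYYY-MM-DDtHH-MM-SSz' suffix used in this thread and
    treats the trailing 'z' as UTC (an assumption).
    """
    match = re.search(r"(\d{4}-\d{2}-\d{2}t\d{2}-\d{2}-\d{2})z$", name)
    if match is None:
        raise ValueError(f"no timestamp suffix in {name!r}")
    parsed = datetime.strptime(match.group(1), "%Y-%m-%dt%H-%M-%S")
    return parsed.replace(tzinfo=timezone.utc)


# Image names mentioned in this thread.
images = [
    "generic-translations-gcp-googlecompute-2023-03-23t22-32-04z",
    "generic-translations-gcp-googlecompute-2023-04-03t20-41-46z",
    "generic-translations-gcp-googlecompute-2023-04-27t21-49-37z",
    "generic-translations-gcp-googlecompute-2023-05-02t22-17-24z",
    "generic-translations-gcp-googlecompute-2023-05-02t23-49-50z",
    "generic-translations-gcp-googlecompute-2023-05-03t16-25-28z",
]
latest = max(images, key=image_build_time)
print(latest)
```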

aerickson commented 1 year ago

We're getting green jobs with the latest image. Ready for review.