iterative / terraform-provider-iterative

☁️ Terraform plugin for machine learning workloads: spot instance recovery & auto-termination | AWS, GCP, Azure, Kubernetes
https://registry.terraform.io/providers/iterative/iterative/latest/docs
Apache License 2.0

Docker credential helper for AWS ECR not being installed #627

0x2b3bfa0 closed this issue 2 years ago

0x2b3bfa0 commented 2 years ago

The following commands in the provisioning script fail due to some unrecognized options on systemd 237, used by our default images:

https://github.com/iterative/terraform-provider-iterative/blob/28ce78188618771e3338ef4373b295c6a8e85f2b/environment/setup.sh#L21-L23

$ sudo systemd-run --same-dir --no-block --service-type=exec bash -c "$get_ecr_helper && $chmod_ecr_helper"
systemd-run: unrecognized option '--same-dir'
$ sudo systemd-run --no-block --service-type=exec bash -c "$get_ecr_helper && $chmod_ecr_helper"
Failed to start transient service unit: Invalid Type setting: exec
$ sudo systemd-run --no-block bash -c "$get_ecr_helper && $chmod_ecr_helper"
Running as unit: run-r1f026f250c4e4f2c8d0043d6e700a436.service

The --same-dir option was introduced in systemd 240 (https://github.com/systemd/systemd/pull/10887), and the exec service type was also added after 237.
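One way to keep the script working on both old and new images is to probe for the option instead of assuming it exists. The following is only a sketch: `has_option` and `systemd_run_args` are hypothetical helper names, and it assumes that `systemd-run --help` lists `--same-dir` on versions that support it. It drops `--service-type=exec` altogether, since the third invocation above shows the command running without it.

```shell
# has_option HELP_TEXT OPTION
# Succeeds when OPTION appears literally in HELP_TEXT.
has_option() {
  printf '%s\n' "$1" | grep -qF -- "$2"
}

# systemd_run_args HELP_TEXT
# Prints the systemd-run arguments to use, adding --same-dir only when
# the supplied help text advertises it (hypothetical helper).
systemd_run_args() {
  help_text="$1"
  args="--no-block"
  if has_option "$help_text" "--same-dir"; then
    args="--same-dir $args"
  fi
  echo "$args"
}

# Possible usage on a real machine (not executed here):
#   sudo systemd-run $(systemd_run_args "$(systemd-run --help)") bash -c "..."
```

Probing the help text avoids hard-coding a version threshold, at the cost of depending on the help output format.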

dacbd commented 2 years ago

Odd, since that command runs just fine for me during the provisioning of instances. Perhaps there is a mismatch between cloud providers; it most definitely works on GCP, where it was introduced to keep the creation of the machine from timing out, since the download was taking 7-15 mins.

dacbd commented 2 years ago

I can grep the syslog for the transient service and see that it has no issue:

$ journalctl -f -u run-r7125ca7a1551459492c19434aafdcc4a.service
-- Logs begin at Wed 2022-07-20 05:16:23 UTC. --
Jul 20 05:18:27 cml-3w7qrqs284 systemd[1]: Starting /usr/bin/bash -c curl https://amazon-ecr-credential-helper-releases.s3.us-east-2.amazonaws.com/0.5.0/linux-amd64/docker-credential-ecr-login --output /usr/bin/docker-credential-ecr-login && chmod 755 /usr/bin/docker-credential-ecr-login...
Jul 20 05:18:27 cml-3w7qrqs284 bash[11859]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Jul 20 05:18:27 cml-3w7qrqs284 bash[11859]:                                  Dload  Upload   Total   Spent    Left  Speed
Jul 20 05:18:27 cml-3w7qrqs284 systemd[1]: Started /usr/bin/bash -c curl https://amazon-ecr-credential-helper-releases.s3.us-east-2.amazonaws.com/0.5.0/linux-amd64/docker-credential-ecr-login --output /usr/bin/docker-credential-ecr-login && chmod 755 /usr/bin/docker-credential-ecr-login.
Jul 20 05:18:28 cml-3w7qrqs284 bash[11859]: [237B blob data]
Jul 20 05:18:28 cml-3w7qrqs284 systemd[1]: run-r7125ca7a1551459492c19434aafdcc4a.service: Succeeded.

dacbd commented 2 years ago

It does appear that there is a system mismatch:

AWS: 18.04 https://github.com/iterative/terraform-provider-iterative/blob/28ce78188618771e3338ef4373b295c6a8e85f2b/iterative/aws/provider.go#L86
Azure: 18.04 https://github.com/iterative/terraform-provider-iterative/blob/28ce78188618771e3338ef4373b295c6a8e85f2b/iterative/azure/provider.go#L45
GCP: 20.04 https://github.com/iterative/terraform-provider-iterative/blob/28ce78188618771e3338ef4373b295c6a8e85f2b/iterative/gcp/provider.go#L59
k8s: using our container based on 20.04 https://github.com/iterative/cml/blob/2acfde589f4b435c5b9adf3010ef773e71a060af/Dockerfile#L1

Ubuntu ends main updates for 18.04 in less than a year, if I am reading their chart correctly. Perhaps it's time for an update?

0x2b3bfa0 commented 2 years ago

> Perhaps it's time for an update?

Perhaps yes.

Regardless, the --same-dir option is unnecessary for this use case, and --service-type=exec can be safely omitted so that the unit falls back to the simple type, with a similar effect. Can we remove those options?
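For reference, the invocation without those two options could look roughly like the sketch below. The `curl` and `chmod` commands are copied from the journal output quoted earlier in this thread; `install_ecr_helper` and the `RUNNER` override are hypothetical names added only so that the command line can be printed without actually running it.

```shell
# Download and chmod commands, as they appear in the journal output above.
helper_url="https://amazon-ecr-credential-helper-releases.s3.us-east-2.amazonaws.com/0.5.0/linux-amd64/docker-credential-ecr-login"
helper_path="/usr/bin/docker-credential-ecr-login"
get_ecr_helper="curl $helper_url --output $helper_path"
chmod_ecr_helper="chmod 755 $helper_path"

# install_ecr_helper (hypothetical wrapper)
# Starts the download as a transient unit without waiting for it.
# Set RUNNER=echo to print the arguments instead of executing them.
install_ecr_helper() {
  ${RUNNER:-sudo systemd-run} --no-block bash -c "$get_ecr_helper && $chmod_ecr_helper"
}

# Dry run (prints the arguments only):
#   RUNNER=echo install_ecr_helper
```

Because no service type is passed, systemd-run uses its default type, which is what the successful third invocation in the issue description relied on.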

dacbd commented 2 years ago

Agreed, --same-dir is not required.

0x2b3bfa0 commented 2 years ago

Example

0x2b3bfa0 commented 2 years ago

Updated the example above to use credHelpers instead of credsStore

dacbd commented 2 years ago

I'll give this a test as well.

dacbd commented 2 years ago

it seems {"credHelpers": {"ACCOUNT.dkr.ecr.REGION.amazonaws.com": "ecr-login"}} was the missing part.

variables:
  AWS_DEFAULT_REGION: "us-west-1"
  AWS_REGISTRY: "342840881361.dkr.ecr.us-west-1.amazonaws.com"
stages:
  - deploy
  - train
deploy_job:
  stage: deploy
  when: always
  image: iterativeai/cml
  script:
    - cml-runner
      --cloud aws
      --cloud-region us-west-1
      --cloud-type t2.micro
      --labels=cml-runner

train_job:
  stage: train
  when: on_success
  needs: [deploy_job]
  image: 342840881361.dkr.ecr.us-west-1.amazonaws.com/temp2:latest
  tags:
    - cml-runner
  script:
    - apt update && apt install -y awscli
    - echo "hello"

This worked without error (the temp2 image is the latest cml image -> ghcr.io/iterative/cml:latest).

I'll note that it appears that GitLab's hosted runner cannot use private registries?
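In case it helps anyone reproducing this, the credHelpers entry mentioned at the start of this comment can be written with something like the sketch below. `write_ecr_cred_helper` is a hypothetical helper, and the target directory is an assumption: Docker's client normally reads `config.json` from `~/.docker`, but the runner may be configured differently. Note that this overwrites any existing `config.json` in that directory rather than merging into it.

```shell
# write_ecr_cred_helper CONFIG_DIR REGISTRY (hypothetical helper)
# Writes a Docker client config that maps REGISTRY to the ecr-login
# credential helper (the docker-credential-ecr-login binary).
# Overwrites CONFIG_DIR/config.json if it already exists.
write_ecr_cred_helper() {
  config_dir="$1"
  registry="$2"
  mkdir -p "$config_dir"
  printf '{"credHelpers": {"%s": "ecr-login"}}\n' "$registry" > "$config_dir/config.json"
}

# Possible usage (placeholders as in the comment above):
#   write_ecr_cred_helper "$HOME/.docker" "ACCOUNT.dkr.ecr.REGION.amazonaws.com"
```

Using credHelpers scopes the helper to the one ECR registry, so pulls from other registries keep using whatever credentials are already configured.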