unity-sds / unity-sps

The Unity SDS Processing Service facilitates large-scale data processing for scientific workflows.
Apache License 2.0

Demonstrate use of ECR within an Airflow DAG #186

Open LucaCinquini opened 3 months ago

LucaCinquini commented 3 months ago

Using Docker images pulled from ECR would enable higher scalability when executing workloads. We want to demonstrate how this is possible using a simple CWL workflow.

LucaCinquini commented 3 months ago

Suggested steps:

- Manually create an ECR repository on unity-venue-dev

nikki-t commented 3 months ago

It looks like this might be a good starting place for using containers with CWL: https://www.commonwl.org/user_guide/topics/using-containers.html

LucaCinquini commented 3 months ago

For security reasons, we should store the ECR endpoint (which contains the account number) as an Airflow variable, then within the DAG retrieve it and pass it to echo_message.cwl as an additional parameter.
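A minimal sketch of that pattern (the variable name ecr_uri and the input names are assumptions, not the final implementation):

from airflow.models import Variable

# Read the ECR endpoint stored as an Airflow variable, keeping the
# account number out of the DAG source and the CWL file.
ecr_uri = Variable.get("ecr_uri")  # variable name is an assumption

# Pass it to echo_message.cwl as an ordinary CWL input parameter.
cwl_inputs = {
    "message": "hello from ECR",  # existing workflow input (assumed)
    "ecr_uri": ecr_uri,           # new parameter carrying the registry endpoint
}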

mike-gangl commented 3 months ago

Would it be possible to use an existing application package? I'd rather test with something real instead of something like hello world.

I can supply the existing docker containers and CWL for our example application, all we'd need to do is update the pointers to docker containers in the CWL to the ECR ones.

LucaCinquini commented 3 months ago

Let's do both... Simple "echo" first, application package after.

nikki-t commented 3 months ago

I was able to implement this partway, but CWL does not allow input parameters in the dockerPull field. See this documentation for a list.

I am hoping that I can override the workflow requirements at load time but am not quite sure this will work.

LucaCinquini commented 3 months ago

@nikki-t: using the override might work, but we would also have to change the script that runs the CWL workflow. Another possibility would be to update the setup task.

nikki-t commented 3 months ago

@LucaCinquini - I was able to get the override to work without needing to change the script that runs the CWL workflow by converting the CWL argument YAML into JSON and including the override there.
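A sketch of that conversion (file names are hypothetical; echo_message.cwl is the workflow from the earlier comment):

import json
import yaml

# The ECR image URI, e.g. read from an Airflow variable as discussed above.
ecr_uri = "123456789012.dkr.ecr.us-west-2.amazonaws.com/my-image:latest"

# Load the original CWL job arguments (YAML).
with open("echo_message_args.yml") as f:
    job = yaml.safe_load(f)

# Attach the load-time override so DockerRequirement.dockerPull points at ECR.
job["cwltool:overrides"] = {
    "echo_message.cwl": {
        "requirements": {"DockerRequirement": {"dockerPull": ecr_uri}}
    }
}

# cwl-runner also accepts JSON job files, so write the combined document as JSON.
with open("echo_message_args.json", "w") as f:
    json.dump(job, f, indent=2)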

nikki-t commented 3 months ago

The CWL task cannot pull the private container image from ECR. The CWL task launches a pod that runs the SPS Docker CWL image. The pod gets the airflow-worker service account, which has an IAM role (AirflowWorkerPolicy) associated with it. This role has permissions to access ECR:

            "ecr:GetAuthorizationToken",
            "ecr:BatchCheckLayerAvailability",
            "ecr:GetDownloadUrlForLayer",
            "ecr:BatchGetImage",

But I get the following error:

[2024-08-19, 18:30:56 UTC] {pod_manager.py:468} INFO - [base] INFO /usr/share/cwl/venv/bin/cwl-runner 3.1.20240508115724 
[2024-08-19, 18:30:56 UTC] {pod_manager.py:468} INFO - [base] Error: No such object: xxx.dkr.ecr.us-west-2.amazonaws.com/unity-nikki-1-dev-sps-busybox 
[2024-08-19, 18:30:56 UTC] {pod_manager.py:468} INFO - [base] INFO ['docker', 'pull', xxx.dkr.ecr.us-west-2.amazonaws.com/unity-nikki-1-dev-sps-busybox']
[2024-08-19, 18:30:56 UTC] {pod_manager.py:468} INFO - [base] Using default tag: latest 
[2024-08-19, 18:30:56 UTC] {pod_manager.py:468} INFO - [base] Error response from daemon: Head "https://xxx.dkr.ecr.us-west-2.amazonaws.com/v2/unity-nikki-1-dev-sps-busybox/manifests/latest": no basic auth credentials 
[2024-08-19, 18:30:56 UTC] {pod_manager.py:468} INFO - [base] ERROR Workflow or tool uses unsupported feature: 
[2024-08-19, 18:30:57 UTC] {pod_manager.py:486} INFO - [base] Docker is required to run this tool: Command '['docker', 'pull', xxx.dkr.ecr.us-west-2.amazonaws.com/unity-nikki-1-dev-sps-busybox']' returned non-zero exit status 1.

I think that the task may need to log into the ECR repository, but there isn't a clear way to make this happen before the docker pull. Perhaps we can log in within the entrypoint script?
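One way to sketch that login with boto3 (assuming the pod's IAM role grants ecr:GetAuthorizationToken, as listed above):

import base64
import subprocess

import boto3

# Fetch a temporary ECR auth token using the pod's IAM credentials.
ecr = boto3.client("ecr", region_name="us-west-2")
auth = ecr.get_authorization_token()["authorizationData"][0]

# The token decodes to "AWS:<password>".
username, password = base64.b64decode(auth["authorizationToken"]).decode().split(":", 1)
registry = auth["proxyEndpoint"]  # e.g. https://<account>.dkr.ecr.us-west-2.amazonaws.com

# Log the Docker daemon into ECR so the subsequent "docker pull" succeeds.
subprocess.run(
    ["docker", "login", "--username", username, "--password-stdin", registry],
    input=password.encode(),
    check=True,
)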

It looks like the CWL DockerRequirement field can build a docker image using dockerFile: https://www.commonwl.org/v1.2/CommandLineTool.html#DockerRequirement

nikki-t commented 3 months ago

After reading the documentation on IAM roles for service accounts and on the Amazon EKS Pod Identity Agent, it looks like the following steps are needed to connect a service account to an IAM role and use the EKS Pod Identity Agent:

1. Create an IAM OIDC provider for the cluster.
2. Set up the Amazon EKS Pod Identity Agent.
   a. The node role must have permission to perform the AssumeRoleForPodIdentity action in the EKS Auth API.
3. Assign an IAM role to the Kubernetes service account.
   a. The role must have a trust relationship:

   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Federated": "arn:aws:iam::$account_id:oidc-provider/$oidc_provider"
         },
         "Action": "sts:AssumeRoleWithWebIdentity",
         "Condition": {
           "StringEquals": {
             "$oidc_provider:aud": "sts.amazonaws.com",
             "$oidc_provider:sub": "system:serviceaccount:$namespace:$service_account"
           }
         }
       }
     ]
   }

4. Annotate the service account with the IAM role.
5. Create an EKS Pod Identity association with an IAM role whose trust policy allows the service account to assume it via EKS Pod Identity (see the boto3 sketch after this list):

   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
         "Effect": "Allow",
         "Principal": {
           "Service": "pods.eks.amazonaws.com"
         },
         "Action": [
           "sts:AssumeRole",
           "sts:TagSession"
         ]
       }
     ]
   }

6. Configure pods to access AWS services with the service account.
   a. Define serviceAccountName in the pod spec (YAML).
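A minimal boto3 sketch of the Pod Identity association in step 5 (cluster, namespace, account, and role names are hypothetical placeholders):

import boto3

eks = boto3.client("eks", region_name="us-west-2")

# Associate the airflow-worker service account with the IAM role via
# EKS Pod Identity (all names below are placeholders).
eks.create_pod_identity_association(
    clusterName="unity-venue-dev",
    namespace="sps",
    serviceAccount="airflow-worker",
    roleArn="arn:aws:iam::123456789012:role/AirflowWorkerRole",
)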

After putting all of these together, I am still running into the "no basic auth credentials" error. I am going to pivot to trying to log into Docker in the entrypoint script.

LucaCinquini commented 3 months ago

I agree... let's see if the other solution works, thanks.

nikki-t commented 2 months ago

I was able to use an ECR URI in the execution of a CWL DAG. I just need to finalize how we store the ECR URI. It is currently stored as an Airflow variable with a key of cwl_dag_ecr_uri, which is what the cwl_dag_ecr.py script looks for in order to pass the URI to the CWL DAG. The end user will need to create a variable with this name in the Airflow UI and assign it the URI of their ECR container image. We could also pass it in as an input parameter to the DAG, but then the user would need to enter it every time.

@LucaCinquini - What do you think is best?

LucaCinquini commented 2 months ago

@nikki-t: what if we store the AWS ECR base name (XXXXXXXXXX.dkr.ecr.us-west-2.amazonaws.com) as an Airflow variable, so it can be used by multiple DAGs? Then the specific "cwl_dag_ecr.py" would combine that with the name and version of the CWL Docker image. Other DAGs might use ECR to execute different Docker images.

nikki-t commented 2 months ago

@LucaCinquini - I think we would still need input parameters or Airflow variables for the name and version of the ECR container image. The full URI is:

<aws_account_id>.dkr.ecr.<aws_region>.amazonaws.com/<container_image_name>:<version>.

So the user would have to provide the <container_image_name> and <version>, since these might differ between executions of the DAG. But we can certainly store the AWS ECR base name as an Airflow variable, which we might be able to define in the Helm chart for Airflow.
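In that scheme the DAG script only needs a simple join; a sketch (the variable name and image values are hypothetical):

from airflow.models import Variable

# Base registry endpoint shared by all DAGs, possibly set once via the Helm chart.
ecr_base = Variable.get("ecr_base_uri")  # "<aws_account_id>.dkr.ecr.<aws_region>.amazonaws.com"

# Image name and version supplied by the user per DAG run.
image_name = "unity-nikki-1-dev-sps-busybox"
version = "latest"

ecr_uri = f"{ecr_base}/{image_name}:{version}"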

LucaCinquini commented 2 months ago

Sorry, I didn't express myself correctly. The "cwl_dag_ecr.py" should specify the name of this Docker image: ghcr.io/unity-sds/unity-sps/sps-docker-cwl-ecr:latest, which it already does. The name and tag of the algorithm image should be part of the specific CWL workflow, then prepended with the ECR base URI. I think we should investigate CWL's support for using Docker images from a private repository.

nikki-t commented 2 months ago

Here is the documentation on using containers with CWL: https://www.commonwl.org/user_guide/topics/using-containers.html. CWL does not allow parameter references in the dockerPull key, so the ECR URI needs to be built and passed to the CWL DAG definition in the cwl_dag_ecr.py script.

So I think it makes the most sense to have the user specify the ECR URI via Airflow variable or input parameter. And to keep things as simple as possible, we can store the ECR base URI (which is also the login URL) and then the user can pass in their container image name and version as input parameters.

LucaCinquini commented 2 months ago

Realized that the current solution won't work for nested workflows; exploring different options.

LucaCinquini commented 2 months ago

Using "cwltool --pack" would create a single CWL document from the top level and nested workflows, which can then be parsed and modified as needed. Unfortunately, this does not seem to work with some of our workflows that are stored in Dockstore.

For example, this command works fine:

cwltool --pack https://raw.githubusercontent.com/common-workflow-language/cwltool/main/tests/wf/nested.cwl

But packing one of our Dockstore workflows gives an error:

cwltool --debug --pack https://raw.githubusercontent.com/unity-sds/sbg-workflows/main/L1-to-L2-e2e.cwl

ERROR I'm sorry, I couldn't load this CWL file. The error was:
Traceback (most recent call last):
  File "/Users/cinquini/PycharmProjects/unity-sps/.venv/lib/python3.9/site-packages/cwltool/main.py", line 1114, in main
    print(print_pack(loadingContext, uri), file=stdout)
  File "/Users/cinquini/PycharmProjects/unity-sps/.venv/lib/python3.9/site-packages/cwltool/main.py", line 633, in print_pack
    packed = pack(loadingContext, uri)
  File "/Users/cinquini/PycharmProjects/unity-sps/.venv/lib/python3.9/site-packages/cwltool/pack.py", line 265, in pack
    v = rewrite_inputs[r]
KeyError: 'http://awslbdockstorestack-lb-1429770210.us-west-2.elb.amazonaws.com:9998/api/ga4gh/trs/v2/tools/%23workflow%2Fdockstore.org%2Fmike-gangl%2FSBG-unity-fractional-cover/versions/2/PLAIN-CWL/descriptor/%2FDockstore.cwl#sbg-unity-fractional-cover'

Probably because of the "#" in the URL. Can that be avoided?

LucaCinquini commented 2 months ago

Nikki: could we percent-encode the "#" sign in the Dockstore URLs? Maybe:

http://awslbdockstorestack-lb-1429770210.us-west-2.elb.amazonaws.com:9998/api/ga4gh/trs/v2/tools/%23workflow%2Fdockstore.org%2Fmike-gangl%2FSBG-unity-fractional-cover/versions/2/PLAIN-CWL/descriptor/%2FDockstore.cwl%23sbg-unity-fractional-cover

I am going to do a little testing this morning (9/3).
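For reference, the encoding itself is simple in Python (a sketch; whether cwltool then resolves the encoded reference correctly is the open question):

from urllib.parse import quote

run_ref = "Dockstore.cwl#sbg-unity-fractional-cover"
# Encode "#" as %23 so it is no longer parsed as a fragment separator.
print(quote(run_ref))  # Dockstore.cwl%23sbg-unity-fractional-cover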

nikki-t commented 2 months ago

@LucaCinquini - I see that the above is a little more complicated than I initially thought. I have run the cwltool --pack --debug command and it does seem to get snagged at the # in the name of this step to run: http://awslbdockstorestack-lb-1429770210.us-west-2.elb.amazonaws.com:9998/api/ga4gh/trs/v2/tools/%23workflow%2Fdockstore.org%2Fmike-gangl%2FSBG-unity-fractional-cover/versions/2/PLAIN-CWL/descriptor/%2FDockstore.cwl

Looking at the CWL definition file that it is referencing, there is a # in the definition that gets packed: http://awslbdockstorestack-lb-1429770210.us-west-2.elb.amazonaws.com:9998/api/ga4gh/trs/v2/tools/%23workflow%2Fdockstore.org%2Fmike-gangl%2FSBG-unity-fractional-cover/versions/2/PLAIN-CWL/descriptor/%2FDockstore.cwl#sbg-unity-fractional-cover

I tried running cwltool --pack --debug on my local system, saving the above as local files, and can reproduce the error. When I remove the # from the run key #sbg-unity-fractional-cover, it then tries to pack a file that doesn't exist. So it seems the # indicates another workflow and might be a relative path?

I also found a few open issues for cwltool that may explain what we are running into.

LucaCinquini commented 2 months ago

Maybe we can ask @mike-gangl if that # can be avoided when writing CWL workflows and storing them in Dockstore. Or maybe we store them in GitHub?

nikki-t commented 2 months ago

@LucaCinquini - I think that makes sense. @mike-gangl did mention that long term there will be a docker registry owned by the system and this would remove account numbers from the CWL definition files.

I also did a test to see if the overrides would populate a nested workflow and I was able to run a successful (small) test using this workflow file: https://raw.githubusercontent.com/unity-sds/unity-sps-workflows/186-ecr-cwl-dag/demos/echo_message_ecr_nested.cwl

I can confirm that the following override gets passed to the nested workflow, and the Docker container is pulled and executed.

"cwltool:overrides": {
            "https://raw.githubusercontent.com/unity-sds/unity-sps-workflows/186-ecr-cwl-dag/demos/echo_message_ecr.cwl": {
                "requirements": {"DockerRequirement": {"dockerPull": ecr_uri}}
            }
        }

I did not update the Python script that executes the CWL DAGs. I think we should regroup on the requirements so that we can figure out a true test of the nested-workflow overrides, since it seems like there are nested, nested workflows.

It would also be good to understand how the end user will pass in the nested workflow and the best way to determine the list of nested workflows. Maybe we can also tag up on the context for these in the larger Unity SDS system?

nikki-t commented 2 months ago

Documenting solutions tried and results.

dockerLoad command: https://www.commonwl.org/v1.0/CommandLineTool.html#DockerRequirement

environment variables: https://www.commonwl.org/user_guide/topics/environment-variables.html

#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
baseCommand: echo
arguments: [$(inputs.message)]
requirements:
  EnvVarRequirement:
    envDef:
      ENV_CONTAINER_REGISTRY: $(inputs.ecr_uri)

hints:
  DockerRequirement:
    dockerPull: $ENV_CONTAINER_REGISTRY
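    # Note: dockerPull is a plain string; CWL does not expand environment
    # variables or parameter references in it (see the earlier comments
    # about dockerPull not accepting input parameters), so this approach
    # does not appear to work.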

inputs:
  message:
    type: string
  ecr_uri:
    type: string

outputs:
  the_output:
    type: stdout
stdout: echo_message.txt

InlineJavascriptRequirement JavaScript to override values: https://www.commonwl.org/user_guide/topics/expressions.html

cwltool:overrides will override nested workflows: https://github.com/common-workflow-language/cwltool?tab=readme-ov-file#overriding-workflow-requirements-at-load-time