LucaCinquini opened this issue 3 months ago

Suggested steps:
- Manually create ECR on unity-venue-dev
It looks like this might be a good starting place for using containers with CWL: https://www.commonwl.org/user_guide/topics/using-containers.html
For security reasons, we should store the ECR endpoint (which contains the account number) as an Airflow variable, then within the DAG retrieve it and pass it to echo_message.cwl as an additional parameter.
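A minimal sketch of that retrieval step (the variable key and parameter names here are illustrative, not necessarily the project's actual ones):

```python
import json

# In the real DAG the value would come from Airflow's Variable store:
#   from airflow.models import Variable
#   ecr_uri = Variable.get("ecr_uri")   # hypothetical variable key
def build_cwl_job_args(message: str, ecr_uri: str) -> str:
    """Merge the ECR endpoint into the CWL job arguments as an
    additional input parameter passed to echo_message.cwl."""
    return json.dumps({"message": message, "ecr_uri": ecr_uri})
```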
Would it be possible to use an existing application package? I'd rather test with something real instead of something like hello world.
I can supply the existing docker containers and CWL for our example application, all we'd need to do is update the pointers to docker containers in the CWL to the ECR ones.
Let's do both... Simple "echo" first, application package after.
I was able to implement this partway, but CWL does not allow input parameters in the dockerPull field. See this documentation for a list.
I am hoping that I can override the workflow requirements at load time but am not quite sure this will work.
@nikki-t : using override might work - but we would also have to change the script that runs the CWL workflow. Another possibility would be to update the setup task to:
@LucaCinquini - I was able to get the override to work without needing to change the script that runs the CWL workflow by converting the CWL argument YAML into JSON and including the override there.
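A minimal sketch of the idea, working from the already-parsed argument dict (the YAML-parsing step is omitted, and the workflow URL is hypothetical; cwltool:overrides is cwltool's documented override mechanism):

```python
import json

def args_to_json_with_override(args: dict, workflow_url: str, ecr_uri: str) -> str:
    """Return the CWL job arguments as JSON, adding a cwltool:overrides
    entry that replaces the workflow's DockerRequirement.dockerPull."""
    args = dict(args)  # avoid mutating the caller's dict
    args["cwltool:overrides"] = {
        workflow_url: {
            "requirements": {"DockerRequirement": {"dockerPull": ecr_uri}}
        }
    }
    return json.dumps(args, indent=2)
```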
The CWL task cannot pull the private container image from ECR. The CWL task launches a pod and runs the SPS Docker CWL image. This gets a service account of airflow-worker which has an IAM role (AirflowWorkerPolicy) associated with it. This role has permissions to access ECR:
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
But I get the following error:
[2024-08-19, 18:30:56 UTC] {pod_manager.py:468} INFO - [base] INFO /usr/share/cwl/venv/bin/cwl-runner 3.1.20240508115724
[2024-08-19, 18:30:56 UTC] {pod_manager.py:468} INFO - [base] Error: No such object: xxx.dkr.ecr.us-west-2.amazonaws.com/unity-nikki-1-dev-sps-busybox
[2024-08-19, 18:30:56 UTC] {pod_manager.py:468} INFO - [base] INFO ['docker', 'pull', xxx.dkr.ecr.us-west-2.amazonaws.com/unity-nikki-1-dev-sps-busybox']
[2024-08-19, 18:30:56 UTC] {pod_manager.py:468} INFO - [base] Using default tag: latest
[2024-08-19, 18:30:56 UTC] {pod_manager.py:468} INFO - [base] Error response from daemon: Head "https://xxx.dkr.ecr.us-west-2.amazonaws.com/v2/unity-nikki-1-dev-sps-busybox/manifests/latest": no basic auth credentials
[2024-08-19, 18:30:56 UTC] {pod_manager.py:468} INFO - [base] ERROR Workflow or tool uses unsupported feature:
[2024-08-19, 18:30:57 UTC] {pod_manager.py:486} INFO - [base] Docker is required to run this tool: Command '['docker', 'pull', xxx.dkr.ecr.us-west-2.amazonaws.com/unity-nikki-1-dev-sps-busybox']' returned non-zero exit status 1.
I think the task may need to log into the ECR repo, but there isn't a clear way to have this occur prior to the docker pull. Perhaps we can log in within the entrypoint script?
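One way to sketch that login step: ECR's GetAuthorizationToken returns a base64-encoded "user:password" string that can be fed to docker login. The decoding below is the testable part; the boto3 call and the actual docker invocation are only indicated in comments, and the function name is illustrative.

```python
import base64

def ecr_login_command(authorization_token: str, registry: str) -> list:
    """Build the docker login command an entrypoint script could run.
    The token comes back base64-encoded as "user:password" from
    boto3.client("ecr").get_authorization_token() (call not shown);
    a real script should prefer --password-stdin over --password."""
    user, _, password = base64.b64decode(authorization_token).decode().partition(":")
    return ["docker", "login", "--username", user, "--password", password, registry]
```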
It looks like the CWL DockerRequirement field can build a docker image using dockerFile: https://www.commonwl.org/v1.2/CommandLineTool.html#DockerRequirement
After reading the documentation on IAM roles for service accounts and on the Amazon EKS Pod Identity Agent, it looks like the following steps are needed to connect service accounts to IAM roles and use the EKS Pod Identity Agent: allow the AssumeRoleForPodIdentity action in the EKS Auth API, and attach trust policies like the following to the role.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::$account_id:oidc-provider/$oidc_provider"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"$oidc_provider:aud": "sts.amazonaws.com",
"$oidc_provider:sub": "system:serviceaccount:$namespace:$service_account"
}
}
}
]
}

And for the EKS Pod Identity Agent, the role needs this trust policy:

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
"Effect": "Allow",
"Principal": {
"Service": "pods.eks.amazonaws.com"
},
"Action": [
"sts:AssumeRole",
"sts:TagSession"
]
}
]
}
The pod spec also needs to reference the service account via serviceAccountName in its YAML spec.

After putting all of these together, I am still running into the "no basic auth credentials" error. I am going to pivot to trying to log into Docker in the entrypoint script.
I agree... let's see if the other solution works, thanks.
I was able to use an ECR URI in the execution of a CWL DAG. I just need to finalize how we store the ECR URI. It is currently stored as an Airflow Variable with a key of cwl_dag_ecr_uri, which is what the cwl_dag_ecr.py script looks for in order to pass the URI to the CWL DAG. The end user will need to create a variable with this name in the Airflow UI and assign it a value with the URI of their ECR container image. We could also pass it in as an input parameter to the DAG, but then the user would need to enter it every time.
@LucaCinquini - What do you think is best?
@nikki-t : what about if we store the AWS ECR base name ([XXXXXXXXXX.dkr.ecr.us-west-2.amazonaws.com]) as an Airflow variable, so it can be used by multiple DAGs. Then the specific "cwl_dag_ecr.py" will combine that with the name and version of the CWL Docker image. Other DAGs might use ECR to execute different Docker images.
@LucaCinquini - I think we would still need input parameters or Airflow variables for the name and version of the ECR container image. The full URI is: <aws_account_id>.dkr.ecr.<aws_region>.amazonaws.com/<container_image_name>:<version>. So the user would have to provide the <container_image_name> and <version>, since those might be different for different executions of the DAG. But we can certainly store the AWS ECR base name as an Airflow variable, which we might be able to define in the Helm chart for Airflow.
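Combining the stored base name with the user-supplied inputs could be as simple as the following sketch (function and parameter names are illustrative):

```python
def build_image_uri(ecr_base: str, image_name: str, version: str = "latest") -> str:
    """Join the ECR base name (from an Airflow variable) with the
    user-supplied container image name and version into a full pull URI."""
    return f"{ecr_base}/{image_name}:{version}"
```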
Sorry, I didn't express myself correctly. The "cwl_dag_ecr.py" should specify the name of this Docker image: ghcr.io/unity-sds/unity-sps/sps-docker-cwl-ecr:latest, which it already does. The name and tag of the algorithm image should be part of the specific CWL workflow, then pre-pended with the ECR base URI. I think we should investigate what CWL's support is for using Docker images from a private repository.
Here is the documentation on using containers with CWL. CWL does not allow parameter references in the dockerPull key, so the ECR URI needs to be built and passed to the CWL DAG definition in the cwl_dag_ecr.py script.
So I think it makes the most sense to have the user specify the ECR URI via Airflow variable or input parameter. And to keep things as simple as possible, we can store the ECR base URI (which is also the login URL) and then the user can pass in their container image name and version as input parameters.
Realized that the current solution won't work for nested workflows, exploring different options.
Using "cwltool --pack" would create a single CWL document from the top level and nested workflows, which can then be parsed and modified as needed. Unfortunately, this does not seem to work with some of our workflows that are stored in Dockstore.
For example, this command works fine:
cwltool --pack https://raw.githubusercontent.com/common-workflow-language/cwltool/main/tests/wf/nested.cwl
But packing one of our Dockstore workflows gives an error:
cwltool --debug --pack https://raw.githubusercontent.com/unity-sds/sbg-workflows/main/L1-to-L2-e2e.cwl
ERROR I'm sorry, I couldn't load this CWL file. The error was:
Traceback (most recent call last):
  File "/Users/cinquini/PycharmProjects/unity-sps/.venv/lib/python3.9/site-packages/cwltool/main.py", line 1114, in main
    print(print_pack(loadingContext, uri), file=stdout)
  File "/Users/cinquini/PycharmProjects/unity-sps/.venv/lib/python3.9/site-packages/cwltool/main.py", line 633, in print_pack
    packed = pack(loadingContext, uri)
  File "/Users/cinquini/PycharmProjects/unity-sps/.venv/lib/python3.9/site-packages/cwltool/pack.py", line 265, in pack
    v = rewrite_inputs[r]
KeyError: 'http://awslbdockstorestack-lb-1429770210.us-west-2.elb.amazonaws.com:9998/api/ga4gh/trs/v2/tools/%23workflow%2Fdockstore.org%2Fmike-gangl%2FSBG-unity-fractional-cover/versions/2/PLAIN-CWL/descriptor/%2FDockstore.cwl#sbg-unity-fractional-cover'
Probably because of the "#" in the URL. Can that be avoided?
Nikki: could we encode the HTML URL # sign for the Dockstore URLs? Maybe:
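One option would be percent-encoding the fragment marker before handing the URL to cwltool, e.g. with urllib.parse.quote; whether cwltool's resolver then handles the encoded URL correctly would still need testing:

```python
from urllib.parse import quote

url = ("http://awslbdockstorestack-lb-1429770210.us-west-2.elb.amazonaws.com:9998"
       "/api/ga4gh/trs/v2/tools/%23workflow%2Fdockstore.org%2Fmike-gangl"
       "%2FSBG-unity-fractional-cover/versions/2/PLAIN-CWL/descriptor"
       "/%2FDockstore.cwl#sbg-unity-fractional-cover")
# Encode '#' as %23 while leaving ':', '/', and the already-encoded '%'
# sequences untouched.
encoded = quote(url, safe=":/%")
```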
I am going to do a little testing this morning (9/3).
@LucaCinquini - I see that the above is a little more complicated than I initially thought. I have run the cwltool --pack --debug command and it does seem to get snagged at the # in the name of this step to run: http://awslbdockstorestack-lb-1429770210.us-west-2.elb.amazonaws.com:9998/api/ga4gh/trs/v2/tools/%23workflow%2Fdockstore.org%2Fmike-gangl%2FSBG-unity-fractional-cover/versions/2/PLAIN-CWL/descriptor/%2FDockstore.cwl
Looking at the CWL definition file that it is referencing, there is a # in the definition that gets packed: http://awslbdockstorestack-lb-1429770210.us-west-2.elb.amazonaws.com:9998/api/ga4gh/trs/v2/tools/%23workflow%2Fdockstore.org%2Fmike-gangl%2FSBG-unity-fractional-cover/versions/2/PLAIN-CWL/descriptor/%2FDockstore.cwl#sbg-unity-fractional-cover
I tried running cwltool --pack --debug on my local system, saving the above as local files, and can reproduce the error. When I remove the # from the run key #sbg-unity-fractional-cover, it then tries to pack a file that doesn't exist. So it seems the # is indicating another workflow and might be a relative path?
I also found a few open issues for cwltool that may explain what we are running into (one suggests running cwltool --validate on the workflow).

Maybe we can ask @mike-gangl if that # can be avoided while writing CWL workflows and storing them in Dockstore. Or maybe we store them in GitHub?
@LucaCinquini - I think that makes sense. @mike-gangl did mention that long term there will be a docker registry owned by the system and this would remove account numbers from the CWL definition files.
I also did a test to see if the overrides would populate a nested workflow and I was able to run a successful (small) test using this workflow file: https://raw.githubusercontent.com/unity-sds/unity-sps-workflows/186-ecr-cwl-dag/demos/echo_message_ecr_nested.cwl
I can confirm that the following override gets passed to the nested workflow and the docker container is pulled and executed:
"cwltool:overrides": {
"https://raw.githubusercontent.com/unity-sds/unity-sps-workflows/186-ecr-cwl-dag/demos/echo_message_ecr.cwl": {
"requirements": {"DockerRequirement": {"dockerPull": ecr_uri}}
}
}
I did not update the Python script that executes the CWL DAGs. I think we should regroup on the requirements for this so that we can figure out a true test of the nested workflow overrides, since it seems like there are nested, nested workflows.
It would also be good to understand how the end user will pass in the nested workflow and the best way to determine the list of nested workflows. Maybe we can also tag up on the context for these in the larger Unity SDS system?
Documenting solutions tried and results.

dockerLoad command: https://www.commonwl.org/v1.0/CommandLineTool.html#DockerRequirement
This maps to docker load, which loads a container image from a tar archive, file, or STDIN. This doesn't apply to our use case.

Environment variables: https://www.commonwl.org/user_guide/topics/environment-variables.html
This resulted in INFO ['docker', 'pull', '$ENV_CONTAINER_REGISTRY'] in the logs. The environment variable ENV_CONTAINER_REGISTRY was not parsed. The CWL used was:

#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
baseCommand: echo
arguments: [$(inputs.message)]
requirements:
EnvVarRequirement:
envDef:
ENV_CONTAINER_REGISTRY: $(inputs.ecr_uri)
hints:
DockerRequirement:
dockerPull: $ENV_CONTAINER_REGISTRY
inputs:
message:
type: string
ecr_uri:
type: string
outputs:
the_output:
type: stdout
stdout: echo_message.txt
InlineJavascriptRequirement, using JavaScript to override values: https://www.commonwl.org/user_guide/topics/expressions.html
Expressions are not allowed in dockerPull.

cwl:overrides will override nested workflows: https://github.com/common-workflow-language/cwltool?tab=readme-ov-file#overriding-workflow-requirements-at-load-time
Using Docker images pulled from ECR would enable higher scalability when executing workloads. We want to demonstrate how this is possible using a simple CWL workflow.