snakemake / snakemake-storage-plugin-s3

A Snakemake storage plugin for S3 API storage (AWS S3, MinIO, etc.)
MIT License

Snakemake running as an AWS Batch or AWS Fargate task raises MissingInputException on inputs stored on an S3 bucket #30

Closed: hessamkhoshniat closed this issue 1 month ago

hessamkhoshniat commented 3 months ago

We have a Dockerized Snakemake pipeline with the input data stored on an S3 bucket, snakemake-bucket:

Snakefile:


rule bwa_map:
    input:
        "data/genome.fa"
    output:
        "results/mapped/A.bam"
    shell:
        "cat {input} > {output}"

Dockerfile:

FROM snakemake/snakemake:v8.15.2
RUN mamba install -c conda-forge -c bioconda snakemake-storage-plugin-s3
WORKDIR /app
COPY ./workflow ./workflow
ENV PYTHONWARNINGS="ignore:Unverified HTTPS request"
CMD ["snakemake","--default-storage-provider","s3","--default-storage-prefix","s3://snakemake-bucket","results/mapped/A.bam","--cores","1","--verbose","--printshellcmds"]

When we run the container with the following command, it downloads the input file, runs the pipeline and stores the output on the bucket successfully:

docker run -it -e SNAKEMAKE_STORAGE_S3_ACCESS_KEY=**** -e SNAKEMAKE_STORAGE_S3_SECRET_KEY=**** our-snakemake:v0.0.10

However, when we deploy it as an AWS Batch job or an AWS Fargate task, it gives the following error immediately:

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Full Traceback (most recent call last):
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/cli.py", line 2103, in args_to_api
    dag_api.execute_workflow(
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/api.py", line 594, in execute_workflow
    workflow.execute(
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/workflow.py", line 1081, in execute
    self._build_dag()
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/workflow.py", line 1037, in _build_dag
    async_run(self.dag.init())
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/common/__init__.py", line 94, in async_run
    return asyncio.run(coroutine)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/snakemake/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/snakemake/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/snakemake/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/dag.py", line 183, in init
    job = await self.update(
          ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/dag.py", line 1013, in update
    raise exceptions[0]
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/dag.py", line 970, in update
    await self.update_(
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/dag.py", line 1137, in update_
    raise MissingInputException(job, missing_input)
snakemake.exceptions.MissingInputException: Missing input files for rule bwa_map:
    output: results/mapped/A.bam
    wildcards: sample=A
    affected files:
        s3://snakemake-bucket/data/genome.fa (storage)

MissingInputException in rule bwa_map in file /app/workflow/Snakefile, line 10:
Missing input files for rule bwa_map:
    output: results/mapped/A.bam
    wildcards: sample=A
    affected files:
        s3://snakemake-bucket/data/genome.fa (storage)

The image works fine locally and also on an external VPS, but it doesn't work on AWS Fargate. The file on the bucket is accessible and downloadable from inside the container on the AWS task, verified by running the following with /opt/conda/envs/snakemake/bin/python:

import os

import boto3

# download every object in the bucket using the plugin's credentials
s3 = boto3.resource(
    "s3",
    aws_access_key_id=os.environ.get("SNAKEMAKE_STORAGE_S3_ACCESS_KEY"),
    aws_secret_access_key=os.environ.get("SNAKEMAKE_STORAGE_S3_SECRET_KEY"),
)
my_bucket = s3.Bucket("snakemake-bucket")
for obj in my_bucket.objects.all():
    my_bucket.download_file(obj.key, obj.key)
print(os.listdir())

Snakemake Docker tag: snakemake/snakemake:v8.15.2

It seems AWS Fargate sets some environment variables, including AWS_CONTAINER_CREDENTIALS_RELATIVE_URI, based on which boto3 decides that it needs AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID in addition to SNAKEMAKE_STORAGE_S3_SECRET_KEY and SNAKEMAKE_STORAGE_S3_ACCESS_KEY. If we want to run Snakemake on AWS Fargate, we have to set all four variables, or unset AWS_CONTAINER_CREDENTIALS_RELATIVE_URI. It would be a good idea to note in the documentation that users on AWS also need to set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY: https://snakemake.readthedocs.io/en/stable/snakefiles/storage.html#credentials
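A minimal sketch of the first workaround, assuming the image is changed to launch Snakemake through a small wrapper script (the wrapper itself is hypothetical, not part of the plugin or this Dockerfile):

import os
import subprocess
import sys

# Mirror the Snakemake storage credentials (assumed to be set in the task
# environment) into the standard AWS variables so boto3's credential chain
# resolves consistently on Fargate. Alternatively, drop the
# container-credentials hint instead:
#   os.environ.pop("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI", None)
os.environ.setdefault("AWS_ACCESS_KEY_ID", os.environ["SNAKEMAKE_STORAGE_S3_ACCESS_KEY"])
os.environ.setdefault("AWS_SECRET_ACCESS_KEY", os.environ["SNAKEMAKE_STORAGE_S3_SECRET_KEY"])

# hand the original CLI arguments through to Snakemake
sys.exit(subprocess.call(["snakemake", *sys.argv[1:]]))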

jlafaye commented 3 months ago

Hello Hessam,

I am trying to run a similar setup with AWS Batch on EC2 (not Fargate). If you use the AWS Batch executor for Snakemake, you will notice that the SNAKEMAKE_STORAGE_xxx credentials are passed in the 'command' option, so they get logged in CloudTrail and are also visible in the AWS console. This is considered bad security practice.

That is why I went down a different road and decided not to forward the credentials used by the Snakemake 'orchestrator' process, relying instead on the 'job role' feature of AWS Batch (the job role should have the runtime permissions to run your tasks plus the permissions to read from and write to your input/output buckets). This can easily be achieved because most AWS runtimes (including boto3) will load credentials from the instance metadata if none are provided.
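For illustration, a minimal sketch of what that credential fallback looks like in plain boto3 (bucket and key borrowed from the example above):

import boto3

# No explicit keys are passed, so boto3 walks its default credential chain
# (environment variables, the ECS/Fargate container credentials endpoint,
# EC2 instance metadata) and picks up the Batch job role automatically.
s3 = boto3.client("s3")
s3.download_file("snakemake-bucket", "data/genome.fa", "/tmp/genome.fa")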

Unfortunately the S3 plugin does not allow this, but it is a trivial patch. I have submitted a PR which does exactly this: https://github.com/snakemake/snakemake-storage-plugin-s3/pull/31. You might want to give it a try. It would be great to have your feedback on it.

Please take into account that this PR is my first attempt at contributing to the Snakemake ecosystem, so it might not be merged into the repo as-is.

hessamkhoshniat commented 3 months ago

Hello jlafaye, thanks a lot for your comment and for your PR. We'll give it a try and let you know the outcome.

jlafaye commented 2 months ago

FYI: that PR was closed, but a mostly similar one was merged instead: https://github.com/snakemake/snakemake-storage-plugin-s3/pull/33

johanneskoester commented 1 month ago

Is this resolved now?

johanneskoester commented 1 month ago

No response so far. I assume this is resolved. Please reopen if the problem persists.