nextstrain / ncov

Nextstrain build for novel coronavirus SARS-CoV-2
https://nextstrain.org/ncov

Fetching Open workflow intermediate build assets over S3 gives a 403 error in a really simple setup #909

Open sacundim opened 2 years ago

sacundim commented 2 years ago

As @tsibley mentions in Pull Request #903, the Open workflow documentation recommends that users preferentially fetch the intermediate build assets over S3 rather than HTTPS. The documentation notes that this requires the S3 client to be authenticated with AWS:

Note that even though the s3://nextstrain-data/ and gs://nextstrain-data/ buckets are public, the defaults for most S3 and GS clients require some user to be authenticated, though the specific user/account doesn’t matter.
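
(For concreteness, the same intermediate file can be fetched either way; the path below is just an example of one of the Open files, not necessarily the one my build uses:)

# over HTTPS, no AWS credentials involved
curl -fsSL -o metadata.tsv.gz https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz
# over S3, which goes through an authenticated AWS client
aws s3 cp s3://nextstrain-data/files/ncov/open/metadata.tsv.gz metadata.tsv.gz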

What I observe with my own Open-based build in AWS Batch, however, is that my job is authenticated and is able to access my own private S3 buckets:

+ echo 'Sun Apr 10 00:18:16 UTC 2022: Checking access to destination and jobs buckets'
+ aws s3 ls s3://covid-19-puerto-rico/auspice/
Sun Apr 10 00:18:16 UTC 2022: Checking access to destination and jobs buckets
2022-03-19 23:31:04     897015 ncov_global.json
2022-03-19 23:31:04      39894 ncov_global_root-sequence.json
2022-03-19 23:31:04      47575 ncov_global_tip-frequencies.json
2022-04-03 13:44:16   81954630 ncov_puerto-rico.json
2022-04-03 13:44:16      39894 ncov_puerto-rico_root-sequence.json
2022-04-03 13:44:16    3438099 ncov_puerto-rico_tip-frequencies.json
+ aws s3 ls s3://covid-19-puerto-rico-nextstrain-jobs/

...but it nevertheless gets an HTTP 403 error when the build tries to fetch those assets from S3:

+ echo 'Sun Apr 10 00:18:17 UTC 2022: Running the Nexstrain build'
+ snakemake --printshellcmds --profile puerto-rico_profiles/puerto-rico_open/
Sun Apr 10 00:18:17 UTC 2022: Running the Nexstrain build
Building DAG of jobs...
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/snakemake/__init__.py", line 633, in snakemake
keepincomplete=keep_incomplete,
File "/usr/local/lib/python3.7/site-packages/snakemake/workflow.py", line 565, in execute
dag.init()

[...]

File "/usr/local/lib/python3.7/site-packages/snakemake/io.py", line 262, in exists
return self.exists_remote
File "/usr/local/lib/python3.7/site-packages/snakemake/io.py", line 135, in wrapper
v = func(self, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/snakemake/io.py", line 314, in exists_remote
return self.remote_object.exists()
File "/usr/local/lib/python3.7/site-packages/snakemake/remote/S3.py", line 79, in exists
return self._s3c.exists_in_bucket(self.s3_bucket, self.s3_key)
File "/usr/local/lib/python3.7/site-packages/snakemake/remote/S3.py", line 327, in exists_in_bucket
self.s3.Object(bucket_name, key).load()
File "/usr/local/lib/python3.7/site-packages/boto3/resources/factory.py", line 564, in do_action
response = action(self, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/boto3/resources/action.py", line 88, in __call__
response = getattr(parent.meta.client, operation_name)(*args, **params)
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 401, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 731, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
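
For what it's worth, the failing call is a plain S3 HeadObject, so it can be reproduced outside of Snakemake with the AWS CLI under the same credentials (the object key here is just an example of one of the Open intermediate files, not necessarily the one my build requests):

# reproduces the HeadObject call that Snakemake's S3 remote makes
aws s3api head-object \
  --bucket nextstrain-data \
  --key files/ncov/open/metadata.tsv.gz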

Environment info:

The documentation states that:

Both https and s3 should work out of the box in the standard Nextstrain Conda and Docker execution environments.

...and I don't see what I could have possibly done that breaks the Docker execution environment, so at the very least I think this would merit a documentation fix.

sacundim commented 2 years ago

Ok, I figured out the problem. I misunderstood this documentation to mean that authenticating to AWS was sufficient to be able to read from those buckets:

Note that even though the s3://nextstrain-data/ and gs://nextstrain-data/ buckets are public, the defaults for most S3 and GS clients require some user to be authenticated, though the specific user/account doesn’t matter. [...] Both https and s3 should work out of the box in the standard Nextstrain Conda and Docker execution environments.

But what it really means (and does technically say) is that such authentication is necessary (not sufficient!) for the execution environment to be able to access the s3://nextstrain-data/ bucket. In my case, IAM is denying my Batch job access to that bucket for the simple reason that I never granted my job containers' role permission to access it. Fix:
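
Roughly, the Batch job's IAM role needs an identity-based policy that allows reads from the public bucket. A sketch of the general shape (role and policy names below are placeholders; the exact setup will vary):

# attach an inline policy to the Batch job's IAM role (names are placeholders)
aws iam put-role-policy \
  --role-name my-nextstrain-batch-job-role \
  --policy-name read-nextstrain-open-data \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {"Effect": "Allow",
       "Action": ["s3:GetObject"],
       "Resource": "arn:aws:s3:::nextstrain-data/*"},
      {"Effect": "Allow",
       "Action": ["s3:ListBucket"],
       "Resource": "arn:aws:s3:::nextstrain-data"}
    ]
  }'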

Since fetching over S3 saves your project money, I think you'd want to document this a bit more explicitly so that people don't have to be AWS gurus.

tsibley commented 2 years ago

@sacundim Thanks for digging into this and relaying your findings here! I agree the documentation could be clearer on this point.

What you ran into was a nuance of cross-account access in AWS. As briefly described in AWS docs about "public" access (emphasis mine):

For IAM users and role principals within your account, no other permissions are required. For principals in other accounts, they must also have identity-based permissions in their account that allow them to access your resource. This is called cross-account access.

The link describes in more detail why you needed to grant access to s3://nextstrain-data in your own account's IAM configuration.
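
For example, a quick way to see which principal (and therefore which AWS account) a job's requests are being signed as is:

aws sts get-caller-identity

If that principal is in an account other than Nextstrain's, the cross-account rule quoted above applies and the identity-based permissions are required.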

Something like the above should be mentioned in our docs.

tsibley commented 2 years ago

Relatedly, I wish it were easier in Snakemake's S3 remote support to disable request signing for these specific S3 requests, since anonymous access works fine and avoids having to set up IAM for cross-account access.
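
For comparison, the AWS CLI can make the equivalent request unsigned via --no-sign-request (the object key below is just an example of one of the Open files):

# same HeadObject call, but anonymous: no credentials or cross-account IAM needed
aws s3api head-object \
  --bucket nextstrain-data \
  --key files/ncov/open/metadata.tsv.gz \
  --no-sign-request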