sacundim opened this issue 2 years ago
Ok, I figured out the problem. I misunderstood this documentation to mean that authenticating to AWS was sufficient to be able to read from those buckets:
> Note that even though the s3://nextstrain-data/ and gs://nextstrain-data/ buckets are public, the defaults for most S3 and GS clients require some user to be authenticated, though the specific user/account doesn’t matter. [...] Both https and s3 should work out of the box in the standard Nextstrain Conda and Docker execution environments.
But that really means (and does technically say) that such authentication is necessary (not sufficient!) for the execution environment to be able to access the s3://nextstrain-data/ bucket. In my case, IAM was denying my Batch job access to the bucket for the simple reason that I hadn't given my job containers permission to access it. Fix: grant the job's IAM role an identity-based policy that allows reading the bucket.
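A minimal sketch of what that kind of grant can look like with boto3, assuming a hypothetical role name for the Batch job (the actual role name, and whether you do this in code or in the console, will vary):

```python
import json

import boto3

# Hypothetical name of the IAM role that the AWS Batch job containers assume.
ROLE_NAME = "my-batch-job-role"

# Identity-based policy allowing reads from the public nextstrain-data bucket.
# Even though the bucket is public, principals in *other* accounts still need
# an Allow like this in their own account for signed requests to succeed.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::nextstrain-data",
                "arn:aws:s3:::nextstrain-data/*",
            ],
        }
    ],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="read-nextstrain-data",
    PolicyDocument=json.dumps(policy),
)
```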
Since this is a cost savings for your project, you'll likely want to document this a bit more explicitly so that people don't have to be AWS gurus.
@sacundim Thanks for digging into this and relaying your findings here! I agree the documentation here could be clarified.
What you ran into was a nuance of cross-account access in AWS. As briefly described in AWS docs about "public" access (emphasis mine):
> For IAM users and role principals within your account, no other permissions are required. For principals in other accounts, they must also have identity-based permissions in their account that allow them to access your resource. This is called cross-account access.
The link describes in more detail why you needed to grant access to s3://nextstrain-data
in your own account's IAM configuration.
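To make the two halves of cross-account access concrete: the bucket owner's resource-based policy opens the bucket to everyone, but a principal in another account additionally needs an Allow in its own account's identity-based policies. A sketch of the usual public-read shape, shown here as a Python dict (illustrative only, not Nextstrain's actual bucket policy):

```python
# Resource-based policy the bucket owner attaches to the bucket
# (illustrative only; not Nextstrain's actual policy).
# Principal "*" is what makes the bucket "public".
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::nextstrain-data/*",
        }
    ],
}
# For a *signed* request from another account, S3 also requires a matching
# Allow in that account's identity-based policies; that missing Allow is
# what produced the 403 here.
```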
Something like the above should be mentioned in our docs.
Relatedly, I wish it were easier in Snakemake's S3 remote support to disable request signing for these specific S3 requests, since anonymous access works fine and avoids the issue of setting up IAM for cross-account access.
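For reference, anonymous access with boto3 directly looks like this: an unsigned request carries no credentials, so the caller's IAM policies never come into play and the bucket's public access applies on its own (the object key below is hypothetical):

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned client: requests are sent without credentials, so only the
# bucket's own (public) resource policy is evaluated.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Object key is hypothetical, for illustration only.
obj = s3.get_object(Bucket="nextstrain-data", Key="some/intermediate/asset.json")
print(obj["ContentLength"])
```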
As @tsibley mentions in Pull Request #903, the Open workflow documentation recommends that users preferentially access the intermediate build assets via S3 rather than HTTPS. The documentation notes (in the passage quoted above) that this requires the S3 client to be authenticated with AWS.
What I observe with my own Open-based build in AWS Batch, however, is that my job is authenticated and is able to access my own private S3 buckets.
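For example, a check along these lines from inside the job container succeeds (the private bucket name is hypothetical):

```python
import boto3

# Confirm which identity the Batch job is running as...
print(boto3.client("sts").get_caller_identity()["Arn"])

# ...and that it can reach one of my own private buckets
# (bucket name is hypothetical).
boto3.client("s3").head_bucket(Bucket="my-private-build-bucket")
```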
Nevertheless, the build gets an HTTP 403 error when it tries to fetch the assets from S3.
Environment info:
- Dockerfile for my AWS Batch job: it just adds my files on top of the nextstrain/base:latest image

The documentation states that s3 "should work out of the box in the standard Nextstrain Conda and Docker execution environments", and I don't see what I could possibly have done to break the Docker execution environment, so at the very least I think this would merit a documentation fix.