pangeo-data / pangeo-cloud-federation

Deployment automation for Pangeo JupyterHubs on AWS, Google, and Azure
https://pangeo.io/cloud.html

User-level permissions for pod access to S3 buckets #247

Open apawloski opened 5 years ago

apawloski commented 5 years ago

As a user, I'd like to use an S3 bucket (or a prefix within a shared bucket) as a storage option for my work. Ideally, that storage would have access control so that only users with the correct permissions can interact with it.

This is definitely possible from an AWS IAM policy perspective. For example: https://aws.amazon.com/premiumsupport/knowledge-center/iam-s3-user-specific-folder/
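
As a rough sketch of what such a per-user policy could look like (the bucket name and the ${aws:username}-based prefix layout below are placeholders, following the linked example):

```python
# Sketch of a per-user S3 prefix policy, loosely following the linked AWS
# example. "pangeo-scratch" and the ${aws:username} prefix layout are
# placeholders for whatever bucket/prefix scheme we actually choose.
user_prefix_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Allow listing only the user's own prefix
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": ["arn:aws:s3:::pangeo-scratch"],
            "Condition": {"StringLike": {"s3:prefix": ["${aws:username}/*"]}},
        },
        {
            # Allow object reads/writes/deletes only under that prefix
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": ["arn:aws:s3:::pangeo-scratch/${aws:username}/*"],
        },
    ],
}
```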

The challenge is that while we can give this permission at an instance level (via IAM Instance Profiles), multiple users' pods may end up on the same underlying instance. Thus a pod could access any co-resident pod's S3 bucket/prefix.

Another option would be to issue static credentials (access key strings) to users. It would be important for these to be scoped to S3 actions/conditions only, and only to requests from our cluster's CIDR block. We could then inject the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY env vars into each user's pod. But I'm unsure how the actual implementation would work -- specifically, what would inject those env vars into a pod, and how would it do that?
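
One possibility (just a sketch, untested) would be a JupyterHub pre_spawn_hook in the hub config that looks up per-user keys from some secret store and sets them on the spawner; lookup_user_keys below is a hypothetical helper:

```python
# jupyterhub_config.py sketch: inject per-user AWS credentials into the pod.
# lookup_user_keys() is a hypothetical helper that would fetch S3-scoped keys
# (e.g. from a Kubernetes Secret) for the given hub username.
def inject_aws_creds(spawner):
    creds = lookup_user_keys(spawner.user.name)  # hypothetical helper
    spawner.environment.update({
        "AWS_ACCESS_KEY_ID": creds["access_key_id"],
        "AWS_SECRET_ACCESS_KEY": creds["secret_access_key"],
    })

c.Spawner.pre_spawn_hook = inject_aws_creds
```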

There may be other options as well. I'm especially curious to hear @yuvipanda's and @jacobtomlinson's thoughts on this.

rabernat commented 5 years ago

This is important. The same issues apply to other clouds (GCP, Azure, etc.), and it comes up often, also in the context of writing data to object storage.

From a user's perspective, it would be nice to manage these credentials in JupyterLab. Would it be feasible to create a cloud-credentials-manager JupyterLab extension? I'm sure there are all sorts of security challenges...

yuvipanda commented 5 years ago

I am working on this problem right now for https://github.com/berkeley-dsep-infra/datahub/issues/637. My deadline is in a couple weeks, so I’ll play with a few options and keep you posted!

jacobtomlinson commented 5 years ago

We use kube2iam for this. Currently we give everyone the same role at the pod level, but this could definitely be done in a more fine-grained way using JupyterHub.
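
As a sketch of the more fine-grained version, KubeSpawner could set the kube2iam role annotation per pod from a hook that maps hub usernames to IAM role ARNs (the mapping helper and the ARN it returns are placeholders):

```python
# jupyterhub_config.py sketch: per-user kube2iam role via a pod annotation.
# role_arn_for_user() is a hypothetical mapping from hub username to an IAM
# role ARN scoped to that user's bucket/prefix.
def set_kube2iam_role(spawner):
    role_arn = role_arn_for_user(spawner.user.name)  # hypothetical helper
    annotations = dict(spawner.extra_annotations)
    annotations["iam.amazonaws.com/role"] = role_arn  # kube2iam reads this
    spawner.extra_annotations = annotations

c.KubeSpawner.pre_spawn_hook = set_kube2iam_role
```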

jacobtomlinson commented 5 years ago

@rabernat that sounds like a neat idea.

Often when users are managing their own credentials, they are stored in dotfiles in their home directories. An extension to manage those would be nice. If users are looking after their own keys, I don't see many security issues.

scottyhq commented 5 years ago

This is currently coming up as an issue during our icesat2 hackweek at UW. People often want to move large numbers of files TO and FROM the hub. For now, we have a fully public S3 bucket to facilitate this, which is certainly not ideal. Another option is adding rclone (https://rclone.org/) to our images so that users can move things between their storage of choice. Have others done this? Of course, user credentials would then be accessible to administrators, but not to other hub users.

apawloski commented 5 years ago

At a minimum, I'd recommend locking down the bucket's policy to the icesat2 cluster's VPC (either by VPC ID or the VPC's CIDR block). It will still be open for read/write, but only from the cluster rather than from the whole world.

I’m on leave this week, but this is an example policy: https://aws.amazon.com/premiumsupport/knowledge-center/block-s3-traffic-vpc-ip/
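
Roughly along these lines (the bucket name and allowed IP range are placeholders; the linked article also shows a variant keyed on a VPC endpoint instead of IP addresses):

```python
# Sketch of a bucket policy that denies S3 access except from the cluster's
# IP range, loosely following the linked AWS article. The bucket name and the
# 203.0.113.0/24 CIDR are placeholders.
cluster_only_bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAccessFromOutsideCluster",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::icesat2-scratch",
                "arn:aws:s3:::icesat2-scratch/*",
            ],
            "Condition": {
                "NotIpAddress": {"aws:SourceIp": ["203.0.113.0/24"]}
            },
        }
    ],
}
```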

If users BYO credentials, we may want to curate and distribute an IAM policy they can plug and play -- specifically one that minimizes actions and is well constrained (e.g. S3 gets and puts, only from specific IP ranges). That way, if a user does leak their credentials, the attack surface is reduced.
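
For instance, something like this (again, the bucket name and IP range are placeholders):

```python
# Sketch of a minimal "plug and play" policy for user-held keys: object reads
# and writes on one bucket, only from an allowed IP range.
user_key_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": ["arn:aws:s3:::icesat2-scratch/*"],
            "Condition": {"IpAddress": {"aws:SourceIp": ["203.0.113.0/24"]}},
        }
    ],
}
```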


scottyhq commented 5 years ago

Good ideas @apawloski. We originally had S3 access limited to the VPC, but users want to upload data from their laptops or servers at all kinds of IP addresses. Many of these users are new to AWS and do not have their own credentials.

Another solution I've been thinking about is using STS to hand out temporary credentials. An administrator could do this every 24 hours, but that is annoying.

Could we instead modify the EC2 role so that users can run a command themselves to get 24-hour credentials for accessing the hub bucket from their laptops?
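
A sketch of what that could look like with boto3 (the role ARN is a placeholder, and this assumes whatever runs it is allowed to assume that role):

```python
# Sketch: mint temporary credentials a user could copy to their laptop.
# The role ARN is a placeholder; the caller (e.g. the hub, or the node's EC2
# role) must be allowed to call sts:AssumeRole on it. Note that STS caps
# assumed-role sessions at the role's MaxSessionDuration (12 h max, and only
# 1 h when chaining from another role), so true 24-hour credentials would
# need periodic reissue or a different mechanism.
import boto3

def issue_temporary_credentials(username):
    sts = boto3.client("sts")
    resp = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/hackweek-bucket-access",  # placeholder
        RoleSessionName=f"hub-{username}",
        DurationSeconds=12 * 3600,  # capped by the role's MaxSessionDuration
    )
    # Contains AccessKeyId, SecretAccessKey, SessionToken, Expiration
    return resp["Credentials"]
```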

amanda-tan commented 5 years ago

Can you use roles outside of the AWS ecosystem?

rabernat commented 5 years ago

IMO this is a really important general problem to solve if we want to scale and operationalize the Pangeo approach. Following with interest.

apawloski commented 5 years ago

Yes, it's worth noting that access from the cluster and access from users' machines elsewhere can go through different mechanisms.

So the cluster gets read/write via the bucket policy, while users have their individual access keys for read/write from their workstations (either through the STS assume-role strategy you're describing, or with policies attached directly to their keys).

The advantage of having both methods is that you don’t have to store credentials on the cluster.

Of course, none of this addresses the namespacing situation where one user doesn't want other users to have access to their data in S3.


yuvipanda commented 5 years ago

https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity sort of makes this much simpler on GKE...
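
On the hub side that mostly reduces to running user pods under a Kubernetes service account bound to a Google service account; a minimal KubeSpawner sketch (the service account name is a placeholder):

```python
# jupyterhub_config.py sketch for GKE Workload Identity: run user pods under a
# Kubernetes service account that is annotated with
#   iam.gke.io/gcp-service-account: <name>@<project>.iam.gserviceaccount.com
# and bound to that Google service account via roles/iam.workloadIdentityUser.
c.KubeSpawner.service_account = "pangeo-user"
c.KubeSpawner.automount_service_account_token = True
```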

scottyhq commented 5 years ago

Of course, none of this addresses the namespacing situation where one user doesn't want other users to have access to their data in S3.

True. There are two separate issues being discussed here: 1) private per-user S3 bucket space, and 2) a common group S3 bucket (e.g. pangeo-data). I'm going to create a new issue describing number 2.