pangeo-data / pangeo-cloud-federation

Deployment automation for Pangeo JupyterHubs on AWS, Google, and Azure
https://pangeo.io/cloud.html

dask pod service account access to non public storage (s3, gs buckets) #485

Open scottyhq opened 4 years ago

scottyhq commented 4 years ago

For better security and cost savings we are moving toward non-public (requester-pays) buckets for data storage. To access these buckets on AWS we recently reconfigured the hubs to assign an IAM role to a Kubernetes service account. Specifically, the daskkubernetes service account gets an IAM role with a policy for accessing specific buckets in the same region. The daskkubernetes service account is assigned to jupyterhub users in the pangeo helm chart here: https://github.com/pangeo-data/helm-chart/blob/56dc755ed0b56ad00571373d70c7fe0eaae5d556/pangeo/values.yaml#L25
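For reference, on EKS this linkage is an annotation on the service account. A rough sketch with the Kubernetes Python client (the namespace, account id, and role ARN below are placeholders, not our actual values):

from kubernetes import client, config

# Sketch: attach an IAM role to the daskkubernetes service account via the
# EKS "IAM roles for service accounts" annotation.
config.load_kube_config()
core = client.CoreV1Api()
core.patch_namespaced_service_account(
    name="daskkubernetes",
    namespace="staging",  # placeholder
    body={"metadata": {"annotations": {
        # placeholder ARN -- substitute the real account id and role name
        "eks.amazonaws.com/role-arn": "arn:aws:iam::123456789012:role/pangeo-s3-access",
    }}},
)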

This works great for pulling data into a jupyter session, but we're currently encountering errors when loading data with dask workers via s3fs/fsspec. The errors do not always clearly point to a permissions problem, for example `returned non-zero exit status 127.` and `KilledWorker: ('zarr-df194f82d92e97d5d5e60f0de5da8a42', <Worker 'tcp://192.168.169.195:33807', memory: 0, processing: 3>)`.

I think the root of the issue is that dask worker pods are currently assigned the default service account and therefore do not have permissions to access non-public pangeo datasets:

kubectl get pod -o yaml -n binder-staging dask-scottyhq-pangeo-binder-test-xg8nlaic-f8372c69-9mmg6m | grep service

    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
  serviceAccount: default
  serviceAccountName: default

One solution is linking cloud-provider permissions to the default service account, but should we instead create a new service account exclusively for dask worker pods?

pinging @jacobtomlinson @TomAugspurger and @martindurant per @rsignell-usgs and @jhamman 's suggestion

TomAugspurger commented 4 years ago

I think the root of the issue is that dask worker pods are currently assigned the default service account and therefore do not have permissions to access non-public pangeo datasets.

To verify this, you could try

def func():
    # function to load the data. Something like
    import s3fs  # import inside func so it is available on the workers too
    fs = s3fs.S3FileSystem()  # rely on the service account for credentials
    fs.open("path/to/private/object")

In theory, func() should work on the client, but client.run(func) would fail.

scottyhq commented 4 years ago

Thanks @TomAugspurger - forgot to include a code block! Here is output from your test case run on the aws-uswest2 hub:

(s3fs=0.4, dask=2.8.1, botocore=1.13.29)

def func():
    import s3fs
    # function to load the data. Something like
    fs = s3fs.S3FileSystem()  # rely on the service account
    fs.open("pangeo-data-uswest2/esip/NWM2/2017")

client.run(func)
/srv/conda/envs/notebook/lib/python3.7/site-packages/botocore/auth.py in add_auth()
    355     def add_auth(self, request):
    356         if self.credentials is None:
--> 357             raise NoCredentialsError
    358         datetime_now = datetime.datetime.utcnow()
    359         request.context['timestamp'] = datetime_now.strftime(SIGV4_TIMESTAMP)

NoCredentialsError: Unable to locate credentials

Note also that the AWS docs suggest a minimum AWS CLI version of 1.16.283 to resolve credentials via the service account, and that version seems to install botocore 1.13.19.

martindurant commented 4 years ago

It would make sense to me if the dask workers and the normal user interactive pods had the same ownership and permissions. The only difference is that a dask worker would not normally want to create new pods (but it perhaps could).

Is the above situation with dask-kubernetes or dask-gateway?

scottyhq commented 4 years ago

It would make sense to me if the dask workers and the normal user interactive pods had the same ownership and permissions. The only difference is that a dask worker would not normally want to create new pods (but it perhaps could).

Agreed. Is it possible for any dask pods created by a user pod to inherit the same service account? A short-term, easy fix is to assign all dask pods the daskkubernetes service account in some dask config setting (here? https://github.com/pangeo-data/pangeo-stacks/blob/master/base-notebook/binder/dask_config.yaml#L31), as sketched below. But further down the line it would be useful for each user to have a unique service account / IAM role (for granular permissions and cost-tracking), and then it would be best for the dask pods to inherit it.
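For example, something like this in the notebook environment should be equivalent to setting it in dask_config.yaml (untested sketch; assumes the dask-kubernetes worker-template config schema):

import dask

# Rough sketch: give every dask-kubernetes worker pod the existing
# daskkubernetes service account via the worker pod template.
dask.config.set({
    "kubernetes.worker-template.spec.serviceAccountName": "daskkubernetes",
})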

Is the above situation with dask-kubernetes or dask-gateway?

dask-kubernetes.

Still haven't tried with dask-gateway. Maybe @jhamman has?

martindurant commented 4 years ago

I suspect dask-gateway does the right thing here, and yes, I know that trials are underway, but I don't know how far they have progressed. @jcrist would also know both these things.

jcrist commented 4 years ago

A short-term easy fix is to assign all dask pods the daskkubernetes service account in some dask config setting (here? https://github.com/pangeo-data/pangeo-stacks/blob/master/base-notebook/binder/dask_config.yaml#L31).

Yeah, that should work. This wouldn't be any less secure than the status quo, and should get things working for now.

But further down the line it would be useful for each user to have a unique service account / iam role (for granular permissions and cost-tracking), and then it would be best for dask pods to inherit.

This should be doable with dask-gateway, but nothing is built in. How would you map usernames to IAM roles/service accounts? If there's a way to do this where dask-gateway doesn't need to store and manage this mapping, then this should be fairly easy to hack up with no additional changes to the gateway core itself.
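For example, a purely deterministic convention (hypothetical, nothing like this is built into dask-gateway) would avoid storing any mapping at all:

# Hypothetical naming convention: derive the role ARN and service account
# from the username, so no mapping needs to be stored anywhere. Assumes the
# roles/accounts are created ahead of time (or on first login) with this
# exact scheme.
AWS_ACCOUNT_ID = "123456789012"  # placeholder

def iam_role_for_user(username: str) -> str:
    return f"arn:aws:iam::{AWS_ACCOUNT_ID}:role/pangeo-user-{username}"

def service_account_for_user(username: str) -> str:
    return f"user-{username}"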

scottyhq commented 4 years ago

How would you map usernames to IAM roles/service accounts? If there's a way to do this where dask-gateway doesn't need to store and manage this mapping then this should be fairly easy to hack up with no additional changes to the gateway core itself.

I don't think there is a straightforward way to do this currently in Zero2JupyterHubK8s config. See https://github.com/dask/dask-kubernetes/issues/202#issuecomment-546864643 and https://github.com/jupyterhub/kubespawner/pull/304.

1) If 304 linked above is merged, it would be straightforward to create a per-user IAM role as part of a pod startup script and link it to the service account in the per-user namespace: https://docs.aws.amazon.com/eks/latest/userguide/specify-service-account-role.html

2) Alternatively, it seems possible to make an 'assume role' API call as part of a startup script and inject temporary credentials as environment variables (see https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1103 and the sketch below).
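A rough sketch of option 2 with boto3 (the role ARN is a placeholder; error handling omitted):

import os
import boto3

# Sketch: at pod startup, assume a per-user role and expose the temporary
# credentials as environment variables that botocore/s3fs pick up. Note the
# credentials expire, so something would need to refresh them.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/pangeo-user-scottyhq",  # placeholder
    RoleSessionName="notebook-startup",
)["Credentials"]

os.environ["AWS_ACCESS_KEY_ID"] = creds["AccessKeyId"]
os.environ["AWS_SECRET_ACCESS_KEY"] = creds["SecretAccessKey"]
os.environ["AWS_SESSION_TOKEN"] = creds["SessionToken"]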

jcrist commented 4 years ago

I think you could do this right now by configuring a post_auth_hook to create a new service account/IAM role for the user (if not already created). The service account could then be configured for the notebook pod by adding a modify_pod_hook (alternatively, these could be combined into just a modify_pod_hook; probably fine either way). This would allow jupyterhub to manage creating the per-user service accounts. I don't think a separate namespace per user would be needed at all here, but I may be wrong.
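Rough, untested sketch of what I mean for the z2jh config (ensure_user_role is a hypothetical helper that creates the per-user IAM role and annotated service account if they don't already exist):

def modify_pod_hook(spawner, pod):
    # pod is a kubernetes-client V1Pod object
    username = spawner.user.name
    sa_name = f"user-{username}"         # assumed naming convention
    ensure_user_role(username, sa_name)  # hypothetical helper
    pod.spec.service_account_name = sa_name
    return pod

c.KubeSpawner.modify_pod_hook = modify_pod_hook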

scottyhq commented 4 years ago

In a recent chat with @yuvipanda, he pointed me to a nice model for provisioning per-user policies and buckets on GCP that would be relevant once we get around to trying some of the approaches suggested in this issue: https://github.com/berkeley-dsep-infra/datahub/blob/staging/images/hub/sparklyspawner/sparklyspawner/__init__.py