Closed rabernat closed 4 years ago
This is a great idea. It's been discussed a few times in separate issues. If anyone is willing to spend some time on it, I think what is described here is a viable way forward: https://github.com/pangeo-data/pangeo-cloud-federation/issues/485#issuecomment-560939910
This would be cloud-provider specific though.
Perhaps create/delete pod-associated folders within a single persistent temp data bucket? Bucket create/delete might be more cumbersome. Fear zombies.
— Rob Fatland, UW Research Computing Director
On Fri, May 15, 2020 at 12:21 PM Ryan Abernathey notifications@github.com wrote:
I've been thinking about what would help things work more smoothly on our cloud hubs in terms of data storage. One clear need is a place to put temporary data. Filesystem-based solutions are not a good solution because they are hard to share with dask workers.
What if we could create a temporary bucket for each user pod, which is automatically deleted at the end of each session? This would be awesome. We could propagate write credentials to the bucket to the dask workers, so that people could dump as much temporary data there as they want. But by deleting at the end of each session, we avoid blowing up our storage costs.
It seems like this sort of things should be possible with kubernetes, but I'm not sure how to do it.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/pangeo-data/pangeo-cloud-federation/issues/610, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPJRWN7AWTICSV5ZNWMUODRRWI4HANCNFSM4NCHD2IA .
Perhaps create/delete pod-associated folders within a single persistent temp data bucket? Bucket create/delete might be more cumbersome.
AFAIK, "folders" don't exist in object storage. You either have write privileges for the entire bucket or not. So if we do this, and we care about isolating user data, we have to do it at the bucket level.
If we don't care about isolating user data, we could just have a single, global readable / writeable bucket. But we would be open to the possibility that one user could delete everyone else's data.
This makes me miss POSIX permissions.
AFAIK, "folders" don't exist in object storage. You either have write privileges for the entire bucket or not. So if we do this, and we care about isolating user data, we have to do it at the bucket level.
True, but you can effectively accomplish this by a bucket policy or user policy for access to different "object prefixes" which act like folders. Described here for AWS, I imagine there is a similar setup for GCP: https://aws.amazon.com/blogs/security/writing-iam-policies-grant-access-to-user-specific-folders-in-an-amazon-s3-bucket/
The difficulty is mapping hub users to cloud account credentials. That would be accomplished via Auth0. It seems tricky but doable to modify KubeSpawner to link up the jupyter username to a Cloud access token, so using myself as an example I'd get access to s3://pangeo-scratch/scottyhq/ but not s3://pangeo-scratch/robfatland/ while logged in. Ultimately I think this approach will be critical in order to better track usage and costs per user.
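A sketch of the per-prefix policy that blog post describes, using the s3://pangeo-scratch/scottyhq/ example above (the actions and ARNs here are illustrative, not a tested policy):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListOwnPrefixOnly",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::pangeo-scratch"],
      "Condition": {"StringLike": {"s3:prefix": ["scottyhq/*"]}}
    },
    {
      "Sid": "ReadWriteOwnPrefix",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::pangeo-scratch/scottyhq/*"]
    }
  ]
}
```

The blog post uses an IAM policy variable in place of the hardcoded username, which is what would make this scale without one policy per user.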
If we don't care about isolating user data, we could just have a single, global readable / writeable bucket. But we would be open to the possibility that one user could delete everyone else's data.
We can do this very easily. I've tried it on the aws hub s3://pangeo-scratch with an expiration policy of 1 day. Seems to be working.
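For reference, a one-day expiration rule in S3's lifecycle configuration JSON looks roughly like this (a sketch of the shape, not the actual bucket config):

```json
{
  "Rules": [
    {
      "ID": "expire-scratch",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Expiration": {"Days": 1}
    }
  ]
}
```

Something along these lines can be applied with `aws s3api put-bucket-lifecycle-configuration`.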
Question for @yuvipanda or @consideRatio. I was trying to follow the approach here https://github.com/berkeley-dsep-infra/datahub/pull/713 to give each user a bucket in GCP. If I understand correctly, the only way to give Kubespawner commands access to additional command line tools or python libraries (such as awscli or gcloud) is to modify the standard Hub image? (https://github.com/berkeley-dsep-infra/datahub/pull/713/files#diff-e71ae0db512a9f529e23dd65da53a262)
Is there any way around that?
I imagine there is a similar setup for GCP
Yes, I think this is called Access Control Lists.
It sounds like we should be able to use ACLs and lifecycle management to provide per-user object storage!
We can't easily provide a size-based quota, which would be ideal. Instead, we can set a time limit on temporary objects. I feel like 7 or 14 days would be reasonable. Inevitably users would lose data unexpectedly until they got used to working this way.
I'd love to brainstorm a way forward on this at tomorrow's dev meeting. Some questions that come to mind:
One scratch bucket per user vs. one global scratch bucket? I'm currently leaning towards one scratch bucket per user. GCP has no limit on the number of buckets you can create. Monitoring and rules are easier to implement at the bucket level.
Sounds like the details will differ across cloud-providers, which I think is ok. For AWS the general recommendation seems to be one bucket with differing object permissions per user.
Another question to table:
For BinderHubs, keeping open to any GitHub user is key for outreach and workshops. But I'm concerned about creating many 'things' (buckets, policies, etc) for an ever-increasing number of users. It might be okay if those things are tied to a session and are deleted automatically... For Hubs, there is at least a cap of several hundred users based on GitHub organization membership.
Another option worth considering is BYOB, where we document for users how to create a bucket on their own account and connect access from the hub or binderhub. (more complicated for users obviously, but more sustainable for those of us administering resources with limited time and credits to go around).
As discussed on today's call, I'm going to try the approach of a global scratch bucket with no formal user-specific credentials, instead using an environment variable to point each user to the appropriate path.
Can someone tell me which service account corresponds to the user notebook pods? I can't figure it out. https://console.cloud.google.com/iam-admin/serviceaccounts?project=pangeo-181919
Also, dask workers will need access. I assume they are associated with "Dask Worker Service Account" (dask-worker-sa@pangeo-181919.iam.gserviceaccount.com)? Is that correct?
The helm config should point to a `pangeo` service account for both user notebooks (https://github.com/pangeo-data/pangeo-cloud-federation/blob/6a8a702cf1257aedb86679ac7d76a43fe5845567/deployments/icesat2/config/common.yaml#L40) and dask workers (https://github.com/pangeo-data/pangeo-cloud-federation/blob/6a8a702cf1257aedb86679ac7d76a43fe5845567/pangeo-deploy/values.yaml#L54).
The linking of account credentials is done via the underlying cluster config. For AWS, `kubectl get sa pangeo -n icesat2-prod -o yaml` shows the name of the linked role:
kind: ServiceAccount
metadata:
annotations:
eks.amazonaws.com/role-arn: ROLEID HERE
I'm guessing it's the same for GCP?
@scottyhq does assigning that annotation magically grant S3 read / write privileges to the pod (assuming they've been granted to that role)?
For the multicloud demo I went through a much more complicated process of mounting a secrets volume and setting a GOOGLE_APPLICATION_CREDENTIALS file: https://github.com/pangeo-data/multicloud-demo/blob/4421333b72831665fc39b2a3b7e8b4f2f2374e9f/config-gcp.yaml#L12-L23. Documented a bit at https://github.com/pangeo-data/multicloud-demo#notes-on-requester-pays-gcp
Just wanted to add a point on this:
deleting at the end of each session
That should totally be doable, but should not be counted as reliable, since python kernels or kube pods can disappear without warning. If you are going for prefixes within a bucket, then you would need to list all the files and send delete requests for each, which is potentially expensive to do (compared to nixing a whole bucket). Certainly would like to use async and batch deleting, or maybe even use a dedicated CLI tool if it does a good job. If you rely on life-cycles alone, you may well end up paying more than you hoped. Is it time to convene a group for building async into fsspec and friends?
Note that most of the object stores also allow for archiving data to cheaper storage, which is not appropriate for "scratch"/intermediates, but might be right for output data artefacts. Such archiving can also be done on a life-cycle (e.g., 7 days untouched, archive; 30days untouched, delete).
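The list-then-delete pattern described above can be sketched with fsspec's in-memory filesystem (a stand-in for gcsfs/s3fs; the bucket and user names are made up):

```python
import fsspec

# "memory" mimics a flat key/value object store, no credentials needed
fs = fsspec.filesystem("memory")
for i in range(3):
    with fs.open(f"pangeo-scratch/alice/file{i}.txt", "wb") as f:
        f.write(b"data")

# There is no single "delete this folder" call on an object store:
# list every key under the prefix, then delete each one
paths = fs.find("pangeo-scratch/alice")
fs.rm(paths)
```

With thousands of objects this becomes one delete request per key, which is where async/batch deletes (or a lifecycle rule doing the work server-side) pay off.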
Agreed that "at the end of each session" would be hard to do, for the reasons you listed.
Note that most of the object stores also allow for archiving data to cheaper storage, which is not appropriate for "scratch"/intermediates, but might be right for output data artefacts.
Why do you think this wouldn't work for scratch / intermediate? I think we could have an aggressive lifecycle policy, like objects older than 1 day are deleted
Something like this
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {
          "age": 1,
          "isLive": true
        }
      }
    ]
  }
}
I don't think this "scratch" bucket is well-suited to solving output data artifacts, so we can safely ignore that use case.
I mean that archival storage is not so useful for intermediates that are likely to be read in again in the near future - better just delete them. I don't know the specifics of each store, but generally access to archived data is not just slower, it comes with quotas and access limits, and frequent access might even end up costing more. There are probably many options for each backend...
Gotcha, thanks. I think that's a non-issue here as long as we're willing to treat this solely as scratch space. There are other, better options for making data products.
There are other, better options for making data products.
Agreed. Wouldn't it be nice to automatically create Intake catalogs for anything that is indeed written as a product? Just a thought.
does assigning that annotation magically grant S3 read / write privileges to the pod (assuming they've been granted to that role)?
The additional step is assigning whatever 'policies' you want to that role (such as S3 read/write), and there are some additional things like OIDC configuration that happens at the cluster level. Full documentation is here - https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html
Command line cluster management tools like eksctl take care of most of the details for AWS, or @salvis2 has this terraform config: https://github.com/ICESAT-2HackWeek/terraform-deploy/blob/master/aws/s3-data-bucket.tf
Right, I suppose to rephrase my question: is there something special in eks that looks for eks.amazonaws.com/role-arn and grants the role ID, or is that a standard kubernetes thing, where I just swap in the GCP names for AWS?
@TomAugspurger - as far as I can tell GKE has the equivalent approach to EKS documented here ("Workload identity"): https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity
Cool, thanks for digging that up!
@rabernat are you likely to attempt to implement this sometime soon? I'm busy most of next week, but may be able to squeeze in an hour or two on Monday to try things out.
Trying this out now for an hour.
I will likely have some time to work on this on Tuesday afternoon. Thanks @TomAugspurger for your work! Let me know where I can pick things up.
Short status update:
- Created bucket pangeo-dev-staging (I think one bucket per namespace)
- Created GCP service account gcs-scratch-sa (this can be global). Granted it read/write permissions (oh, but a big TODO: this needs to be just for the single bucket...)

Remaining work:
K8S_NAMESPACE=dev-staging
KSA_NAME=pangeo
GSA_NAME=gcs-scratch-sa
GSA_PROJECT=pangeo-181919
gcloud iam service-accounts add-iam-policy-binding \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:cluster_project.svc.id.goog[${K8S_NAMESPACE}/${KSA_NAME}]" \
${GSA_NAME}@${GSA_PROJECT}.iam.gserviceaccount.com
kubectl annotate serviceaccount \
--namespace ${K8S_NAMESPACE} \
${KSA_NAME} \
iam.gke.io/gcp-service-account=${GSA_NAME}@${GSA_PROJECT}.iam.gserviceaccount.com
Hmm one of the nodes just failed to migrate... Will dig into that a bit more but then I do have to shelve this for a bit :)
The migration might still be happening in the background. Here's the command that timed out
$ gcloud container node-pools update core-pool --cluster=dev-pangeo-io-cluster --workload-metadata=GKE_METADATA
In the GCP UI things are still updating though. If that's correct, then we'd just need to migrate the remaining pools
gcloud container node-pools update dask-pool --cluster=dev-pangeo-io-cluster --workload-metadata=GKE_METADATA
gcloud container node-pools update jupyter-pool --cluster=dev-pangeo-io-cluster --workload-metadata=GKE_METADATA
gcloud container node-pools update jupyter-pool-small --cluster=dev-pangeo-io-cluster --workload-metadata=GKE_METADATA
gcloud container node-pools update jupyter-gpu-pool --cluster=dev-pangeo-io-cluster --workload-metadata=GKE_METADATA
gcloud container node-pools update scheduler-pool --cluster=dev-pangeo-io-cluster --workload-metadata=GKE_METADATA
So I'll let that one process a bit longer before kicking off the rest.
(`--async` should let your process run in the background without the CLI waiting/timing out)
It seems async is only implemented on some gcloud commands, and maybe not this one, although there's beta and alpha... Never mind.
This doesn't look great
Let me know if you experience any issues with the GCP hubs today. In theory, everything I've done so far is reversible.
@TomAugspurger - remind me where the GCP nodegroup config is listed these days? Are you running k8s 1.15 or 1.16 currently on GKE? I'm not sure what the issue is, but for AWS nodegroup upgrades we typically create a separate nodegroup (e.g. `dask-pool-v2`), then delete the old one.
Currently on k8s 1.15.9-gke.24. Cycling the node pools makes sense. In-place migration seems like more hassle than it's worth.
Sounds like a pretty opportune time to add a terraform config to this repo :) Then we just adjust a few variable names and let it run.
I have a PoC working on a separate kubernetes cluster.
The bucket `pangeo-scratch` is not publicly accessible, but it is read / write accessible from within the cluster.
>>> import gcsfs  # and client is a dask.distributed Client for the cluster
>>> def check():
... fs = gcsfs.GCSFileSystem(token="cloud")
... return {file: fs.open(file).read() for file in fs.ls("pangeo-scratch/bar/")}
>>> def put(dask_worker):
... fs = gcsfs.GCSFileSystem(token="cloud")
... name = dask_worker.address.split(":")[-1]
... with fs.open("pangeo-scratch/bar/{}.txt".format(name), "wb") as f:
... f.write(b"hi")
>>> client.run(put)
>>> client.run(check)
{'tls://10.52.4.2:42275': {'pangeo-scratch/bar/35899.txt': b'hi',
'pangeo-scratch/bar/42275.txt': b'hi'},
'tls://10.52.4.3:35899': {'pangeo-scratch/bar/35899.txt': b'hi',
'pangeo-scratch/bar/42275.txt': b'hi'}}
The service account can only write to `pangeo-scratch`. It can't write to other buckets in the project, like `pangeo-billing`.
OSError: Forbidden: https://www.googleapis.com/upload/storage/v1/b/pangeo-billing/o
pangeo@pangeo-181919.iam.gserviceaccount.com does not have storage.objects.create access to pangeo-billing/bar/42275.txt.
I couldn't get auto nodepools working. Sorry Joe :)
I think the next step is to roll this out to the binders / hubs by creating new nodepools with this attribute set, and then deleting the old ones. I'll start with dev-staging tomorrow and see how it goes.
https://github.com/pangeo-data/pangeo-cloud-federation/pull/613 has (I think) all the necessary changes to the helm config. Just using the pangeo Kubernetes service account in more places. I think we want the user, scheduler, and worker pods to all be able to read / write to the bucket.
I've also created some node pools in the GCP cluster with workload identity enabled. We'll just need to remove the old node pools (I think we can do that whenever. No harm in doing it early I think).
I think this is working, if people want to try things out. I've enabled it for dev-staging, ocean-staging, dev-prod, ocean-prod. I'll be pushing up docs on the configuration later today or tomorrow. For now I've set the lifecycle policy on pangeo-scratch to be 1 day. Objects older than 1 day are deleted.
We'll also want to provide some docs to users about how to actually use this, but the short version is that `fs = gcsfs.GCSFileSystem(token="cloud")` should let you read / write to the pangeo-scratch bucket.
Seems to be working as well on AWS, but not thoroughly tested. NOTE: on aws-uswest2.pangeo.io, objects in s3://pangeo-scratch are wiped 24 hours after they are uploaded.
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem()
fs.ls('pangeo-scratch')
lpath = 'ATL06_20190928165055_00270510_003_01.h5'
rpath = 'pangeo-scratch/scottyhq/ATL06_20190928165055_00270510_003_01.h5'
fs.upload(lpath, rpath)
s3obj = fs.open(rpath)
ds = xr.open_dataset(s3obj, engine='h5netcdf')
Seems to be working with rechunker:
import zarr
import dask
import dask.array as da
import numpy as np
from matplotlib import pyplot as plt
import gcsfs
client = gateway.get_client()  # gateway: a dask_gateway cluster created earlier
fs = gcsfs.GCSFileSystem(token="cloud")
base_dir = "gcs://pangeo-scratch/taugspurger/rechunker/test_data"
store_source = fs.get_mapper(f'{base_dir}/source.zarr')
shape = (80000, 8000)
source_chunks = (200, 8000)
dtype = 'f4'
fs.rm(f'{base_dir}/source.zarr', recursive=True)
fs.rm(f'{base_dir}/target.zarr', recursive=True)
fs.rm(f'{base_dir}/temp.zarr', recursive=True)
a_source = zarr.ones(shape, chunks=source_chunks,
                     dtype=dtype, store=store_source)
target_store = fs.get_mapper(f'{base_dir}/target.zarr')
temp_store = fs.get_mapper(f'{base_dir}/temp.zarr')
max_mem = 25600000
target_chunks = (8000, 200)
from distributed import performance_report
from rechunker import api
res = api.rechunk_zarr2zarr_w_dask(a_source, target_chunks, max_mem,
                                   target_store, temp_store=temp_store)
with performance_report():
    out = res.compute()
Should we add something to the new chart to populate an environment variable with gs://pangeo-scratch/<user_id>/?
I don't know if gcs supports prefix-level object lifecycles, so I worry that the <user_id>/ prefix would just be deleted.
I don't know if gcs supports prefix-level object lifecycles, so I worry that the <user_id>/ prefix would just be deleted.
Does this matter? These are not actual directories, just keys. You can write to gs://pangeo-scratch/rabernat/deep/nested/path as long as the bucket exists.
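That no-directories behavior is easy to see with fsspec's in-memory filesystem (a stand-in for an object store; the path is the example above):

```python
import fsspec

fs = fsspec.filesystem("memory")
# No mkdir anywhere: "directories" are just a naming convention over flat keys
with fs.open("pangeo-scratch/rabernat/deep/nested/path", "wb") as f:
    f.write(b"hi")
print(fs.cat("pangeo-scratch/rabernat/deep/nested/path"))
```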
Sorry, I misread your comment. I thought you were suggesting pre-populating the bucket with the key pangeo-scratch/<user_id>, rather than adding an environment variable. Yes, an environment variable would help with avoiding conflicts.
Setting $PANGEO_HOME is a bit harder than I expected. We can't just set pangeo.jupyter.singleuser.extraenv.PANGEO_HOME='gcs://pangeo-scratch/$JUPYTERHUB_USER/', since we need the evaluated value of $JUPYTERHUB_USER (https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1255). I'm not sure if modifying start in https://github.com/pangeo-data/pangeo-docker-images/blob/master/pangeo-notebook/start will do the trick or not.
I'm not sure if modifying start in https://github.com/pangeo-data/pangeo-docker-images/blob/master/pangeo-notebook/start will do the trick or not.

This sounds like the way to go. We could have a short bash script which tries to figure out what cloud we are on and sets PANGEO_SCRATCH appropriately.
Note that I prefer PANGEO_SCRATCH rather than PANGEO_HOME. We should remind users at every step of the way that the storage is temporary.
So I think the steps are:
1. Set a $SCRATCH_PREFIX environment variable per deployment with the storage protocol (gs://, s3://, etc.)
2. A snippet in start that checks $SCRATCH_PREFIX and sets PANGEO_SCRATCH to $SCRATCH_PREFIX://pangeo-scratch/$JUPYTERHUB_USER/.

I'm not sure we're going to be able to expand the JUPYTERHUB_USER environment variable as you are hoping but the place to try this is either in the start script or in the single-user section of the helm chart: https://zero-to-jupyterhub.readthedocs.io/en/latest/customizing/user-environment.html#set-environment-variables
I think that it has to be the start script. The helm chart is too early on in the process.
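A minimal sketch of what that start-script snippet might look like (the variable names and defaults here are assumptions, not settled config):

```shell
#!/bin/bash
# Hypothetical addition to the pangeo-notebook "start" script.
# SCRATCH_PREFIX would be set per-deployment in the helm chart ("gs", "s3", ...);
# JUPYTERHUB_USER is set by JupyterHub when the pod is spawned.
SCRATCH_PREFIX="${SCRATCH_PREFIX:-gs}"         # example default for illustration
JUPYTERHUB_USER="${JUPYTERHUB_USER:-rabernat}" # example default for illustration
export PANGEO_SCRATCH="${SCRATCH_PREFIX}://pangeo-scratch/${JUPYTERHUB_USER}/"
echo "$PANGEO_SCRATCH"
```

Because start runs inside the user pod, JUPYTERHUB_USER is already evaluated by that point, sidestepping the helm-chart expansion problem.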
Pretty sure everything is done here.
https://github.com/2i2c-org/pilot-hubs/pull/283 is the implementation I've ended up with, relying on GKE's cloud connector - there are similar things for AWS & AKS too. I also avoided the need for setting PANGEO_SCRATCH in the docker image with some fuckery here and here. This sets everything up as soon as I create a new hub, without any need for human intervention! YAY!