Closed rabernat closed 4 years ago
This is a great idea. It's been discussed a few times in separate issues. If anyone is willing to spend some time on it, I think what is described here is a viable way forward: https://github.com/pangeo-data/pangeo-cloud-federation/issues/485#issuecomment-560939910
This would be cloud-provider specific though.
Perhaps create/delete pod-associated folders within a single persistent temp data bucket? Bucket create/delete might be more cumbersome. Fear zombies.
— Rob Fatland, UW Research Computing Director
On Fri, May 15, 2020 at 12:21 PM Ryan Abernathey notifications@github.com wrote:
I've been thinking about what would help things work more smoothly on our cloud hubs in terms of data storage. One clear need is a place to put temporary data. Filesystem-based solutions are not a good solution because they are hard to share with dask workers.
What if we could create a temporary bucket for each user pod, which is automatically deleted at the end of each session? This would be awesome. We could propagate write credentials to the bucket to the dask workers, so that people could dump as much temporary data there as they want. But by deleting at the end of each session, we avoid blowing up our storage costs.
It seems like this sort of things should be possible with kubernetes, but I'm not sure how to do it.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/pangeo-data/pangeo-cloud-federation/issues/610, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPJRWN7AWTICSV5ZNWMUODRRWI4HANCNFSM4NCHD2IA .
Perhaps create/delete pod-associated folders within a single persistent temp data bucket? Bucket create/delete might be more cumbersome.
AFAIK, "folders" don't exist in object storage. You either have write privileges for the entire bucket or not. So if we do this, and we care about isolating user data, we have to do it at the bucket level.
If we don't care about isolating user data, we could just have a single, global readable / writeable bucket. But we would be open to the possibility that one user could delete everyone else's data.
This makes me miss POSIX permissions.
AFAIK, "folders" don't exist in object storage. You either have write privileges for the entire bucket or not. So if we do this, and we care about isolating user data, we have to do it at the bucket level.
True, but you can effectively accomplish this by a bucket policy or user policy for access to different "object prefixes" which act like folders. Described here for AWS, I imagine there is a similar setup for GCP: https://aws.amazon.com/blogs/security/writing-iam-policies-grant-access-to-user-specific-folders-in-an-amazon-s3-bucket/
The difficulty is mapping hub users to cloud account credentials. That would be accomplished via Auth0. It seems tricky but doable to modify KubeSpawner to link up the jupyter username to a Cloud access token, so using myself as an example I'd get access to s3://pangeo-scratch/scottyhq/ but not s3://pangeo-scratch/robfatland/ while logged in. Ultimately I think this approach will be critical in order to better track usage and costs per user.
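A sketch of the per-prefix policy that blog post describes, using the s3://pangeo-scratch/scottyhq/ example above (the actions and ARNs here are illustrative, not a tested policy):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListOwnPrefixOnly",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::pangeo-scratch"],
      "Condition": {"StringLike": {"s3:prefix": ["scottyhq/*"]}}
    },
    {
      "Sid": "ReadWriteOwnPrefix",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::pangeo-scratch/scottyhq/*"]
    }
  ]
}
```

The blog post uses an IAM policy variable in place of the hardcoded username, which is what would make this scale without one policy per user.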
If we don't care about isolating user data, we could just have a single, global readable / writeable bucket. But we would be open to the possibility that one user could delete everyone else's data.
We can do this very easily. I've tried it on the aws hub s3://pangeo-scratch with an expiration policy of 1 day. Seems to be working.
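For reference, a one-day expiration rule in S3's lifecycle configuration JSON looks roughly like this (a sketch of the shape, not the actual bucket config):

```json
{
  "Rules": [
    {
      "ID": "expire-scratch",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Expiration": {"Days": 1}
    }
  ]
}
```

Something along these lines can be applied with `aws s3api put-bucket-lifecycle-configuration`.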
Question for @yuvipanda or @consideRatio. I was trying to follow the approach here https://github.com/berkeley-dsep-infra/datahub/pull/713 to give each user a bucket in GCP. If I understand correctly, the only way to give Kubespawner commands access to additional command line tools or python libraries (such as awscli or gcloud) is to modify the standard Hub image? (https://github.com/berkeley-dsep-infra/datahub/pull/713/files#diff-e71ae0db512a9f529e23dd65da53a262)
Is there any way around that?
I imagine there is a similar setup for GCP
Yes, I think this is called Access Control Lists.
It sounds like we should be able to use ACLs and lifecycle management to provide per-user object storage!
We can't easily provide a size-based quota, which would be ideal. Instead, we can set a time limit on temporary objects. I feel like 7 or 14 days would be reasonable. Inevitably users would lose data unexpectedly until they got used to working this way.
I'd love to brainstorm a way forward on this at tomorrow's dev meeting. Some questions that come to mind:
One scratch bucket per user vs. one global scratch bucket? I'm currently leaning towards one scratch bucket per user. GCP has no limit on the number of buckets you can create. Monitoring and rules are easier to implement at the bucket level.
Sounds like the details will differ across cloud-providers, which I think is ok. For AWS the general recommendation seems to be one bucket with differing object permissions per user.
Another question to table:
For BinderHubs, keeping open to any GitHub user is key for outreach and workshops. But I'm concerned about creating many 'things' (buckets, policies, etc) for an ever-increasing number of users. It might be okay if those things are tied to a session and are deleted automatically... For Hubs, there is at least a cap of several hundred users based on GitHub organization membership.
Another option worth considering is BYOB, where we document for users how to create a bucket on their own account and connect access from the hub or binderhub. (more complicated for users obviously, but more sustainable for those of us administering resources with limited time and credits to go around).
As discussed on today's call, I'm going to try the approach of a global scratch bucket with no formal user-specific credentials, instead using an environment variable to point each user to the appropriate path.
Can someone tell me which service account corresponds to the user notebook pods? I can't figure it out. https://console.cloud.google.com/iam-admin/serviceaccounts?project=pangeo-181919
Also, dask workers will need access. I assume they are associated with "Dask Worker Service Account" (dask-worker-sa@pangeo-181919.iam.gserviceaccount.com)? Is that correct?
The helm config should point to a `pangeo` service account for both user notebooks (https://github.com/pangeo-data/pangeo-cloud-federation/blob/6a8a702cf1257aedb86679ac7d76a43fe5845567/deployments/icesat2/config/common.yaml#L40) and dask workers (https://github.com/pangeo-data/pangeo-cloud-federation/blob/6a8a702cf1257aedb86679ac7d76a43fe5845567/pangeo-deploy/values.yaml#L54).
The linking of account credentials is done via the underlying cluster config. For AWS, `kubectl get sa pangeo -n icesat2-prod -o yaml` shows the name of the linked role:
kind: ServiceAccount
metadata:
annotations:
eks.amazonaws.com/role-arn: ROLEID HERE
I'm guessing it's the same for GCP?
@scottyhq does assigning that annotation magically grant S3 read / write privileges to the pod (assuming they've been granted to that role)?
For the multicloud demo I went through a much more complicated process of mounting a secrets volume and setting a GOOGLE_APPLICATION_CREDENTIALS file: https://github.com/pangeo-data/multicloud-demo/blob/4421333b72831665fc39b2a3b7e8b4f2f2374e9f/config-gcp.yaml#L12-L23. Documented a bit at https://github.com/pangeo-data/multicloud-demo#notes-on-requester-pays-gcp
Just wanted to add a point on this:
deleting at the end of each session
That should totally be doable, but should not be counted as reliable, since python kernels or kube pods can disappear without warning. If you are going for prefixes within a bucket, then you would need to list all the files and send delete requests for each, which is potentially expensive to do (compared to nixing a whole bucket). Certainly would like to use async and batch deleting, or maybe even use a dedicated CLI tool if it does a good job. If you rely on life-cycles alone, you may well end up paying more than you hoped. Is it time to convene a group for building async into fsspec and friends?
Note that most of the object stores also allow for archiving data to cheaper storage, which is not appropriate for "scratch"/intermediates, but might be right for output data artefacts. Such archiving can also be done on a life-cycle (e.g., 7 days untouched, archive; 30days untouched, delete).
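The list-then-delete pattern described above can be sketched with fsspec's in-memory filesystem (a stand-in for gcsfs/s3fs; the bucket and user names are made up):

```python
import fsspec

# "memory" mimics a flat key/value object store, no credentials needed
fs = fsspec.filesystem("memory")
for i in range(3):
    with fs.open(f"pangeo-scratch/alice/file{i}.txt", "wb") as f:
        f.write(b"data")

# There is no single "delete this folder" call on an object store:
# list every key under the prefix, then delete each one
paths = fs.find("pangeo-scratch/alice")
fs.rm(paths)
```

With thousands of objects this becomes one delete request per key, which is where async/batch deletes (or a lifecycle rule doing the work server-side) pay off.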
Agreed that "at the end of each session" would be hard to do, for the reasons you listed.
Note that most of the object stores also allow for archiving data to cheaper storage, which is not appropriate for "scratch"/intermediates, but might be right for output data artefacts.
Why do you think this wouldn't work for scratch / intermediate? I think we could have an aggressive lifecycle policy, like objects older than 1 day are deleted
Something like this
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {
          "age": 1,
          "isLive": true
        }
      }
    ]
  }
}
I don't think this "scratch" bucket is well-suited to solving output data artifacts, so we can safely ignore that use case.
I mean that archival storage is not so useful for intermediates that are likely to be read in again in the near future - better just delete them. I don't know the specifics of each store, but generally access to archived data is not just slower, it comes with quotas and access limits, and frequent access might even end up costing more. There are probably many options for each backend...
Gotcha, thanks. I think that's a non-issue here as long as we're willing to treat this solely as scratch space. There are other, better options for making data products.
There are other, better options for making data products.
Agreed. Wouldn't it be nice to automatically create Intake catalogs for anything that is indeed written as a product? Just a thought.
does assigning that annotation magically grant S3 read / write privileges to the pod (assuming they've been granted to that role)?
The additional step is assigning whatever 'policies' you want to that role (such as S3 read/write), and there are some additional things like OIDC configuration that happens at the cluster level. Full documentation is here - https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html
Command line cluster management tools like eksctl take care of most of the details for AWS, or @salvis2 has this terraform config: https://github.com/ICESAT-2HackWeek/terraform-deploy/blob/master/aws/s3-data-bucket.tf
Right, I suppose to rephrase my question: is there something special in eks that looks for eks.amazonaws.com/role-arn and grants the role ID, or is that a standard kubernetes thing, where I just swap in the GCP names for AWS?
@TomAugspurger - as far as I can tell GKE has the equivalent approach to EKS documented here ("Workload identity"): https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity
Cool, thanks for digging that up!
@rabernat are you likely to attempt to implement this sometime soon? I'm busy most of next week, but may be able to squeeze in an hour or two on Monday to try things out.
Trying this out now for an hour.
I will likely have some time to work on this on Tuesday afternoon. Thanks @TomAugspurger for your work! Let me know where I can pick things up.
Short status update:
- Created bucket pangeo-dev-staging (I think one bucket per namespace)
- Created GCP service account gcs-scratch-sa (this can be global). Granted it read/write permissions (oh, but a big TODO: this needs to be just for the single bucket...)

Remaining work:
K8S_NAMESPACE=dev-staging
KSA_NAME=pangeo
GSA_NAME=gcs-scratch-sa
GSA_PROJECT=pangeo-181919
gcloud iam service-accounts add-iam-policy-binding \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:cluster_project.svc.id.goog[${K8S_NAMESPACE}/${KSA_NAME}]" \
${GSA_NAME}@${GSA_PROJECT}.iam.gserviceaccount.com
kubectl annotate serviceaccount \
--namespace ${K8S_NAMESPACE} \
${KSA_NAME} \
iam.gke.io/gcp-service-account=${GSA_NAME}@${GSA_PROJECT}.iam.gserviceaccount.com
Hmm one of the nodes just failed to migrate... Will dig into that a bit more but then I do have to shelve this for a bit :)
The migration might still be happening in the background. Here's the command that timed out
$ gcloud container node-pools update core-pool --cluster=dev-pangeo-io-cluster --workload-metadata=GKE_METADATA
In the GCP UI things are still updating though. If that's correct, then we'd just need to migrate the remaining pools
gcloud container node-pools update dask-pool --cluster=dev-pangeo-io-cluster --workload-metadata=GKE_METADATA
gcloud container node-pools update jupyter-pool --cluster=dev-pangeo-io-cluster --workload-metadata=GKE_METADATA
gcloud container node-pools update jupyter-pool-small --cluster=dev-pangeo-io-cluster --workload-metadata=GKE_METADATA
gcloud container node-pools update jupyter-gpu-pool --cluster=dev-pangeo-io-cluster --workload-metadata=GKE_METADATA
gcloud container node-pools update scheduler-pool --cluster=dev-pangeo-io-cluster --workload-metadata=GKE_METADATA
So I'll let that one process a bit longer before kicking off the rest.
(`--async` should let your process run in the background without the CLI waiting/timing out)
It seems async is only implemented on some gcloud commands, and maybe not this one, although there's beta and alpha... Never mind.
This doesn't look great
Let me know if you experience any issues with the GCP hubs today. In theory, everything I've done so far is reversible.
@TomAugspurger - remind me where the GCP nodegroup config is listed these days? Are you running k8s 1.15 or 1.16 currently on GKE? I'm not sure what the issue is, but for AWS nodegroup upgrades we typically create a separate nodegroup (e.g. `dask-pool-v2`), then delete the old one.
Currently on k8s 1.15.9-gke.24. Cycling the node pools makes sense. In-place migration seems like more hassle than it's worth.
Sounds like a pretty opportune time to add a terraform config to this repo :) Then we just adjust a few variable names and let it run.
I have a PoC working on a separate kubernetes cluster.
The bucket `pangeo-scratch` is not publicly accessible, but it is read / write accessible from within the cluster.
>>> import gcsfs  # and client is a dask.distributed Client for the cluster
>>> def check():
... fs = gcsfs.GCSFileSystem(token="cloud")
... return {file: fs.open(file).read() for file in fs.ls("pangeo-scratch/bar/")}
>>> def put(dask_worker):
... fs = gcsfs.GCSFileSystem(token="cloud")
... name = dask_worker.address.split(":")[-1]
... with fs.open("pangeo-scratch/bar/{}.txt".format(name), "wb") as f:
... f.write(b"hi")
>>> client.run(put)
>>> client.run(check)
{'tls://10.52.4.2:42275': {'pangeo-scratch/bar/35899.txt': b'hi',
'pangeo-scratch/bar/42275.txt': b'hi'},
'tls://10.52.4.3:35899': {'pangeo-scratch/bar/35899.txt': b'hi',
'pangeo-scratch/bar/42275.txt': b'hi'}}
The service account can only write to `pangeo-scratch`. It can't write to other buckets in the project, like `pangeo-billing`.
OSError: Forbidden: https://www.googleapis.com/upload/storage/v1/b/pangeo-billing/o
pangeo@pangeo-181919.iam.gserviceaccount.com does not have storage.objects.create access to pangeo-billing/bar/42275.txt.
I couldn't get auto nodepools working. Sorry Joe :)
I think the next step is to roll this out to the binders / hubs by creating new nodepools with this attribute set, and then deleting the old ones. I'll start with dev-staging tomorrow and see how it goes.
https://github.com/pangeo-data/pangeo-cloud-federation/pull/613 has (I think) all the necessary changes to the helm config. Just using the pangeo Kubernetes service account in more places. I think we want the user, scheduler, and worker pods to all be able to read / write to the bucket.
I've also created some node pools in the GCP cluster with workload identity enabled. We'll just need to remove the old node pools (I think we can do that whenever. No harm in doing it early I think).
I think this is working, if people want to try things out. I've enabled it for dev-staging, ocean-staging, dev-prod, ocean-prod. I'll be pushing up docs on the configuration later today or tomorrow. For now I've set the lifecycle policy on pangeo-scratch to be 1 day. Objects older than 1 day are deleted.
We'll also want to provide some docs to users about how to actually use this, but the short version is that `fs = gcsfs.GCSFileSystem(token="cloud")` should let you read / write to the pangeo-scratch bucket.
Seems to be working as well on AWS, but not thoroughly tested. NOTE: on aws-uswest2.pangeo.io, objects in s3://pangeo-scratch are wiped 24 hours after they are uploaded.
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem()
fs.ls('pangeo-scratch')
lpath = 'ATL06_20190928165055_00270510_003_01.h5'
rpath = 'pangeo-scratch/scottyhq/ATL06_20190928165055_00270510_003_01.h5'
fs.upload(lpath, rpath)
s3obj = fs.open(rpath)
ds = xr.open_dataset(s3obj, engine='h5netcdf')
Seems to be working with rechunker:
import zarr
import dask
import dask.array as da
import numpy as np
from matplotlib import pyplot as plt
import gcsfs
client = gateway.get_client()  # gateway: a dask_gateway cluster created earlier
fs = gcsfs.GCSFileSystem(token="cloud")
base_dir = "gcs://pangeo-scratch/taugspurger/rechunker/test_data"
store_source = fs.get_mapper(f'{base_dir}/source.zarr')
shape = (80000, 8000)
source_chunks = (200, 8000)
dtype = 'f4'
fs.rm(f'{base_dir}/source.zarr', recursive=True)
fs.rm(f'{base_dir}/target.zarr', recursive=True)
fs.rm(f'{base_dir}/temp.zarr', recursive=True)
a_source = zarr.ones(shape, chunks=source_chunks,
                     dtype=dtype, store=store_source)
target_store = fs.get_mapper(f'{base_dir}/target.zarr')
temp_store = fs.get_mapper(f'{base_dir}/temp.zarr')
max_mem = 25600000
target_chunks = (8000, 200)
from distributed import performance_report
from rechunker import api
res = api.rechunk_zarr2zarr_w_dask(a_source, target_chunks, max_mem,
                                   target_store, temp_store=temp_store)
with performance_report():
    out = res.compute()
Should we add something to the new chart to populate an environment variable with gs://pangeo-scratch/<user_id>/?
I don't know if gcs supports prefix-level object lifecycles, so I worry that the <user_id>/ prefix would just be deleted.
I don't know if gcs supports prefix-level object lifecycles, so I worry that the <user_id>/ prefix would just be deleted.
Does this matter? These are not actual directories, just keys. You can write to gs://pangeo-scratch/rabernat/deep/nested/path as long as the bucket exists.
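That no-directories behavior is easy to see with fsspec's in-memory filesystem (a stand-in for an object store; the path is the example above):

```python
import fsspec

fs = fsspec.filesystem("memory")
# No mkdir anywhere: "directories" are just a naming convention over flat keys
with fs.open("pangeo-scratch/rabernat/deep/nested/path", "wb") as f:
    f.write(b"hi")
print(fs.cat("pangeo-scratch/rabernat/deep/nested/path"))
```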
Sorry, I misread your comment. I thought you were suggesting pre-populating the bucket with the key pangeo-scratch/<user_id>, rather than adding an environment variable. Yes, an environment variable would help with avoiding conflicts.
Setting $PANGEO_HOME is a bit harder than I expected. We can't just set pangeo.jupyter.singleuser.extraenv.PANGEO_HOME='gcs://pangeo-scratch/$JUPYTERHUB_USER/', since we need the evaluated value of $JUPYTERHUB_USER (https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1255). I'm not sure if modifying start in https://github.com/pangeo-data/pangeo-docker-images/blob/master/pangeo-notebook/start will do the trick or not.
I'm not sure if modifying start in https://github.com/pangeo-data/pangeo-docker-images/blob/master/pangeo-notebook/start will do the trick or not.

This sounds like the way to go. We could have a short bash script which tries to figure out what cloud we are on and sets PANGEO_SCRATCH appropriately.
Note that I prefer PANGEO_SCRATCH rather than PANGEO_HOME. We should remind users at every step of the way that the storage is temporary.
So I think the steps are:
1. Set a $SCRATCH_PREFIX environment variable per deployment with the storage protocol (gs://, s3://, etc.)
2. A snippet in start that checks $SCRATCH_PREFIX and sets PANGEO_SCRATCH to $SCRATCH_PREFIX://pangeo-scratch/$JUPYTERHUB_USER/.

I'm not sure we're going to be able to expand the JUPYTERHUB_USER environment variable as you are hoping but the place to try this is either in the start script or in the single-user section of the helm chart: https://zero-to-jupyterhub.readthedocs.io/en/latest/customizing/user-environment.html#set-environment-variables
I think that it has to be the start script. The helm chart is too early on in the process.
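A minimal sketch of what that start-script snippet might look like (the variable names and defaults here are assumptions, not settled config):

```shell
#!/bin/bash
# Hypothetical addition to the pangeo-notebook "start" script.
# SCRATCH_PREFIX would be set per-deployment in the helm chart ("gs", "s3", ...);
# JUPYTERHUB_USER is set by JupyterHub when the pod is spawned.
SCRATCH_PREFIX="${SCRATCH_PREFIX:-gs}"         # example default for illustration
JUPYTERHUB_USER="${JUPYTERHUB_USER:-rabernat}" # example default for illustration
export PANGEO_SCRATCH="${SCRATCH_PREFIX}://pangeo-scratch/${JUPYTERHUB_USER}/"
echo "$PANGEO_SCRATCH"
```

Because start runs inside the user pod, JUPYTERHUB_USER is already evaluated by that point, sidestepping the helm-chart expansion problem.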
Pretty sure everything is done here.
https://github.com/2i2c-org/pilot-hubs/pull/283 is the implementation I've ended up with, relying on GKE's cloud connector - there are similar things for AWS & AKS too. I also avoided the need for setting PANGEO_SCRATCH in the docker image with some fuckery here and here. This sets everything up as soon as I create a new hub, without any need for human intervention! YAY!