pangeo-data / pangeo-cloud-federation

Deployment automation for Pangeo JupyterHubs on AWS, Google, and Azure
https://pangeo.io/cloud.html

ocean.pangeo.io maintenance hack session #622

Closed rabernat closed 4 years ago

rabernat commented 4 years ago

As discussed in #616 and https://discourse.pangeo.io/t/migration-of-ocean-pangeo-io-user-accounts/644/15, we will be doing maintenance on ocean.pangeo.io and other GCP clusters next week. @jhamman and I have blocked off Monday, June 22, 2-5pm EDT for a sprint on this. I invite everyone, and in particular @TomAugspurger, @scottyhq, @salvis2, @consideRatio, and @yuvipanda to help us out with this.

Some of the things we need to do are:

What am I missing from this list?

TomAugspurger commented 4 years ago

I'll be around for most of the session, but will have to pop out for a couple calls.

rabernat commented 4 years ago

I'm looking forward to this hack session today.

jhamman commented 4 years ago

Let's jump in https://whereby.com/pangeo to kick things off.

jhamman commented 4 years ago

Some working notes here: https://hackmd.io/@U4W-olO3TX-hc-cvbjNe4A/r13p_PRaL/edit

TomAugspurger commented 4 years ago

For " write documentation explaining deprecation of Dask Kubernetes and how to use Dask Gateway" we can pull content from https://medium.com/pangeo/pangeo-with-dask-gateway-4b638825f105, specifically https://medium.com/pangeo/pangeo-with-dask-gateway-4b638825f105#af22 for explaining how to transition.

TomAugspurger commented 4 years ago

I can look into logging / monitoring things. Both jupyterhub and Dask expose prometheus metrics, and mybinder has details on capturing & visualizing them: https://mybinder-sre.readthedocs.io/en/latest/components/metrics.html

rabernat commented 4 years ago

I can look into logging / monitoring things. Both jupyterhub and Dask expose prometheus metrics

That's awesome! What we would like most is to be able to run a query to find out how much time an individual user has accumulated over a given period on both jupyter and dask.
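Sketching what such a query might look like: kube-state-metrics exposes `kube_pod_container_status_running` as a 0/1 gauge per pod, and JupyterHub notebook pods are named `jupyter-<username>`, so summing samples over a window and multiplying by the scrape interval approximates accumulated runtime. The metric name and the 60-second scrape interval are assumptions about this cluster's monitoring setup, not confirmed details.

```python
def user_uptime_query(username: str, window: str = "30d",
                      scrape_interval_s: int = 60) -> str:
    """Build a PromQL query estimating seconds of notebook-pod uptime
    for one user over `window`.

    kube_pod_container_status_running is a 0/1 gauge, so the sum of its
    samples times the scrape interval approximates seconds running.
    """
    return (
        f'sum_over_time(kube_pod_container_status_running'
        f'{{pod="jupyter-{username}"}}[{window}]) * {scrape_interval_s}'
    )

# The query string can then be sent to Prometheus's HTTP API
# (the /api/v1/query endpoint); printed here rather than executed:
print(user_uptime_query("rabernat"))
# → sum_over_time(kube_pod_container_status_running{pod="jupyter-rabernat"}[30d]) * 60
```

The same pattern should extend to the dask side by matching dask-gateway's scheduler/worker pod names instead.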

jhamman commented 4 years ago

My update from day 1:

Still to do:

scottyhq commented 4 years ago

update hubploy / circleci configs

This will be useful for just pointing to existing images on DockerHub https://github.com/yuvipanda/hubploy/pull/75

Or if images continue to be built in this repo, it makes sense to put them on DockerHub rather than aws or gcp registries which are harder for people to get to. So could revisit https://github.com/yuvipanda/hubploy/pull/24

rabernat commented 4 years ago

The account migration is in progress. Those with credentials can see the backed up homedirs here: https://console.cloud.google.com/storage/browser/pangeo-homedir-backup

There is a long tail of very large home directories on ocean that will take a very long time to complete.

rabernat commented 4 years ago

For reference, the backup scripts are here: https://gist.github.com/rabernat/c9b352de926756342e86da662a0eadf9

TomAugspurger commented 4 years ago

Or if images continue to be built in this repo, it makes sense to put them on DockerHub rather than aws or gcp registries which are harder for people to get to.

I think we're hoping to still upload to GCP / AWS to keep the startup times as small as possible when an image does need to be downloaded.

salvis2 commented 4 years ago

Today I'll work on standing up a test cluster and testing that Linux hack to enforce user storage limits.

scottyhq commented 4 years ago

@salvis2 @rabernat - before you dive into the storage limits, do you have a solution for dealing with the fact that every user has the same uid and gid (1000,1000)? This has come up a few times before https://github.com/pangeo-data/pangeo-cloud-federation/issues/384#issuecomment-526401660 https://github.com/pangeo-data/pangeo-cloud-federation/issues/25

rabernat commented 4 years ago

My idea was to try to do the quota-ing from within the user's jupyter pod. Basically, this pod is a unix system with one user--jovyan (1000,1000)--whose home directory is mounted from an nfs server.

Is it possible to make this unix instance enforce a quota on that one user? It doesn't have to know about all the other users or address the challenge of duplicated uid / gid. It just has to prevent jovyan from creating more than 10GB of files in /home/jovyan.

Seems like it should be possible to me, but I have likely overlooked something.
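As a starting point for exploring that, here is a minimal Python sketch of the soft-check version of the idea: it only measures and reports usage from inside the pod (real enforcement would need filesystem quotas, e.g. XFS project quotas, on the NFS server itself). The 10GB limit and /home/jovyan path come from the comment above; everything else is illustrative.

```python
import os

LIMIT_BYTES = 10 * 1024**3  # 10 GB soft limit, per the comment above


def dir_usage_bytes(path: str) -> int:
    """Total size of all regular files under `path` (like `du -sb`)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished or unreadable; skip it
    return total


def over_quota(path: str = "/home/jovyan", limit: int = LIMIT_BYTES) -> bool:
    """Soft check: True if the directory exceeds the byte limit."""
    return dir_usage_bytes(path) > limit
```

A check like this could run periodically inside the pod and warn the user, but it cannot block writes the way a kernel-enforced quota would.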

scottyhq commented 4 years ago

ok. definitely sounds like something worth exploring!

One more idea/request on the topic of "update hubploy / circleci configs". I think it would be great to drop circleci in favor of github actions. Hubploy now works with github actions (for example https://github.com/ICESAT-2HackWeek/jupyterhub-2020). And we could make use of organization level secrets to reduce scattering in various places. https://github.blog/changelog/2020-05-14-organization-secrets/.
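For illustration, a hubploy deployment driven by GitHub Actions might look roughly like the workflow below. The job layout, secret name, and deployment/chart arguments are all hypothetical, not taken from any existing config in this repo.

```yaml
# .github/workflows/deploy.yaml (illustrative sketch only)
name: deploy
on:
  push:
    branches: [staging]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"
      - run: pip install hubploy
      # GCP_SA_KEY would be an organization-level secret, as suggested above
      - run: hubploy deploy gcp-uscentral1b pangeo-deploy staging
        env:
          GCP_SA_KEY: ${{ secrets.GCP_SA_KEY }}
```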

rabernat commented 4 years ago

I think it would be great to drop circleci in favor of github actions. Hubploy now works with github actions

💯 x 👍

rabernat commented 4 years ago

Home directory backup is complete. Should I just rm -rf * the NFS volume?

jhamman commented 4 years ago

Should I just rm -rf * the NFS volume?

@rabernat - let's leave it for a few days. I actually think we'll want to create a new (smaller) nfs service, so we may just remove the existing one altogether.

rabernat commented 4 years ago

@jhamman -- let me know when you're ready for me to transfer the migrated ocean.pangeo.io users to the new NFS server.

rabernat commented 4 years ago

What's the status today? Are we ready to start bringing up the new cluster?

For DNS, I suggest we go with the region-based names, i.e. us-central-1b.gcp.pangeo.io.

jhamman commented 4 years ago

Update...

@TomAugspurger and I have been working on standing up the new hub. This is going well and we should be ready for the user home directories now at the following NFS location:

10.126.142.50:/home/uscentral1b/{GITHUB_USER}

@rabernat - we're also ready to configure Auth0 and the DNS record. I can't do this because my access to the Pangeo Auth0 account is still broken.

The branch to work off right now is: https://github.com/pangeo-data/pangeo-cloud-federation/pull/626
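For context, mounting an export like the one above into singleuser pods is typically done through the chart's storage config. This fragment is an illustrative sketch, not the actual config in that branch; it uses the z2jh `singleuser.storage` keys and KubeSpawner's `{username}` template expansion:

```yaml
singleuser:
  storage:
    type: none                  # no dynamic PVC per user
    extraVolumes:
      - name: home
        nfs:
          server: 10.126.142.50
          path: /home/uscentral1b
    extraVolumeMounts:
      - name: home
        mountPath: /home/jovyan
        subPath: "{username}"   # expanded per-user by KubeSpawner
```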

salvis2 commented 4 years ago

Do the GCP clusters use NFS Provisioner for making new user home directories? There is a way to run the binary apparently that can enforce user quotas: https://github.com/kubernetes-incubator/external-storage/blob/master/nfs/docs/deployment.md#outside-of-kubernetes---binary

This doesn't appear to be an option in NFS-Client Provisioner. I'm a little fuzzy on the distinction between the two, but the first link is the only thing I could find on quotas. Linux hacking has yet to yield anything useful.

rabernat commented 4 years ago

Do the GCP clusters use NFS Provisioner for making new user home directories

I'm not sure. All I know is that they use NFS for home directories. The chart is in #262

we should be ready for the user home directories now at the following NFS location:

On it.

we're also ready to configure Auth0 and the DNS record

Do we have an IP address for the DNS record?

rabernat commented 4 years ago

I have hit a challenge with the NFS server permissions, described in #627. Any ideas would be appreciated.

rabernat commented 4 years ago

Home directories are now (or will soon be) working.

TomAugspurger commented 4 years ago

The dask side of things is up now.

I'm not familiar with how we did DNS before. Do we need to reserve some address in GCP? Right now the hub's IP is 34.69.173.244.

TomAugspurger commented 4 years ago

Telemetry stuff seems to work at a glance. We'll need to talk about what, if anything, should be public.

If you want to mess with grafana the steps currently are

cd deployments/gcp-uscentral1b
# get the password
# remove the | pbcopy if you aren't on a Mac
make print-grafana-password | pbcopy
# tunnel into the grafana server
make forward-grafana

Then log in with the username admin and the password that should be on your clipboard. I think we'll eventually hook grafana up to some auth system like GitHub.

rabernat commented 4 years ago

I'm not familiar with how we did DNS before. Do we need to reserve some address in GCP? Right now the hub's IP is 34.69.173.244.

There is some way to convert this to a permanent IP address.

TomAugspurger commented 4 years ago

Thanks, done. It's named gcp-uscentral1b.

rabernat commented 4 years ago

Yikes, I just tried doing it at the same time and got an error: "The request contains invalid arguments: Invalid value for field 'resource.address': '34.69.173.244'. Specified IP address is already reserved."

rabernat commented 4 years ago

We also have to add this IP to the chart somewhere.

rabernat commented 4 years ago

I'll work on DNS.

TomAugspurger commented 4 years ago

Thanks, sorry for the duplicate work there. Added the stable IP to the chart in https://github.com/pangeo-data/pangeo-cloud-federation/pull/628.

rabernat commented 4 years ago

This is what I see now on https://console.cloud.google.com/networking/addresses/list?project=pangeo-181919

[screenshot of the external IP addresses list]

Looks like I created a problem.

TomAugspurger commented 4 years ago

Hmm I don't see any issues.

[screenshot of the external IP addresses list]

rabernat commented 4 years ago

I'm stumped on some conceptual things for auth0. If anyone wants to hop into https://whereby.com/pangeo for a chat, I'd love to bounce some ideas around.

rabernat commented 4 years ago

Also, DNS is up (http://staging.us-central1-b.gcp.pangeo.io/) but https is not yet configured. Does anyone know how to do this?

TomAugspurger commented 4 years ago

HTTPS may just be a matter of uncommenting https://github.com/pangeo-data/pangeo-cloud-federation/blob/0c33675fa235fdf4a9c88f8daf6ec00ee01d22ad/deployments/gcp-uscentral1b/config/staging.yaml#L4-L8? Maybe also updating the email?


salvis2 commented 4 years ago

I believe you are supposed to first get the hub up-and-running without HTTPS, do some DNS pointing, then enable HTTPS. https://zero-to-jupyterhub.readthedocs.io/en/latest/administrator/security.html#https

It looks like prod had the HTTPS block always enabled: https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/gcp-uscentral1b/config/prod.yaml#L5-L8
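For reference, the uncommented z2jh block would look roughly like this (the hostname comes from the thread; the contact email is a placeholder):

```yaml
proxy:
  https:
    enabled: true
    hosts:
      - staging.us-central1-b.gcp.pangeo.io
    letsencrypt:
      contactEmail: someone@example.org   # placeholder, use a real contact
```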

consideRatio commented 4 years ago

If HTTPS doesn't configure itself properly, I know you may need to delete a secret named something like hub-proxy-tls and then delete the autohttps pod.

jhamman commented 4 years ago

My update from today:

staging

https://staging.us-central1-b.gcp.pangeo.io/ is now live and is using Pangeo's Auth0 account.

For the staging hub, the main thing to sort out is the dask gateway service. @rabernat and I were getting the following error when we took the hub for a test drive:

ClientResponseError: 503, message='Service Unavailable', url=URL('https://staging.us-central1-b.gcp.pangeo.io/services/dask-gateway/api/v1/clusters/')

prod

I added the config for https://us-central1-b.gcp.pangeo.io/ to the staging branch but I didn't manage to get it deployed. Currently running into an issue with lingering k8s resources:

Error: rendered manifests contain a resource that already exists. Unable to continue with install: existing resource conflict: namespace: , name: us-central1b-prod-grafana, existing_kind: policy/v1beta1, Kind=PodSecurityPolicy, new_kind: policy/v1beta1, Kind=PodSecurityPolicy

@TomAugspurger - this looks familiar to what we saw yesterday, no?

TomAugspurger commented 4 years ago

I thought I fixed the 503 error for gateway. Can you make sure you pulled staging before helm deploying?

TomAugspurger commented 4 years ago

I redeployed from staging. Things seem to be OK.

Not sure about prod right now.

rabernat commented 4 years ago

Is there a public endpoint for the grafana dashboards?

salvis2 commented 4 years ago

Is there a public endpoint for the grafana dashboards?

Grafana should have an External-IP / service. I know you can point a DNS record at it, but I'm still fuzzy on doing HTTPS for it through JupyterHub. @consideRatio could probably speak to that more if you are curious.

You can enable anonymous logins for Grafana and configure what anonymous users are able to see via settings on their organization role.
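The standard Grafana options for anonymous access look like the fragment below (whether they would be passed through the Helm chart's `grafana.ini` values key depends on how Grafana is deployed here):

```ini
# grafana.ini — allow unauthenticated read-only access
[auth.anonymous]
enabled = true
org_name = Main Org.
org_role = Viewer
```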

rabernat commented 4 years ago

Ah ok, I just figured out how to see grafana locally (after actually reading @TomAugspurger's comment in https://github.com/pangeo-data/pangeo-cloud-federation/issues/622#issuecomment-650311119).

I can now see a basic Grafana interface, but it doesn't have any dashboards and I don't know how to create one. Is there an issue to discuss that?

TomAugspurger commented 4 years ago

No public dashboard yet. We’ll need to decide if there’s anything that shouldn’t be public.

Right now the dashboards seem to be lost on each helm deploy. Haven’t figured out how to persist them yet.


salvis2 commented 4 years ago

I think you need to build the dashboards into the Helm release. It's not super clear, but this seems to be somewhere to start: https://github.com/helm/charts/tree/master/stable/grafana#import-dashboards
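Following that README, the values fragment would look roughly like this; the provider name, dashboard key, and JSON body are placeholders:

```yaml
dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
      - name: default
        orgId: 1
        folder: ""
        type: file
        disableDeletion: false
        options:
          path: /var/lib/grafana/dashboards/default
dashboards:
  default:
    cluster-overview:
      json: |
        {"title": "Cluster overview", "panels": []}
```

Dashboards provisioned this way live in the release values, so they survive `helm upgrade` instead of being lost on each deploy.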

jhamman commented 4 years ago

https://us-central1-b.gcp.pangeo.io is now up

No public dashboard yet. We’ll need to decide if there’s anything that shouldn’t be public.

@consideRatio - do you know if it is possible (or what it would take) to put grafana behind the admin permissions of a jupyterhub service?