I'll be around for most of the session, but will have to pop out for a couple calls.
I'm looking forward to this hack session today.
Let's jump in https://whereby.com/pangeo to kick things off.
Some working notes here: https://hackmd.io/@U4W-olO3TX-hc-cvbjNe4A/r13p_PRaL/edit
For " write documentation explaining deprecation of Dask Kubernetes and how to use Dask Gateway" we can pull content from https://medium.com/pangeo/pangeo-with-dask-gateway-4b638825f105, specifically https://medium.com/pangeo/pangeo-with-dask-gateway-4b638825f105#af22 for explaining how to transition.
I can look into logging / monitoring things. Both jupyterhub and Dask expose prometheus metrics, and mybinder has details on capturing & visualizing them: https://mybinder-sre.readthedocs.io/en/latest/components/metrics.html
I can look into logging / monitoring things. Both jupyterhub and Dask expose prometheus metrics
That's awesome! What we would like most is to be able to run a query to find out how much time an individual user has accumulated over a given period on both jupyter and dask.
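To make that concrete, here is a rough sketch of the kind of query I have in mind against the Prometheus HTTP API (the endpoint and metric names are guesses and would need adjusting to whatever we actually scrape):

```python
import requests

# Hypothetical in-cluster Prometheus endpoint; adjust to wherever the
# prometheus-server service actually lives.
PROM_URL = "http://prometheus-server/api/v1/query"

# Guess at a query: roughly how many seconds each singleuser pod was running
# over the last 30 days. kube_pod_container_status_running comes from
# kube-state-metrics; "jupyter-" is the default KubeSpawner pod-name prefix.
# Sampling at 1h resolution and multiplying by 3600 is only an approximation.
query = (
    'sum by (pod) '
    '(sum_over_time(kube_pod_container_status_running{pod=~"jupyter-.*"}[30d:1h])) * 3600'
)

resp = requests.get(PROM_URL, params={"query": query})
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    pod = result["metric"]["pod"]
    seconds = float(result["value"][1])
    print(f"{pod}: {seconds / 3600:.1f} hours")
```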
My update from day 1:
Still to do:
update hubploy / circleci configs
This will be useful for just pointing to existing images on DockerHub https://github.com/yuvipanda/hubploy/pull/75
Or if images continue to be built in this repo, it makes sense to put them on DockerHub rather than aws or gcp registries which are harder for people to get to. So could revisit https://github.com/yuvipanda/hubploy/pull/24
The account migration is in progress. Those with credentials can see the backed up homedirs here: https://console.cloud.google.com/storage/browser/pangeo-homedir-backup
There is a long tail of very large home directories on ocean that will take a very long time to complete.
For reference, the backup scripts are here: https://gist.github.com/rabernat/c9b352de926756342e86da662a0eadf9
Or if images continue to be built in this repo, it makes sense to put them on DockerHub rather than aws or gcp registries which are harder for people to get to.
I think we're hoping to still upload to GCP / AWS to keep the startup times as small as possible when an image does need to be downloaded.
Today I'll work on standing up a test cluster and testing that Linux hack to enforce user storage limits.
@salvis2 @rabernat - before you dive into the storage limits, do you have a solution for dealing with the fact that every user has the same uid and gid (1000,1000)? This has come up a few times before https://github.com/pangeo-data/pangeo-cloud-federation/issues/384#issuecomment-526401660 https://github.com/pangeo-data/pangeo-cloud-federation/issues/25
My idea was to try to do the quota-ing from within the user's jupyter pod. Basically, this pod is a unix system with one user--jovyan (1000,1000)--whose home directory is mounted from an nfs server.
Is it possible to make this unix instance enforce a quota on that one user? It doesn't have to know about all the other users or address the challenge of duplicated uid / gid. It just has to prevent jovyan from creating more than 10GB of files in /home/jovyan.
Seems like it should be possible to me, but I have likely overlooked something.
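To make the idea concrete, here is a rough sketch of the kind of check I mean, run from inside the user pod. It only measures usage against the 10GB figure above; actual enforcement is the part I haven't figured out:

```python
import os

HOME = "/home/jovyan"
LIMIT_BYTES = 10 * 1024**3  # the 10GB limit discussed above

def directory_size(path):
    """Total size of all regular files under path, in bytes."""
    total = 0
    for root, dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                total += os.lstat(full).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total

used = directory_size(HOME)
print(f"{used / 1024**3:.2f} GB used of {LIMIT_BYTES / 1024**3:.0f} GB")
if used > LIMIT_BYTES:
    print("Over quota -- at this point we'd want to warn the user or block writes.")
```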
ok. definitely sounds like something worth exploring!
One more idea/request on the topic of "update hubploy / circleci configs". I think it would be great to drop circleci in favor of github actions. Hubploy now works with github actions (for example https://github.com/ICESAT-2HackWeek/jupyterhub-2020). And we could make use of organization-level secrets to avoid scattering credentials in various places: https://github.blog/changelog/2020-05-14-organization-secrets/.
I think it would be great to drop circleci in favor of github actions. Hubploy now works with github actions
💯 x 👍
Home directory backup is complete. Should I just rm -rf * the NFS volume?
Should I just rm -rf * the NFS volume?
@rabernat - let's leave it for a few days. I actually think we'll want to create a new (smaller) NFS service, so we may just remove the existing one altogether.
@jhamman -- let me know when you're ready for me to transfer the migrated ocean.pangeo.io users to the new NFS server.
What's the status today? Are we ready to start bringing up the new cluster?
For DNS, I suggest we go with the region-based names, i.e. us-central1-b.gcp.pangeo.io.
Update...
@TomAugspurger and I have been working on standing up the new hub. This is going well and we should be ready for the user home directories now at the following NFS location:
10.126.142.50:/home/uscentral1b/{GITHUB_USER}
@rabernat - we're also ready to configure Auth0 and the DNS record. I can't do this because my access to the Pangeo Auth0 account is still broken.
The branch to work off right now is: https://github.com/pangeo-data/pangeo-cloud-federation/pull/626
Do the GCP clusters use NFS Provisioner for making new user home directories? There is a way to run the binary apparently that can enforce user quotas: https://github.com/kubernetes-incubator/external-storage/blob/master/nfs/docs/deployment.md#outside-of-kubernetes---binary
This doesn't appear to be an option in NFS-Client Provisioner. I'm a little fuzzy on the distinction between the two, but the first link is the only thing I could find on quotas. Linux hacking has yet to yield anything useful.
Do the GCP clusters use NFS Provisioner for making new user home directories
I'm not sure. All I know is that they use NFS for home directories. The chart is in #262
we should be ready for the user home directories now at the following NFS location:
On it.
we're also ready to configure Auth0 and the DNS record
Do we have an IP address for the DNS record?
I have hit a challenge with the NFS server permissions, described in #627. Any ideas would be appreciated.
Home directories are now (or will soon be) working.
The dask side of things is up now.
I'm not familiar with how we did DNS before. Do we need to reserve some address in GCP? Right now the hub's IP is 34.69.173.244.
Telemetry stuff seems to work at a glance. We'll need to talk about what if anything should be public.
If you want to mess with Grafana, the steps currently are:

```bash
cd deployments/gcp-uscentral1b

# get the password
# remove the | pbcopy if you aren't on a Mac
make print-grafana-password | pbcopy

# tunnel into the grafana server
make forward-grafana
```

Then log in with the username admin and the password that should be on your clipboard. I think we'll eventually hook Grafana up to some auth system like GitHub.
I'm not familiar with how we did DNS before. Do we need to reserve some address in GCP? Right now the hub's IP is 34.69.173.244.
There is some way to convert this to a permanent IP address.
Thanks, done. It's named gcp-uscentral1b.
Yikes, I just tried doing it at the same time. I got an error: "The request contains invalid arguments: Invalid value for field 'resource.address': '34.69.173.244'. Specified IP address is already reserved." Error code: "invalid".
We also have to add this IP to the chart somewhere.
I'll work on DNS.
Thanks, sorry for the duplicate work there. Added the stable IP to the chart in https://github.com/pangeo-data/pangeo-cloud-federation/pull/628.
This is what I see now on https://console.cloud.google.com/networking/addresses/list?project=pangeo-181919
Looks like I created a problem.
Hmm I don't see any issues.
I'm stumped on some conceptual things for auth0. If anyone wants to hop into https://whereby.com/pangeo for a chat, I'd love to bounce some ideas around.
Also, DNS is up (http://staging.us-central1-b.gcp.pangeo.io/) but https is not yet configured. Does anyone know how to do this?
HTTPS may just be a matter of uncommenting https://github.com/pangeo-data/pangeo-cloud-federation/blob/0c33675fa235fdf4a9c88f8daf6ec00ee01d22ad/deployments/gcp-uscentral1b/config/staging.yaml#L4-L8, and maybe updating the email?
I believe you are supposed to first get the hub up-and-running without HTTPS, do some DNS pointing, then enable HTTPS. https://zero-to-jupyterhub.readthedocs.io/en/latest/administrator/security.html#https
It looks like prod had the HTTPS block always enabled: https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/gcp-uscentral1b/config/prod.yaml#L5-L8
If HTTPS doesn't configure itself properly, you may need to delete a secret named something like hub-proxy-tls and then delete the autohttps pod.
My update from today:
https://staging.us-central1-b.gcp.pangeo.io/ is now live and is using Pangeo's Auth0 account.
For the staging hub, the main thing to sort out is the dask gateway service. @rabernat and I were getting the following error when we took the hub for a test drive:
ClientResponseError: 503, message='Service Unavailable', url=URL('https://staging.us-central1-b.gcp.pangeo.io/services/dask-gateway/api/v1/clusters/')
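For reference, the error shows up with nothing more than this minimal check from a notebook on the staging hub (Gateway() should pick up the gateway address and JupyterHub auth from the environment, so no arguments are needed):

```python
from dask_gateway import Gateway

# On the hub the gateway address and auth are set via environment variables,
# so the default constructor should be enough.
gateway = Gateway()

# This is the call that 503s against the /services/dask-gateway/ endpoint.
print(gateway.list_clusters())
```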
I added the config for https://us-central1-b.gcp.pangeo.io/ to the staging branch but I didn't manage to get it deployed. Currently running into an issue with lingering k8s resources:
Error: rendered manifests contain a resource that already exists. Unable to continue with install: existing resource conflict: namespace: , name: us-central1b-prod-grafana, existing_kind: policy/v1beta1, Kind=PodSecurityPolicy, new_kind: policy/v1beta1, Kind=PodSecurityPolicy
@TomAugspurger - this looks familiar to what we saw yesterday, no?
I thought I fixed the 503 error for gateway. Can you make sure you pulled staging before helm deploying?
I redeployed from staging. Things seem to be OK.
Not sure about prod right now.
Is there a public endpoint for the grafana dashboards?
Is there a public endpoint for the grafana dashboards?
Grafana should have an External-IP / service. I know you can point a DNS address at it, but I'm still fuzzy on doing HTTPS for it through JupyterHub. @consideRatio could probably speak to that more if you are curious.
You can enable anonymous logins for Grafana and configure what anonymous users are able to see via settings on their organization role.
Ah ok, I just figured out how to see Grafana locally (by actually reading @TomAugspurger's comment in https://github.com/pangeo-data/pangeo-cloud-federation/issues/622#issuecomment-650311119).
I can now see a basic Grafana interface, but it doesn't have any dashboards and I don't know how to create one. Is there an issue to discuss that?
No public dashboard yet. We’ll need to decide if there’s anything that shouldn’t be public.
Right now the dashboards seem to be lost on each helm deploy. Haven’t figured out how to persist them yet.
I think you need to build the dashboards into the Helm release. It's not super clear, but this seems to be somewhere to start: https://github.com/helm/charts/tree/master/stable/grafana#import-dashboards
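Until the dashboards live in the chart, a possible stopgap is to pull them out over Grafana's HTTP API before a redeploy, roughly like this (this assumes the make forward-grafana tunnel exposes Grafana on localhost:3000; adjust the port and password accordingly):

```python
import json
import requests

# Assumes the `make forward-grafana` tunnel is exposing Grafana locally;
# adjust the port if the Makefile forwards somewhere else.
GRAFANA = "http://localhost:3000"
AUTH = ("admin", "PASSWORD_FROM_MAKE_PRINT_GRAFANA_PASSWORD")

session = requests.Session()
session.auth = AUTH

# List all dashboards, then fetch each one and save its JSON definition.
dashboards = session.get(f"{GRAFANA}/api/search", params={"type": "dash-db"}).json()
for item in dashboards:
    dash = session.get(f"{GRAFANA}/api/dashboards/uid/{item['uid']}").json()
    fname = f"{item['uid']}.json"
    with open(fname, "w") as f:
        json.dump(dash["dashboard"], f, indent=2)
    print(f"saved {item['title']} -> {fname}")
```

The saved JSON files could then be wired back into the chart's dashboards config per the link above, so they survive a helm deploy.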
https://us-central1-b.gcp.pangeo.io is now up
No public dashboard yet. We’ll need to decide if there’s anything that shouldn’t be public.
@consideRatio - do you know if it is possible (or what it would take) to put grafana behind the admin permissions of a jupyterhub service?
As discussed in #616 and https://discourse.pangeo.io/t/migration-of-ocean-pangeo-io-user-accounts/644/15, we will be doing maintenance on ocean.pangeo.io and other GCP clusters next week. @jhamman and I have blocked off Monday, June 22, 2-5pm EDT for a sprint on this. I invite everyone, and in particular @TomAugspurger, @scottyhq, @salvis2, @consideRatio, and @yuvipanda to help us out with this.
Some of the things we need to do are:
What am I missing from this list?