pangeo-data / pangeo-binder

Pangeo + Binder (dev repo for a binder/pangeo fusion concept)
http://binder.pangeo.io
BSD 3-Clause "New" or "Revised" License

Switch to Individual Prometheus / Grafana Charts #172

Closed salvis2 closed 4 years ago

salvis2 commented 4 years ago

Closes #166. See that issue for background on why the other helm charts are no good.

The old stable helm charts for Prometheus and Grafana have been deprecated; the new ones are here:

- https://github.com/prometheus-community/helm-charts
- https://github.com/grafana/helm-charts

pangeo-binder/requirements.yaml and k8s-aws/readme.md have been updated with the new helm repos. Both charts are pinned to their most recent versions.
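For reference, here's a minimal sketch of what the pinned dependency entries in requirements.yaml look like; the version numbers below are placeholders, not the actual pins:

```yaml
# Sketch of the requirements.yaml dependencies; versions here are placeholders,
# pinned in the real file to the most recent release of each chart.
dependencies:
  - name: prometheus
    version: "11.0.0"
    repository: https://prometheus-community.github.io/helm-charts
  - name: grafana
    version: "5.0.0"
    repository: https://grafana.github.io/helm-charts
```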

Configuration changes were mostly un-indenting, since there is no longer an umbrella chart wrapping these two. However, I changed Grafana's ingress configuration from the example in the Grafana readme, and it now works with HTTPS. I also had to fill in the location of the Prometheus data source manually; the only way I could get it connected was with the ClusterIP I found via kubectl get svc -n staging staging-prometheus-server. I moved the datasource configuration to the secrets file, since it now depends on the cluster and shouldn't be a publicly known address.
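To make that concrete, here's a hedged sketch of the relevant Grafana chart values; the hostname, cert issuer, and ClusterIP are placeholders, and the real datasource block lives in the secrets file:

```yaml
# Sketch of the Grafana chart values; hostname, issuer, and IP are placeholders.
ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - grafana.staging.binder.pangeo.io
  tls:
    - secretName: grafana-staging-tls
      hosts:
        - grafana.staging.binder.pangeo.io

# Kept in the secrets file, since the address is cluster-specific:
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        # ClusterIP from: kubectl get svc -n staging staging-prometheus-server
        url: http://10.0.0.1
        access: proxy
        isDefault: true
```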

For now, since we only need the one deployment, I will leave it on staging and remove the AWS prod config for monitoring. @TomAugspurger, let me know if you'd like me to set up the equivalent config for GCP. I should be able to log in, test a manual deployment on staging, and get the ClusterIP myself.

salvis2 commented 4 years ago

I will also update the CI and fix that merge conflict.

salvis2 commented 4 years ago

Monitoring deployments are up!

Both have HTTPS, which is cool. However, the GCP site seems to have almost no data. If I launch the dashboard "Cluster Monitoring for Kubernetes" (you should see it if you go to Dashboards > Manage), I see many pods in the AWS graphs, but only one on GCP. That single "pod" is named "Value" and is using quite a bit of memory, so I suspect it's just the cluster total.
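One thing worth checking for the missing per-pod data (a guess on my part, not something I've confirmed): whether the exporters that dashboard reads from are enabled in the Prometheus chart values on GCP, e.g.:

```yaml
# Hypothetical check: toggles for the exporters that supply per-pod and
# per-node metrics; exact key names may differ between chart versions.
kubeStateMetrics:
  enabled: true
nodeExporter:
  enabled: true
```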

A thought I had when looking at the GCP binder cluster: I think there's a config error in some of the ingress setup on staging. kubectl get svc -n staging binderhub-proxy-nginx-ingress-controller shows an EXTERNAL-IP of <pending>. There is also a Service binderhub-proxy-nginx-ingress-default-backend. There are similar Services, but with names that start with staging or prod instead of binderhub-proxy, in their respective namespaces. The proper Services are also present on AWS.

On both staging deployments, I ran kubectl get deployments -n staging. The GCP one appears to have these extra nginx-ingress bits, which are 41 days old. Maybe they were deployed manually by accident? Do you think I can just delete them, @TomAugspurger?

TomAugspurger commented 4 years ago

It's definitely possible that I messed up the ingress stuff. I'm fine with you deleting them, and if something breaks then I can take a look.

salvis2 commented 4 years ago

Deleted those Services / Deployments and nothing broke. The dashboards still show wildly different levels of detail, though.

TomAugspurger commented 4 years ago

Thanks for working on this. Feel free to merge when you're ready.