pangeo-data / pangeo-binder

Pangeo + Binder (dev repo for a binder/pangeo fusion concept)
http://binder.pangeo.io
BSD 3-Clause "New" or "Revised" License

prod deployment failed on both AWS and GCP #178

Open scottyhq opened 3 years ago

scottyhq commented 3 years ago

https://github.com/pangeo-data/pangeo-binder/pull/177

CircleCI logs: https://app.circleci.com/pipelines/github/pangeo-data/pangeo-binder/174/workflows/14589804-6a0a-4ba6-86e9-f642de344f22/jobs/181

:( Unfortunately I don't have time to dig into this today, @TomAugspurger. I believe that even if the deployment fails we're still operational, because Helm sticks with the last successful release, correct?

TomAugspurger commented 3 years ago

Yeah I can look into it.

We should be OK at the moment though. Seems that we can still launch sessions.
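
For reference, a quick way to check which release revisions exist and whether the latest upgrade failed; the release name and namespace below are assumptions, not taken from this repo's config:

```bash
# List revisions of the binder release; a failed upgrade shows up with a
# "failed" status while the earlier revision remains the last good one.
# Release and namespace names are assumptions.
helm history pangeo-binder --namespace prod

# Show details of the currently recorded release.
helm status pangeo-binder --namespace prod
```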

TomAugspurger commented 3 years ago

At least for GCP, I'm seeing things like

0/2 nodes are available: 1 node(s) didn't match node selector, 2 node(s) didn't have free ports for the requested pod ports.

from the node-exporter pods. The idea is to have one of those present on each node to export statistics, but when two of them are scheduled onto the same node they conflict over the host port.

Right now we deploy these metrics as part of CI/CD, and prometheus is a proper dependency in the requirements.yaml. I'd propose handling metrics outside of the regular deployments and just doing them manually as necessary. It's probably cleanest to do that in a separate namespace.
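
A minimal sketch of what that manual, out-of-band install could look like, assuming the prometheus-community chart and a dedicated monitoring namespace; the release name, chart source, and values file here are illustrative, not taken from this repo:

```bash
# Add the community Prometheus chart repo (illustrative; the chart source
# actually used by pangeo-binder may differ).
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install Prometheus (which bundles node-exporter as a DaemonSet) into its
# own namespace, outside the regular CI/CD deployment.
helm upgrade --install metrics prometheus-community/prometheus \
  --namespace monitoring --create-namespace \
  --values metrics-values.yaml   # hypothetical values file
```

Keeping it in its own namespace also means a skipped or failed metrics upgrade can't block the binder deploy itself.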

TomAugspurger commented 3 years ago

Both deployed fine on staging: https://app.circleci.com/pipelines/github/pangeo-data/pangeo-binder/175/workflows/870a20a0-b68d-4a86-9313-42e0e3be522e/jobs/182

Trying out prod now.

TomAugspurger commented 3 years ago

https://app.circleci.com/pipelines/github/pangeo-data/pangeo-binder/176/workflows/7a9b9b0b-6915-4433-8636-b0ab373e1315/jobs/183 passed, so I think we're good here.

I need to sort out the grafana stuff. It's running, but the public URLs are broken. Kubernetes ingress is still a mystery to me.
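
A couple of commands that can help narrow down where an ingress is going wrong; the namespace and ingress name here are assumptions:

```bash
# Show ingress resources and the hostnames/paths they claim (namespace assumed).
kubectl get ingress --namespace prod

# Describe one to check its backend service, TLS config, and events
# ("grafana" is a placeholder for the actual ingress name).
kubectl describe ingress grafana --namespace prod

# Confirm the backing service actually has ready endpoints.
kubectl get endpoints --namespace prod
```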

scottyhq commented 3 years ago

Thanks Tom, much appreciated! I've confirmed that the gateway env vars can now be set.

> I need to sort out the grafana stuff. It's running, but the public URLs are broken. Kubernetes ingress is still a mystery to me.

Me too! @consideRatio could probably help guide us.

consideRatio commented 3 years ago

I could not see much output at all, so I dare not guess what went wrong. I think there is a need for more output on failure. The issue is that by using helm upgrade --wait --install you don't get much output about what went wrong if a pod didn't get into a running and ready state. I suggest not using --wait and instead following up with some basic tests to validate that things work, and if they don't, printing lots of info.

In this script, I have created some functions you could copy for example: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/44a531531a1f6a4a45e86243508a54b183ef56bb/ci/common#L160-L202
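
A rough sketch of the kind of diagnostic helper those linked functions provide; this is not a copy of that script, and the default namespace is an assumption:

```bash
#!/usr/bin/env bash
# Dump enough cluster state to debug a failed deployment (sketch only).
full_namespace_report () {
    local namespace="${1:-prod}"   # assumed namespace

    echo "### Pods"
    kubectl get pods --namespace "$namespace" -o wide

    echo "### Events"
    kubectl get events --namespace "$namespace" --sort-by=.metadata.creationTimestamp

    echo "### Describe and logs for non-running pods"
    for pod in $(kubectl get pods --namespace "$namespace" \
                   --field-selector=status.phase!=Running -o name); do
        kubectl describe --namespace "$namespace" "$pod"
        kubectl logs --namespace "$namespace" "$pod" --all-containers --tail=100 || true
    done
}
```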

I suggest (a rough sketch of how these could fit together follows the list):

  1. Upgrade to Helm 3.4.0 from 3.1.2 (for good measure)
  2. Replace https://kubernetes-charts.storage.googleapis.com with https://charts.helm.sh/stable wherever it is referenced (the old stable chart repo URL is deprecated)
  3. Use helm lint --strict as a test
  4. Use helm template --validate as an additional test before helm upgrade
  5. Stop using --wait in helm upgrade and let scripts do the waiting... Or hmm... maybe keep using --wait and run the full_namespace_report on failure?
  6. Were the Helm charts just installed? Use --create-namespace as well in the helm upgrade --install step.
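
A minimal sketch of how those steps could be combined in the deploy script; the chart directory, release name, namespace, and values file are placeholders rather than this repo's actual values:

```bash
#!/usr/bin/env bash
set -euo pipefail

CHART_DIR=pangeo-binder        # placeholder chart path
RELEASE=pangeo-binder          # placeholder release name
NAMESPACE=prod                 # placeholder namespace
VALUES=deploy/prod.yaml        # placeholder values file

# Static checks before touching the cluster.
helm lint --strict "$CHART_DIR" --values "$VALUES"
helm template "$RELEASE" "$CHART_DIR" --values "$VALUES" --validate > /dev/null

# Deploy; --create-namespace covers a fresh install. Keeping --wait and
# dumping diagnostics on failure is one of the two options suggested above.
if ! helm upgrade --install "$RELEASE" "$CHART_DIR" \
       --namespace "$NAMESPACE" --create-namespace \
       --values "$VALUES" --wait --timeout 10m; then
    full_namespace_report "$NAMESPACE"   # helper sketched earlier
    exit 1
fi
```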

At this point, I think getting more information out of the CI system is the needed first step.