scottyhq opened this issue 3 years ago
Yeah I can look into it.
We should be OK at the moment though. Seems that we can still launch sessions.
At least for GCP, I'm seeing things like
0/2 nodes are available: 1 node(s) didn't match node selector, 2 node(s) didn't have free ports for the requested pod ports.
from the `node-exporter` pods. The idea is to have one of those present on each node to export statistics, but when two are scheduled onto the same node they conflict over the host port.
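For context, a sketch of why this happens: node-exporter is conventionally deployed as a DaemonSet that binds a host port (9100 by convention), so a second pod landing on the same node fails with the "didn't have free ports" error above. This is an illustrative fragment, not our actual manifest:

```yaml
# Illustrative DaemonSet fragment (not the actual pangeo-binder config).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      containers:
        - name: node-exporter
          image: prom/node-exporter
          ports:
            - containerPort: 9100
              hostPort: 9100  # only one pod per node can bind this host port
```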
Right now we deploy these metrics as part of CI/CD, and `prometheus` is a proper dependency in the `requirements.yaml`. I'd propose handling metrics outside of the regular deployments and just doing them manually as necessary. Probably cleanest to do in a separate namespace.
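A sketch of what that manual, separate-namespace deployment could look like (the `monitoring` namespace and `metrics` release name are hypothetical, not something we've agreed on):

```shell
# Hypothetical one-off deployment of the metrics stack into its own
# namespace, outside the regular CI/CD pipeline.
kubectl create namespace monitoring
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install metrics prometheus-community/prometheus \
  --namespace monitoring
```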
Both deployed fine on staging: https://app.circleci.com/pipelines/github/pangeo-data/pangeo-binder/175/workflows/870a20a0-b68d-4a86-9313-42e0e3be522e/jobs/182
Trying out prod now.
https://app.circleci.com/pipelines/github/pangeo-data/pangeo-binder/176/workflows/7a9b9b0b-6915-4433-8636-b0ab373e1315/jobs/183 passed, so I think we're good here.
I need to sort out the grafana stuff. It's running, but the public URLs are broken. Kubernetes ingress is still a mystery to me.
Thanks Tom, much appreciated! I've confirmed that the gateway env vars can now be set.
> I need to sort out the grafana stuff. It's running, but the public URLs are broken. Kubernetes ingress is still a mystery to me.
Me too! @consideRatio could probably help guide us.
I could not see much output at all, so I dare not guess what went wrong; I think there is a need for more output on failure. The issue is that by using `helm upgrade --wait --install` you don't get much output about what went wrong if a pod didn't get into a running and ready state. I suggest not using `--wait`, and instead following up with some basic tests to validate that things work, printing lots of info if they don't.
In this script I have created some functions you could copy, for example: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/44a531531a1f6a4a45e86243508a54b183ef56bb/ci/common#L160-L202
I suggest:

- `helm lint --strict` as a test
- `helm template --validate` as a test before `helm upgrade`
- dropping `--wait` in `helm upgrade` and letting scripts do the waiting... or hmmm... maybe keep using `--wait` and run `full_namespace_report` on failure?
- `--create-namespace` as well in the `helm upgrade --install` step.

At this point, I think more information from the CI system is needed as a first step.
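Pulling those suggestions together, a rough sketch of what the CI step might look like. The chart path, release, and namespace here are placeholders, and the diagnostics on failure are a stand-in for the `full_namespace_report` helper linked above:

```shell
#!/usr/bin/env bash
set -euo pipefail

CHART=./pangeo-binder        # placeholder chart path
RELEASE=staging              # placeholder release name
NAMESPACE=staging            # placeholder namespace

# Catch chart mistakes before touching the cluster
helm lint --strict "$CHART"
helm template --validate "$CHART" > /dev/null

# Deploy without --wait; we do the waiting ourselves so we can
# dump diagnostics if something never becomes ready
helm upgrade --install --create-namespace \
  --namespace "$NAMESPACE" "$RELEASE" "$CHART"

if ! kubectl wait --namespace "$NAMESPACE" \
    --for=condition=Ready pod --all --timeout=300s; then
  # Stand-in for full_namespace_report: print lots of info on failure
  kubectl get all --namespace "$NAMESPACE"
  kubectl describe pods --namespace "$NAMESPACE"
  exit 1
fi
```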
https://github.com/pangeo-data/pangeo-binder/pull/177
CircleCI logs: https://app.circleci.com/pipelines/github/pangeo-data/pangeo-binder/174/workflows/14589804-6a0a-4ba6-86e9-f642de344f22/jobs/181
:( Unfortunately I don't have time to dig into this today @TomAugspurger. I believe even if the deployment fails we're still operational, because helm sticks with the last version, correct?
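On the "helm sticks with the last version" point: as I understand it, a failed `helm upgrade` leaves the previously deployed resources running, and you can confirm or roll back explicitly. The `staging` release and namespace names are placeholders:

```shell
# Inspect the release history; a failed upgrade shows up as a
# revision with status "failed" alongside the earlier good revision
helm history staging --namespace staging

# If needed, roll back explicitly to the last good revision
helm rollback staging --namespace staging
```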