Sounds like BinderHub was already fixed to work with k8s 1.16 in November 2019:
I guess there's no harm in trying a delete.
Not sure. I would do a manual inspection to delete k8s resources that may still be registered with the k8s api-server and try to clean up from there. They may not be found any more without specifying their full name. Instead of kubectl get deployments, you may need kubectl get deployments.v1beta1.extensions or something like that (I don't remember the exact syntax, but resources can be addressed by their kind followed by API version and group, and what's supported has changed).
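Something like this might be what that lookup looks like (the exact qualified names here are a guess on my part):
# see which API group the api-server currently serves deployments from
kubectl api-resources | grep -i deployment
# fully qualified form: resource.version.group
kubectl get deployments.v1.apps -n turing
kubectl get deployments.v1beta1.extensions -n turing   # will likely just error on 1.16+, where this version is gone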
My hunch is that the problem is that resources of type extensions/v1beta1 were created when the cluster was still on a version where this type was allowed. Then the version of k8s got upgraded to one where extensions/v1beta1 is no longer supported. Now we are trying to deploy a helm chart that doesn't contain any resources of type extensions/v1beta1 any more, but the current objects are of that type. So during the deploy process helm is trying to find objects of that type, and failing.
One thing that I find helpful is to look at the log of the tiller pod while the deploy is going. It often contains more useful information than what helm prints on the console.
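For example, to tail it during an upgrade (assuming tiller is in kube-system under its default deployment name):
kubectl logs -n kube-system deploy/tiller-deploy -f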
Thanks @betatim, watching the tiller pod during an upgrade attempt seems to suggest that it can't find anything. I think a fresh install is best.
I keep getting the following despite having run helm delete turing --purge and kubectl delete namespace turing, followed by chartpress and deploy.py again:
Error: validation failed: [unable to recognize "": no matches for kind "DaemonSet" in version "extensions/v1beta1", unable to recognize "": no matches for kind "Deployment" in version "extensions/v1beta1"]
Checked that the config maps and tiller pods were both available:
$ kubectl get configmaps -n kube-system
NAME DATA AGE
azure-ip-masq-agent-config 1 184d
cert-manager-cainjector-leader-election 0 183d
cert-manager-cainjector-leader-election-core 0 183d
cert-manager-controller 0 183d
cluster-autoscaler-status 1 15h
container-azm-ms-aks-k8scluster 1 17d
coredns 2 184d
coredns-autoscaler 1 184d
coredns-custom 0 184d
extension-apiserver-authentication 6 184d
omsagent-rs-config 1 17d
tunnelfront-kubecfg 3 184d
$ kubectl get pods -n kube-system | grep ^tiller
tiller-deploy-77d5bddbc9-tvl6h 1/1 Running 0 7d5h
And this is the point where @betatim and I ran out of ideas :(
out.txt: Tiller logs from the date @manics notified me that the Turing deployment failed. He pinged me around 14:53 BST. @minrk, I don't know if this will be useful?
After running helm template, I saw that most (maybe all) of these extensions/v1beta1 references were in the prometheus chart. Bumping that to the latest version may solve the issue.
The turing namespace being stuck in a terminating state, unable to delete challenge.acme.cert-manager.io/kubelego-tls-redirector-4177871607-3074425277-1319339884, might be another issue that will prevent deploy, though.
This tip to allow the resource to be deleted by skipping its finalizers (probably not great, but 🤷) has allowed the turing namespace to be deleted. Hopefully this means it will come back on the next deploy.
~For my future reference, deleting crds that won't leave the deletion of the namespace stuck in terminating state: https://github.com/jetstack/cert-manager/issues/1582#issuecomment-546615811~
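(In practice the finalizer trick boils down to something like the following; it skips whatever cleanup the finalizer was guarding, so not something to reach for lightly:)
# clear the finalizers on the stuck resource so namespace deletion can complete
kubectl patch -n turing \
  challenge.acme.cert-manager.io/kubelego-tls-redirector-4177871607-3074425277-1319339884 \
  --type merge -p '{"metadata":{"finalizers":[]}}'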
Min types faster than me 😄
Upgrading prometheus helped, but nginx-ingress also needs to be upgraded for the same reason (our version is very old). Unfortunately, when I tried to deploy that (#1494) everything went down, so I'm reverting it now since the solution is not obvious and I don't want to be fiddling with it while SciPy is going on.
Something we should fix, though: staging did not deploy with valid SSL certificates, but it passed tests and allowed deployment to prod. I don't think this should have happened.
Just a note that the Turing cluster is now out of money so this issue will be on hold until we can access the cluster again :) https://github.com/jupyterhub/mybinder.org-deploy/issues/1545
So I pulled all the changes, scrubbed the Turing cluster and tried redeploying (with cert-manager in its own namespace etc etc), and I'm still seeing this error:
Error: validation failed: unable to recognize "": no matches for kind "Deployment" in version "extensions/v1beta1"
Everything in cert-manager ns seems to be running
kubectl get pods -n cert-manager
NAME READY STATUS RESTARTS AGE
cert-manager-5f88867445-79rx8 1/1 Running 0 17m
cert-manager-cainjector-8489bd8b59-stl8z 1/1 Running 0 17m
cert-manager-webhook-55fc8db98-2hb5j 1/1 Running 0 17m
There is apparently some k8s deployment resource that has the old apiVersion of extensions/v1beta1 which has been deprecated for a very long time.
Also, I see the cert-manager webhook. I'd recommend disabling that. It's only used for validation of its own resource kinds but can cause failures of its own through the added complexity.
I'd take a proper look for anything in the output of a helm template render that includes extensions/v1beta1, for example using | grep.
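For example, something along these lines from the repo root (values files omitted, so it's only a rough check):
helm template mybinder 2>/dev/null | grep -n -B3 'extensions/v1beta1'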
Also hmmm, this could be a consequence of a kubectl version different from the k8s cluster, perhaps too new.
There is apparently some k8s deployment resource that has the old apiVersion of extensions/v1beta1 which has been deprecated for a very long time.
Yeah, unfortunately I thought we'd fixed this by upgrading prometheus and nginx-ingress 🙁 I'll see what else helm template throws up.
Also hmmm, this could be a consequence of a kubectl version different from the k8s cluster, perhaps too new.
Ok, I've now made sure my kubectl matches and will try again
Ok, we have references to extensions/v1beta1 in the following files:
The Ingress resources of the extensions/v1beta1 API are deprecated, and the new version is networking.k8s.io/v1, if I don't remember incorrectly. They are still functional until k8s 1.20 or so, though. They should be treated like in https://github.com/jupyterhub/zero-to-jupyterhub-k8s/pull/1718 to avoid future issues, but the issue experienced probably isn't related to the Ingress resources.
Ah, I see now the PR Min made. I think only the Deployment resources were removed in 1.16, though, and not also the Ingress, which is to be removed in 1.20 afaik.
@consideRatio Thanks for that info. We don't need a conditional on the apiVersion here because the networking.k8s.io API group was introduced in kubernetes 1.14 and our federation can safely require at least 1.15.
FWIW, the 1.16 [deprecation docs](https://kubernetes.io/blog/2019/07/18/api-deprecations-in-1-16/) mention this ingress removal will be in 1.22.
@sgibson91 is it possible you had old chart dependencies still trying to install deployments/v1beta1? Try rm -rf mybinder/charts before helm dep up mybinder. I also discovered in https://github.com/jupyterhub/mybinder.org-deploy/pull/1602 that removing --force from our helm upgrade command can produce much more informative errors, for those debugging deployments. I'm not sure if there's something we can easily do to show this info on a failed deployment? E.g. helm upgrade --dry-run ... on failure?
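Spelled out, that cleanup is just (from the repo root):
rm -rf mybinder/charts
helm dep up mybinder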
I just recloned the repo to avoid conflicts when pulling down new changes, so that effectively deleted mybinder/charts. I then did the following, with helm version 2.16.10 and cert-manager version 0.15.2:
- helm dep up for the mybinder dir
- chartpress
- deploy.py with the --local flag

deploy.py is now failing at the upgrading network bans stage, which is different, as it was previously failing at the helm upgrade stage.
Finally tracked down the error in ban.py
Updating network-bans for turing
Traceback (most recent call last):
File "deploy.py", line 381, in <module>
main()
File "deploy.py", line 377, in main
deploy(args.release)
File "deploy.py", line 226, in deploy
result = run_cmd([
File "deploy.py", line 52, in run_cmd
raise Exception(result["err_msg"])
Exception: File "secrets/ban.py", line 2
SyntaxError: Non-UTF-8 code starting with '\xeb' in file secrets/ban.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
I forgot to unencrypt secrets/ when I recloned.
Ok, I redeployed, but now the binder pod is having some trouble starting.
Output of deploy.py:
Waiting for all deployments and daemonsets in turing to be ready
Waiting for deployment "binder" rollout to finish: 0 of 1 updated replicas are available...
error: deployment "binder" exceeded its progress deadline
Traceback (most recent call last):
File "deploy.py", line 381, in <module>
main()
File "deploy.py", line 377, in main
deploy(args.release)
File "deploy.py", line 266, in deploy
subprocess.check_call([
File "/Users/sgibson/opt/miniconda3/envs/mybinder-deploy/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['kubectl', 'rollout', 'status', '--namespace', 'turing', '--timeout', '5m', '--watch', 'deployment.apps/binder']' returned non-zero exit status 1.
Pods:
NAME READY STATUS RESTARTS AGE
binder-587c4ff945-cthwv 0/1 ContainerCreating 0 14m
cm-acme-http-solver-4hhkf 1/1 Running 0 14m
cm-acme-http-solver-dvjpj 1/1 Running 0 14m
cm-acme-http-solver-dxlws 1/1 Running 0 14m
cm-acme-http-solver-f4xwg 1/1 Running 0 14m
cm-acme-http-solver-hb4jk 1/1 Running 0 14m
cm-acme-http-solver-jb84w 1/1 Running 0 14m
cm-acme-http-solver-s9qk2 1/1 Running 0 14m
cm-acme-http-solver-smd9x 1/1 Running 0 14m
hub-6d9dc99d8b-hg7qd 1/1 Running 0 14m
proxy-7d65b54bbf-rtq9q 1/1 Running 0 14m
proxy-patches-5d695b96d-5n5gw 2/2 Running 1 14m
redirector-6cb8676749-dktrv 1/1 Running 0 14m
turing-dind-4gm5z 1/1 Running 0 14m
turing-dind-5llqb 1/1 Running 0 14m
turing-dind-rz25k 1/1 Running 0 14m
turing-grafana-b5bbd4f66-7dmph 1/1 Running 0 14m
turing-image-cleaner-2gb5q 1/1 Running 0 14m
turing-image-cleaner-5q2j7 1/1 Running 0 14m
turing-image-cleaner-grthn 1/1 Running 0 14m
turing-ingress-nginx-controller-6cd966966-d4gm7 1/1 Running 0 14m
turing-ingress-nginx-controller-6cd966966-rgpzq 1/1 Running 0 14m
turing-ingress-nginx-controller-6cd966966-z9dtq 1/1 Running 0 14m
turing-ingress-nginx-defaultbackend-54f76fb9-szdbp 1/1 Running 0 14m
turing-kube-state-metrics-654944f9f-q688n 1/1 Running 0 14m
turing-prometheus-node-exporter-9kxlb 1/1 Running 0 14m
turing-prometheus-node-exporter-j775c 1/1 Running 0 14m
turing-prometheus-node-exporter-lkr9q 1/1 Running 0 14m
turing-prometheus-server-bf5c86687-n76kj 2/2 Running 0 14m
user-placeholder-0 1/1 Running 0 14m
user-placeholder-1 1/1 Running 0 14m
user-placeholder-2 1/1 Running 0 14m
user-placeholder-3 1/1 Running 0 14m
user-placeholder-4 1/1 Running 0 14m
Binder pod:
Name: binder-587c4ff945-cthwv
Namespace: turing
Priority: 0
Node: aks-user-14930255-vmss000001/10.240.0.35
Start Time: Sun, 13 Sep 2020 12:20:02 +0100
Labels: app=binder
component=binder
heritage=Tiller
name=binder
pod-template-hash=587c4ff945
release=turing
Annotations: checksum/config-map: ec0f65403e38e5874886f2f65c3807122117680038160ebf462e15d14dbd478d
checksum/secret: 1303627003b538bfa215035e67dc6d77447a06de8ac9613485c7d770a7360398
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/binder-587c4ff945
Containers:
binder:
Container ID:
Image: jupyterhub/k8s-binderhub:0.2.0-n217.h35366ea
Image ID:
Port: 8585/TCP
Host Port: 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 250m
memory: 1Gi
Liveness: http-get http://:binder/about delay=10s timeout=10s period=5s #success=1 #failure=3
Environment:
BUILD_NAMESPACE: turing (v1:metadata.namespace)
JUPYTERHUB_API_TOKEN: <set to the key 'binder.hub-token' in secret 'binder-secret'> Optional: false
GOOGLE_APPLICATION_CREDENTIALS: /event-secret/service-account.json
Mounts:
/etc/binderhub/config/ from config (rw)
/etc/binderhub/secret/ from secret-config (rw)
/event-secret from event-secret (ro)
/root/.docker from docker-secret (ro)
/var/run/secrets/kubernetes.io/serviceaccount from binderhub-token-sf9xg (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: binder-config
Optional: false
secret-config:
Type: Secret (a volume populated by a Secret)
SecretName: binder-secret
Optional: false
docker-secret:
Type: Secret (a volume populated by a Secret)
SecretName: binder-push-secret
Optional: false
event-secret:
Type: Secret (a volume populated by a Secret)
SecretName: events-archiver-secret
Optional: false
binderhub-token-sf9xg:
Type: Secret (a volume populated by a Secret)
SecretName: binderhub-token-sf9xg
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 15m default-scheduler Successfully assigned turing/binder-587c4ff945-cthwv to aks-user-14930255-vmss000001
Warning FailedMount 15m kubelet, aks-user-14930255-vmss000001 MountVolume.SetUp failed for volume "binderhub-token-sf9xg" : failed to sync secret cache: timed out waiting for the condition
Warning FailedMount 15m kubelet, aks-user-14930255-vmss000001 MountVolume.SetUp failed for volume "event-secret" : failed to sync secret cache: timed out waiting for the condition
Warning FailedMount 15m kubelet, aks-user-14930255-vmss000001 MountVolume.SetUp failed for volume "docker-secret" : failed to sync secret cache: timed out waiting for the condition
Warning FailedMount 15m kubelet, aks-user-14930255-vmss000001 MountVolume.SetUp failed for volume "config" : failed to sync configmap cache: timed out waiting for the condition
Warning FailedMount 10m kubelet, aks-user-14930255-vmss000001 Unable to attach or mount volumes: unmounted volumes=[event-secret], unattached volumes=[docker-secret event-secret binderhub-token-sf9xg config secret-config]: timed out waiting for the condition
Warning FailedMount 4m8s (x4 over 13m) kubelet, aks-user-14930255-vmss000001 Unable to attach or mount volumes: unmounted volumes=[event-secret], unattached volumes=[config secret-config docker-secret event-secret binderhub-token-sf9xg]: timed out waiting for the condition
Warning FailedMount 110s kubelet, aks-user-14930255-vmss000001 Unable to attach or mount volumes: unmounted volumes=[event-secret], unattached volumes=[secret-config docker-secret event-secret binderhub-token-sf9xg config]: timed out waiting for the condition
Warning FailedMount 55s (x14 over 15m) kubelet, aks-user-14930255-vmss000001 MountVolume.SetUp failed for volume "event-secret" : secret "events-archiver-secret" not found
@sgibson91 hmmm, I think that the message ...
kubelet, aks-user-14930255-vmss000001 MountVolume.SetUp failed for volume "event-secret" : failed to sync secret cache: timed out waiting for the condition
... means that the kubelet, which I think runs on every node, is failing to communicate with the k8s api-server to get information about the Secrets in the k8s cluster which it wants to mount. If so, why does the kubelet of this node fail to do that?
- Hmm... Could it be that this node has k8s software that is outdated in comparison to the k8s api-server? I believe that should be fine if it's only 1 minor version of mismatch; that's what I believe k8s plans to support.
Both client and server are on 1.16.10
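(For reference, the two can be compared with something like this; the wide node listing includes the kubelet version per node:)
kubectl version --short
kubectl get nodes -o wide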
- Hmmm... Could it be that the managed connection from this node to the k8s api-server has been affected, for example by a firewall rule being removed? With GKE, the api-server node is in another GCP project managed automatically by GCP, and they automatically create network peering between these networks as well as set up firewall rules to allow communication between them. If such a firewall rule or network peering had been disrupted, then I think you would see something like this.
I deployed the virtual network and subnet in the same resource group and haven't touched it since I first deployed this. I guess I could tear everything down and try again? Edited to add: I actually don't think the vnet gives you a firewall by default
- Hmmm, perhaps the node has entered some bad state that causes kubelet to fail to communicate with the k8s api-server? Perhaps a restart of the node/VM magically solves something?
Tried this but no magic 😢
@sgibson91 do you have a secret named events-archiver-secret in the same namespace as the pod? If I look closer at the events related to the pod that you showed, most FailedMount errors were reported 15m ago, but the most recent one concluded that that secret was not found. That means, I think, that whatever created the k8s event about the failed mount actually got information about the secrets and concluded the secret wasn't there.
So, perhaps this is caused by a mix of flakey AKS and a missing secret?
kubectl get secret -n turing events-archiver-secret
I'm also very suspicious about the following, could you report back what comes back from running...
set +x
# are these kinds of pods around?
kubectl get pods -A | grep -I csi
# is there a "CSI Driver" (container storage interface)
kubectl get csidriver -o yaml
# hmmm what pods are around in kube-system btw on an AKS cluster?
kubectl get pods -n kube-system
No, there's no events-archiver-secret returned by kubectl -n turing get secrets, but it's defined here in the config: https://github.com/jupyterhub/mybinder.org-deploy/blob/cca0ec556907f9bb102739c40b5d9bfd648227bd/config/turing.yaml#L44-L54
Does it need to be manually added?
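If it does, I assume it would be something along these lines (the key file path here is just a placeholder):
kubectl create secret generic events-archiver-secret --namespace turing \
  --from-file=service-account.json=/path/to/events-archiver-key.json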
kubectl get pods -A | grep -I csi produced no output.

Output of kubectl get csidriver -o yaml:
apiVersion: v1
items: []
kind: List
metadata:
resourceVersion: ""
selfLink: ""
kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
azure-cni-networkmonitor-lf7q9 1/1 Running 1 8d
azure-cni-networkmonitor-p4k9s 1/1 Running 1 8d
azure-cni-networkmonitor-z7gcp 1/1 Running 1 8d
azure-ip-masq-agent-2wfzd 1/1 Running 1 8d
azure-ip-masq-agent-pkkdp 1/1 Running 1 8d
azure-ip-masq-agent-zqkhx 1/1 Running 1 8d
azure-npm-tnp6d 1/1 Running 1 5d22h
azure-npm-vffhf 1/1 Running 1 5d22h
azure-npm-wbdcx 1/1 Running 1 5d22h
coredns-869cb84759-b6kkm 1/1 Running 1 8d
coredns-869cb84759-f78cc 1/1 Running 1 8d
coredns-autoscaler-5b867494f-pqvgp 1/1 Running 1 8d
dashboard-metrics-scraper-566c858889-8c5rp 1/1 Running 1 8d
kube-proxy-9tzqs 1/1 Running 1 8d
kube-proxy-njqv6 1/1 Running 1 8d
kube-proxy-q27ql 1/1 Running 1 8d
kubernetes-dashboard-7f7d6bbd7f-5chb6 1/1 Running 1 8d
metrics-server-5f4c878d8-tstdh 1/1 Running 0 2d1h
tiller-deploy-64c6dd8d6b-8vzxq 1/1 Running 1 7d12h
tunnelfront-7cb79788bd-xcbws 1/1 Running 0 2d1h
@sgibson91 ah, k8s reacts to the pod specifically trying to mount it. If it's not created as part of the Helm chart or similar, then it's an issue.
I did a search in mybinder.org-deploy and only found it declared for use specifically in the turing deployment where you referenced it. I don't know what logic is injected into the turing binderhub pod that relies on the mounted GCP service account in /event-secret/service-account.json, but one should probably git blame to find when and why those configuration lines were added.
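For example, something like this (line range taken from the config link above):
git log -L 44,54:config/turing.yaml
# or
git blame -L 44,54 config/turing.yaml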
This was the commit: https://github.com/jupyterhub/mybinder.org-deploy/commit/ab87dd2ce17451ccc3b85c0ee868b08968a50a50
This was the PR: https://github.com/jupyterhub/mybinder.org-deploy/pull/1339
Ah, so given we're also migrating GKE projects, this service account creation may need to be repeated. I scrubbed the old cluster because we had an unused node floating about that was being a pain to delete and I didn't want it soaking up money. Thanks for your help tracking this down ❤️
Wieee :)
Btw @sgibson91 @betatim and others, perhaps I could get a git-crypt key sent to me? While I'm not active in making deployments (yet), it would be useful for debugging and reviewing if there is something to improve.
I can send it across to you if/when others +1 :)
Ah, so given we're also migrating GKE projects, this service account creation may need to be repeated.
The service account should still exist in the original GKE project. Once we complete the move to the new GKE Project we will have to recreate the service accounts and update the secrets in the OVH, Gesis and Turing clusters.
The service account is https://console.cloud.google.com/iam-admin/serviceaccounts/details/100212157396162800340?project=binder-prod but we didn't add the key associated with it to the repository. For the OVH equivalent we did (grep for the key ID listed in the OVH stackdriver service account). So I think the next step is to add a new key to the service account, delete the old key, add the JSON version of the key to the secrets/ directory, and then set up the Secret in the Turing cluster. For OVH the key is referenced from the chart, so maybe we can copy that for the Turing setup as well.
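Roughly, that key rotation would look something like this (the service-account email and file name below are placeholders, not the real ones):
gcloud iam service-accounts keys create secrets/events-archiver-sa.json \
  --iam-account=events-archiver@binder-prod.iam.gserviceaccount.com
gcloud iam service-accounts keys list \
  --iam-account=events-archiver@binder-prod.iam.gserviceaccount.com
# once nothing references the old key any more:
gcloud iam service-accounts keys delete OLD_KEY_ID \
  --iam-account=events-archiver@binder-prod.iam.gserviceaccount.com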
Side thought. Is this service account for sending details to the logger that we lost access to because we thought we could ignore the email?
No, the analytics archive still works. What stopped working is streaming logs from all our pods to Stackdriver.
The events-archiver mount issue was a typo, which I also ran into in prod and fixed in #1624 (it was secrets, not secret). #1620 updated the secret to send events to the new events-archive for the new GKE project, so I think the only thing left to do for adding turing back to the federation is adding the DNS records for *.mybinder.turing.ac.uk to point to the cluster's external IP.
Hooray! The upgrade worked! 🎉 I've now set the DNS records for *.mybinder.turing.ac.uk to point to the cluster IP in Azure.
PR to reinstate the Turing is here: https://github.com/jupyterhub/mybinder.org-deploy/pull/1637/
Wieee! :D Nice work @sgibson91!
And to you and @minrk !
Yay!
I also just noticed this comment about the cert-manager webhook from @consideRatio:
Also, i see the cert-manager webhook. I'd recommend disabling that
The cert-manager webhook is no longer optional in cert-manager >= 0.14, so disabling it is not an option. This is indeed part of what made upgrading cert-manager complicated because it's hardcoded in a few places that cert-manager is running in the cert-manager namespace. The way we worked around this limitation is to follow cert-manager's own guide that says cert-manager must not be a dependency of another chart, and installed on its own in the cert-manager namespace. Then everything works fine.
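For reference, that standalone install amounts to roughly the following under helm 2 (a sketch using the versions we're running, not the exact commands used):
kubectl create namespace cert-manager
helm repo add jetstack https://charts.jetstack.io
helm repo update
# cert-manager 0.15 can install its CRDs via the chart's installCRDs value
helm install jetstack/cert-manager --name cert-manager \
  --namespace cert-manager --version v0.15.2 --set installCRDs=true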
I think this may be due to the upgrade to k8s v1.16.9.

Error message:

We can see that extensions/v1beta1 has been deprecated here. Does anyone have any suggestions to try before I do helm delete turing --purge?

Related: #1482 #1484