Sounds like BinderHub was already fixed to work with k8s 1.16 in November 2019:
I guess there's no harm in trying a delete.
Not sure. I would do a manual inspection to delete k8s resources that may still be registered with the k8s api-server and try to clean up from there. They may not be found any more without specifying their full name. Instead of kubectl get deployments, you may need kubectl get deployments.v1beta1.extensions or something like that (I don't remember the exact syntax, but resources can be addressed by their kind followed by API version and group, and what's supported has changed).
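Something like this might be what that lookup looks like (the exact qualified names here are a guess on my part):
# see which API group the api-server currently serves deployments from
kubectl api-resources | grep -i deployment
# fully qualified form: resource.version.group
kubectl get deployments.v1.apps -n turing
kubectl get deployments.v1beta1.extensions -n turing   # will likely just error on 1.16+, where this version is gone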
My hunch is that the problem is that resources of type extensions/v1beta1 were created when the cluster was still on a version where this type was allowed. Then the version of k8s got upgraded to one where extensions/v1beta1 is no longer supported. Now we are trying to deploy a helm chart that doesn't contain any resources of type extensions/v1beta1 any more, but the current objects are of that type. So during the deploy process helm is trying to find objects of that type, and failing.
One thing that I find helpful is to look at the log of the tiller pod while the deploy is going. It often contains more useful information than what helm prints on the console.
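For example, to tail it during an upgrade (assuming tiller is in kube-system under its default deployment name):
kubectl logs -n kube-system deploy/tiller-deploy -f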
Thanks @betatim, watching the tiller pod during an upgrade attempt seems to suggest that it can't find anything. I think a fresh install is best.
I keep getting the following despite having run helm delete turing --purge and kubectl delete namespace turing, followed by chartpress and deploy.py again:
Error: validation failed: [unable to recognize "": no matches for kind "DaemonSet" in version "extensions/v1beta1", unable to recognize "": no matches for kind "Deployment" in version "extensions/v1beta1"]
Checked that the config maps and tiller pods were both available:
$ kubectl get configmaps -n kube-system
NAME DATA AGE
azure-ip-masq-agent-config 1 184d
cert-manager-cainjector-leader-election 0 183d
cert-manager-cainjector-leader-election-core 0 183d
cert-manager-controller 0 183d
cluster-autoscaler-status 1 15h
container-azm-ms-aks-k8scluster 1 17d
coredns 2 184d
coredns-autoscaler 1 184d
coredns-custom 0 184d
extension-apiserver-authentication 6 184d
omsagent-rs-config 1 17d
tunnelfront-kubecfg 3 184d
$ kubectl get pods -n kube-system | grep ^tiller
tiller-deploy-77d5bddbc9-tvl6h 1/1 Running 0 7d5h
And this is the point where @betatim and I ran out of ideas :(
out.txt: Tiller logs from the date @manics notified me that the Turing deployment failed. He pinged me around 14:53 BST. @minrk, I don't know if this will be useful?
After running helm template, I saw that most (maybe all) of these extensions/v1beta1 references were in the prometheus chart. Bumping that to the latest version may solve the issue.
The turing namespace being stuck in a terminating state, unable to delete challenge.acme.cert-manager.io/kubelego-tls-redirector-4177871607-3074425277-1319339884, might be another issue that will prevent deploy, though.
This tip to allow the resource to be deleted by skipping its finalizers (probably not great, but 🤷) has allowed the turing namespace to be deleted. Hopefully this means it will come back on the next deploy.
~For my future reference, deleting crds that won't leave the deletion of the namespace stuck in terminating state: https://github.com/jetstack/cert-manager/issues/1582#issuecomment-546615811~
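(In practice the finalizer trick boils down to something like the following; it skips whatever cleanup the finalizer was guarding, so not something to reach for lightly:)
# clear the finalizers on the stuck resource so namespace deletion can complete
kubectl patch -n turing \
  challenge.acme.cert-manager.io/kubelego-tls-redirector-4177871607-3074425277-1319339884 \
  --type merge -p '{"metadata":{"finalizers":[]}}'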
Min types faster than me 😄
Upgrading prometheus helped, but nginx-ingress also needs to be upgraded for the same reason (our version is very old). Unfortunately, when I tried to deploy that (#1494) everything went down, so I'm reverting it now since the solution is not obvious and I don't want to be fiddling with it while SciPy is going on.
Something we should fix, though: staging did not deploy with valid SSL certificates, but it passed tests and allowed deployment to prod. I don't think this should have happened.
Just a note that the Turing cluster is now out of money so this issue will be on hold until we can access the cluster again :) https://github.com/jupyterhub/mybinder.org-deploy/issues/1545
So I pulled all the changes, scrubbed the Turing cluster and tried redeploying (with cert-manager in its own namespace etc etc), and I'm still seeing this error:
Error: validation failed: unable to recognize "": no matches for kind "Deployment" in version "extensions/v1beta1"
Everything in cert-manager ns seems to be running
kubectl get pods -n cert-manager
NAME READY STATUS RESTARTS AGE
cert-manager-5f88867445-79rx8 1/1 Running 0 17m
cert-manager-cainjector-8489bd8b59-stl8z 1/1 Running 0 17m
cert-manager-webhook-55fc8db98-2hb5j 1/1 Running 0 17m
There is apparently some k8s deployment resource that has the old apiVersion of extensions/v1beta1 which has been deprecated for a very long time.
Also, I see the cert-manager webhook. I'd recommend disabling that. It's only used for validation of its own resource kinds but can cause failures of its own through the added complexity.
I'd take a proper look for anything in the output of a helm template render that includes extensions/v1beta1, for example using | grep.
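For example, something along these lines from the repo root (values files omitted, so it's only a rough check):
helm template mybinder 2>/dev/null | grep -n -B3 'extensions/v1beta1'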
Also hmmm, this could be a consequence of a kubectl version different from the k8s cluster, perhaps too new.
There is apparently some k8s deployment resource that has the old apiVersion of extensions/v1beta1 which has been deprecated for a very long time.
Yeah, unfortunately I thought we'd fixed this by upgrading prometheus and nginx-ingress 🙁 I'll see what else helm template throws up.
Also hmmm, this could be a consequence of a kubectl version different from the k8s cluster, perhaps too new.
Ok, I've now made sure my kubectl matches and will try again
Ok, we have references to extensions/v1beta1 in the following files:
The Ingress resources of the extensions/v1beta1 API are deprecated, and the new version is networking.k8s.io/v1, if I don't remember incorrectly. They are still functional until k8s 1.20 or so, though. They should be treated like in https://github.com/jupyterhub/zero-to-jupyterhub-k8s/pull/1718 to avoid future issues, but the issue experienced probably isn't related to the Ingress resources.
Ah, I see now the PR Min made. I think only the Deployment resources were removed in 1.16, though, and not also the Ingress, which is to be removed in 1.20 afaik.
@consideRatio Thanks for that info. We don't need a conditional on the apiVersion here because the networking.k8s.io API group was introduced in kubernetes 1.14 and our federation can safely require at least 1.15.
FWIW, the 1.16 [deprecation docs](https://kubernetes.io/blog/2019/07/18/api-deprecations-in-1-16/) mention this ingress removal will be in 1.22.
@sgibson91 is it possible you had old chart dependencies still trying to install deployments/v1beta1? Try rm -rf mybinder/charts before helm dep up mybinder. I also discovered in https://github.com/jupyterhub/mybinder.org-deploy/pull/1602 that removing --force from our helm upgrade command can produce much more informative errors, for those debugging deployments. I'm not sure if there's something we can easily do to show this info on a failed deployment? E.g. helm upgrade --dry-run ... on failure?
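Spelled out, that cleanup is just (from the repo root):
rm -rf mybinder/charts
helm dep up mybinder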
I just recloned the repo to avoid conflicts when pulling down new changes, so that effectively deleted mybinder/charts. I then did the following, with helm version 2.16.10 and cert-manager version 0.15.2:
- helm dep up for the mybinder dir
- chartpress
- deploy.py with the --local flag

deploy.py is now failing at the upgrading network bans stage, which is different, as it was previously failing at the helm upgrade stage.
Finally tracked down the error in ban.py
Updating network-bans for turing
Traceback (most recent call last):
File "deploy.py", line 381, in <module>
main()
File "deploy.py", line 377, in main
deploy(args.release)
File "deploy.py", line 226, in deploy
result = run_cmd([
File "deploy.py", line 52, in run_cmd
raise Exception(result["err_msg"])
Exception: File "secrets/ban.py", line 2
SyntaxError: Non-UTF-8 code starting with '\xeb' in file secrets/ban.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
I forgot to unencrypt secrets/ when I recloned.
Ok, I redeployed, but now the binder pod is having some trouble starting.
Output of deploy.py:
Waiting for all deployments and daemonsets in turing to be ready
Waiting for deployment "binder" rollout to finish: 0 of 1 updated replicas are available...
error: deployment "binder" exceeded its progress deadline
Traceback (most recent call last):
File "deploy.py", line 381, in <module>
main()
File "deploy.py", line 377, in main
deploy(args.release)
File "deploy.py", line 266, in deploy
subprocess.check_call([
File "/Users/sgibson/opt/miniconda3/envs/mybinder-deploy/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['kubectl', 'rollout', 'status', '--namespace', 'turing', '--timeout', '5m', '--watch', 'deployment.apps/binder']' returned non-zero exit status 1.
Pods:
NAME READY STATUS RESTARTS AGE
binder-587c4ff945-cthwv 0/1 ContainerCreating 0 14m
cm-acme-http-solver-4hhkf 1/1 Running 0 14m
cm-acme-http-solver-dvjpj 1/1 Running 0 14m
cm-acme-http-solver-dxlws 1/1 Running 0 14m
cm-acme-http-solver-f4xwg 1/1 Running 0 14m
cm-acme-http-solver-hb4jk 1/1 Running 0 14m
cm-acme-http-solver-jb84w 1/1 Running 0 14m
cm-acme-http-solver-s9qk2 1/1 Running 0 14m
cm-acme-http-solver-smd9x 1/1 Running 0 14m
hub-6d9dc99d8b-hg7qd 1/1 Running 0 14m
proxy-7d65b54bbf-rtq9q 1/1 Running 0 14m
proxy-patches-5d695b96d-5n5gw 2/2 Running 1 14m
redirector-6cb8676749-dktrv 1/1 Running 0 14m
turing-dind-4gm5z 1/1 Running 0 14m
turing-dind-5llqb 1/1 Running 0 14m
turing-dind-rz25k 1/1 Running 0 14m
turing-grafana-b5bbd4f66-7dmph 1/1 Running 0 14m
turing-image-cleaner-2gb5q 1/1 Running 0 14m
turing-image-cleaner-5q2j7 1/1 Running 0 14m
turing-image-cleaner-grthn 1/1 Running 0 14m
turing-ingress-nginx-controller-6cd966966-d4gm7 1/1 Running 0 14m
turing-ingress-nginx-controller-6cd966966-rgpzq 1/1 Running 0 14m
turing-ingress-nginx-controller-6cd966966-z9dtq 1/1 Running 0 14m
turing-ingress-nginx-defaultbackend-54f76fb9-szdbp 1/1 Running 0 14m
turing-kube-state-metrics-654944f9f-q688n 1/1 Running 0 14m
turing-prometheus-node-exporter-9kxlb 1/1 Running 0 14m
turing-prometheus-node-exporter-j775c 1/1 Running 0 14m
turing-prometheus-node-exporter-lkr9q 1/1 Running 0 14m
turing-prometheus-server-bf5c86687-n76kj 2/2 Running 0 14m
user-placeholder-0 1/1 Running 0 14m
user-placeholder-1 1/1 Running 0 14m
user-placeholder-2 1/1 Running 0 14m
user-placeholder-3 1/1 Running 0 14m
user-placeholder-4 1/1 Running 0 14m
Binder pod:
Name: binder-587c4ff945-cthwv
Namespace: turing
Priority: 0
Node: aks-user-14930255-vmss000001/10.240.0.35
Start Time: Sun, 13 Sep 2020 12:20:02 +0100
Labels: app=binder
component=binder
heritage=Tiller
name=binder
pod-template-hash=587c4ff945
release=turing
Annotations: checksum/config-map: ec0f65403e38e5874886f2f65c3807122117680038160ebf462e15d14dbd478d
checksum/secret: 1303627003b538bfa215035e67dc6d77447a06de8ac9613485c7d770a7360398
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/binder-587c4ff945
Containers:
binder:
Container ID:
Image: jupyterhub/k8s-binderhub:0.2.0-n217.h35366ea
Image ID:
Port: 8585/TCP
Host Port: 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 250m
memory: 1Gi
Liveness: http-get http://:binder/about delay=10s timeout=10s period=5s #success=1 #failure=3
Environment:
BUILD_NAMESPACE: turing (v1:metadata.namespace)
JUPYTERHUB_API_TOKEN: <set to the key 'binder.hub-token' in secret 'binder-secret'> Optional: false
GOOGLE_APPLICATION_CREDENTIALS: /event-secret/service-account.json
Mounts:
/etc/binderhub/config/ from config (rw)
/etc/binderhub/secret/ from secret-config (rw)
/event-secret from event-secret (ro)
/root/.docker from docker-secret (ro)
/var/run/secrets/kubernetes.io/serviceaccount from binderhub-token-sf9xg (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: binder-config
Optional: false
secret-config:
Type: Secret (a volume populated by a Secret)
SecretName: binder-secret
Optional: false
docker-secret:
Type: Secret (a volume populated by a Secret)
SecretName: binder-push-secret
Optional: false
event-secret:
Type: Secret (a volume populated by a Secret)
SecretName: events-archiver-secret
Optional: false
binderhub-token-sf9xg:
Type: Secret (a volume populated by a Secret)
SecretName: binderhub-token-sf9xg
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 15m default-scheduler Successfully assigned turing/binder-587c4ff945-cthwv to aks-user-14930255-vmss000001
Warning FailedMount 15m kubelet, aks-user-14930255-vmss000001 MountVolume.SetUp failed for volume "binderhub-token-sf9xg" : failed to sync secret cache: timed out waiting for the condition
Warning FailedMount 15m kubelet, aks-user-14930255-vmss000001 MountVolume.SetUp failed for volume "event-secret" : failed to sync secret cache: timed out waiting for the condition
Warning FailedMount 15m kubelet, aks-user-14930255-vmss000001 MountVolume.SetUp failed for volume "docker-secret" : failed to sync secret cache: timed out waiting for the condition
Warning FailedMount 15m kubelet, aks-user-14930255-vmss000001 MountVolume.SetUp failed for volume "config" : failed to sync configmap cache: timed out waiting for the condition
Warning FailedMount 10m kubelet, aks-user-14930255-vmss000001 Unable to attach or mount volumes: unmounted volumes=[event-secret], unattached volumes=[docker-secret event-secret binderhub-token-sf9xg config secret-config]: timed out waiting for the condition
Warning FailedMount 4m8s (x4 over 13m) kubelet, aks-user-14930255-vmss000001 Unable to attach or mount volumes: unmounted volumes=[event-secret], unattached volumes=[config secret-config docker-secret event-secret binderhub-token-sf9xg]: timed out waiting for the condition
Warning FailedMount 110s kubelet, aks-user-14930255-vmss000001 Unable to attach or mount volumes: unmounted volumes=[event-secret], unattached volumes=[secret-config docker-secret event-secret binderhub-token-sf9xg config]: timed out waiting for the condition
Warning FailedMount 55s (x14 over 15m) kubelet, aks-user-14930255-vmss000001 MountVolume.SetUp failed for volume "event-secret" : secret "events-archiver-secret" not found
@sgibson91 hmmm, I think that the message ...
kubelet, aks-user-14930255-vmss000001 MountVolume.SetUp failed for volume "event-secret" : failed to sync secret cache: timed out waiting for the condition
... means that the kubelet, which I think runs on every node, is failing to communicate with the k8s api-server to get information about the Secrets in the k8s cluster which it wants to mount. If so, why does the kubelet of this node fail to do that?
- Hmm... Could it be that this node has k8s software that is outdated in comparison to the k8s api-server? I believe that should be fine if it's only 1 minor version of mismatch; that's what I believe k8s plans to support.
Both client and server are on 1.16.10
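(For reference, the two can be compared with something like this; the wide node listing includes the kubelet version per node:)
kubectl version --short
kubectl get nodes -o wide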
- Hmmm... Could it be that the managed connection from this node to the k8s api-server has been affected, for example by a firewall rule being removed? With GKE, the api-server node is in another GCP project managed automatically by GCP, and they automatically create network peering between these networks as well as set up firewall rules to allow communication between them. If such a firewall rule or network peering had been disrupted, then I think you would see something like this.
I deployed the virtual network and subnet in the same resource group and haven't touched it since I first deployed this. I guess I could tear everything down and try again? Edited to add: I actually don't think the vnet gives you a firewall by default
- Hmmm, perhaps the node has entered some bad state that causes kubelet to fail to communicate with the k8s api-server? Perhaps a restart of the node/VM magically solves something?
Tried this but no magic 😢
@sgibson91 do you have a secret named events-archiver-secret in the same namespace as the pod? If I look closer at the events related to the pod that you showed, most FailedMount errors were reported 15m ago, but the most recent one concluded that that secret was not found. That means, I think, that whatever created the k8s event about the failed mount actually got information about the secrets and concluded the secret wasn't there.
So, perhaps this is caused by a mix of flakey AKS and a missing secret?
kubectl get secret -n turing events-archiver-secret
I'm also very suspicious about the following, could you report back what comes back from running...
set +x
# are these kinds of pods around?
kubectl get pods -A | grep -I csi
# is there a "CSI Driver" (container storage interface)
kubectl get csidriver -o yaml
# hmmm what pods are around in kube-system btw on an AKS cluster?
kubectl get pods -n kube-system
No, there's no events-archiver-secret returned by kubectl -n turing get secrets, but it's defined here in the config: https://github.com/jupyterhub/mybinder.org-deploy/blob/cca0ec556907f9bb102739c40b5d9bfd648227bd/config/turing.yaml#L44-L54
Does it need to be manually added?
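If it does, I assume it would be something along these lines (the key file path here is just a placeholder):
kubectl create secret generic events-archiver-secret --namespace turing \
  --from-file=service-account.json=/path/to/events-archiver-key.json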
kubectl get pods -A | grep -I csi produced no output.

Output of kubectl get csidriver -o yaml:
apiVersion: v1
items: []
kind: List
metadata:
resourceVersion: ""
selfLink: ""
kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
azure-cni-networkmonitor-lf7q9 1/1 Running 1 8d
azure-cni-networkmonitor-p4k9s 1/1 Running 1 8d
azure-cni-networkmonitor-z7gcp 1/1 Running 1 8d
azure-ip-masq-agent-2wfzd 1/1 Running 1 8d
azure-ip-masq-agent-pkkdp 1/1 Running 1 8d
azure-ip-masq-agent-zqkhx 1/1 Running 1 8d
azure-npm-tnp6d 1/1 Running 1 5d22h
azure-npm-vffhf 1/1 Running 1 5d22h
azure-npm-wbdcx 1/1 Running 1 5d22h
coredns-869cb84759-b6kkm 1/1 Running 1 8d
coredns-869cb84759-f78cc 1/1 Running 1 8d
coredns-autoscaler-5b867494f-pqvgp 1/1 Running 1 8d
dashboard-metrics-scraper-566c858889-8c5rp 1/1 Running 1 8d
kube-proxy-9tzqs 1/1 Running 1 8d
kube-proxy-njqv6 1/1 Running 1 8d
kube-proxy-q27ql 1/1 Running 1 8d
kubernetes-dashboard-7f7d6bbd7f-5chb6 1/1 Running 1 8d
metrics-server-5f4c878d8-tstdh 1/1 Running 0 2d1h
tiller-deploy-64c6dd8d6b-8vzxq 1/1 Running 1 7d12h
tunnelfront-7cb79788bd-xcbws 1/1 Running 0 2d1h
@sgibson91 ah, k8s reacts to the pod specifically trying to mount it. If it's not created as part of the Helm chart or similar, then it's an issue.
I did a search in mybinder.org-deploy and only found it declared for use specifically in the turing deployment where you referenced it. I don't know what logic is injected into the turing binderhub pod that relies on the mounted GCP service account in /event-secret/service-account.json, but one should probably git blame to find when and why those configuration lines were added.
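For example, something like this (line range taken from the config link above):
git log -L 44,54:config/turing.yaml
# or
git blame -L 44,54 config/turing.yaml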
This was the commit: https://github.com/jupyterhub/mybinder.org-deploy/commit/ab87dd2ce17451ccc3b85c0ee868b08968a50a50
This was the PR: https://github.com/jupyterhub/mybinder.org-deploy/pull/1339
Ah, so given we're also migrating GKE projects, this service account creation may need to be repeated. I scrubbed the old cluster because we had an unused node floating about that was being a pain to delete and I didn't want it soaking up money. Thanks for your help tracking this down ❤️
Wieee :)
Btw @sgibson91 @betatim and others, perhaps I could get a git-crypt key sent to me? While I'm not active in making deployments (yet), it would be useful for debugging and reviewing if there is something to improve.
I can send it across to you if/when others +1 :)
Ah, so given we're also migrating GKE projects, this service account creation may need to be repeated.
The service account should still exist in the original GKE project. Once we complete the move to the new GKE Project we will have to recreate the service accounts and update the secrets in the OVH, Gesis and Turing clusters.
The service account is https://console.cloud.google.com/iam-admin/serviceaccounts/details/100212157396162800340?project=binder-prod but we didn't add the key associated with it to the repository. For the OVH equivalent we did (grep for the key ID listed in the OVH stackdriver service account). So I think the next step is to add a new key to the service account, delete the old key, add the JSON version of the key to the secrets/ directory, and then set up the Secret in the Turing cluster. For OVH the key is referenced from the chart, so maybe we can copy that for the Turing setup as well.
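Roughly, that key rotation would look something like this (the service-account email and file name below are placeholders, not the real ones):
gcloud iam service-accounts keys create secrets/events-archiver-sa.json \
  --iam-account=events-archiver@binder-prod.iam.gserviceaccount.com
gcloud iam service-accounts keys list \
  --iam-account=events-archiver@binder-prod.iam.gserviceaccount.com
# once nothing references the old key any more:
gcloud iam service-accounts keys delete OLD_KEY_ID \
  --iam-account=events-archiver@binder-prod.iam.gserviceaccount.com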
Side thought. Is this service account for sending details to the logger that we lost access to because we thought we could ignore the email?
No, the analytics archive still works. What stopped working is streaming logs from all our pods to Stackdriver.
The events-archiver mount issue was a typo, which I also ran into in prod and fixed in #1624 (it was secrets, not secret). #1620 updated the secret to send events to the new events-archive for the new GKE project, so I think the only thing left to do for adding turing back to the federation is adding the DNS records for *.mybinder.turing.ac.uk to point to the cluster's external IP.
Hooray! The upgrade worked! 🎉 I've now set the DNS records for *.mybinder.turing.ac.uk to point to the cluster IP in Azure.
PR to reinstate the Turing is here: https://github.com/jupyterhub/mybinder.org-deploy/pull/1637/
Wieee! :D Nice work @sgibson91!
And to you and @minrk !
Yay!
I also just noticed this comment about the cert-manager webhook from @consideRatio:
Also, i see the cert-manager webhook. I'd recommend disabling that
The cert-manager webhook is no longer optional in cert-manager >= 0.14, so disabling it is not an option. This is indeed part of what made upgrading cert-manager complicated because it's hardcoded in a few places that cert-manager is running in the cert-manager namespace. The way we worked around this limitation is to follow cert-manager's own guide that says cert-manager must not be a dependency of another chart, and installed on its own in the cert-manager namespace. Then everything works fine.
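For reference, that standalone install amounts to roughly the following under helm 2 (a sketch using the versions we're running, not the exact commands used):
kubectl create namespace cert-manager
helm repo add jetstack https://charts.jetstack.io
helm repo update
# cert-manager 0.15 can install its CRDs via the chart's installCRDs value
helm install jetstack/cert-manager --name cert-manager \
  --namespace cert-manager --version v0.15.2 --set installCRDs=true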
I think this may be due to the upgrade to k8s v1.16.9.

Error message:

We can see that extensions/v1beta1 has been deprecated here. Does anyone have any suggestions to try before I do helm delete turing --purge?

Related: #1482 #1484