My guess is that your proxy-public service didn't pass onward networking traffic fast enough compared to how fast the autohttps pod started.
proxy:
  service:
    loadBalancerIP: 34.127.108.159
I can imagine a dirty hack that would for example add an init-container to the autohttps pod that slept for 10 seconds on first installation of the Helm chart. It would be a hack that wouldn't be relevant if you for example had a k8s Service that could react quicker.
I would prefer the idea of documenting that if something goes wrong for the autohttps pod, you should ensure that it is reachable at the time it starts - and you may need to restart it if it wasn't reachable at that point in time.
I would recommend that you inspect the k8s events for the autohttps pod, which you can see by doing kubectl describe pod <autohttps pod name>, as well as the proxy-public service (kubectl describe service proxy-public). Maybe also kubectl get events to see them all. If you can see, for example, that the autohttps pod starts before an event where the proxy-public service is successfully coupled with the requested IP, then that's a clear signal that you had a race condition there.
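For reference, the same inspection commands gathered in one place (a sketch - the jhub namespace is taken from elsewhere in this thread):

kubectl --namespace=jhub describe pod <autohttps pod name>
kubectl --namespace=jhub describe service proxy-public
kubectl --namespace=jhub get events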
and you may need to restart it if it wasn't reachable at that point in time.
Is this something other than the restart that resulted in the invariable "connection refused" error above?
Totally happy to have a post-setup workaround - but what would it be in this case?
you should ensure that it can be reachable at the time it starts - and you may need to restart it if it wasn't reachable at that point in time.
In your case I mean that you should restart the autohttps
pod after you know that the proxy-public service is properly accepting traffic.
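A minimal sketch of such a restart, assuming the autohttps deployment name and jhub namespace used elsewhere in this thread - recreating the pod makes Traefik retry the ACME challenge on startup:

kubectl --namespace=jhub rollout restart deployment/autohttps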
Whenever you work against Let's Encrypt, don't work against its production server directly - you can be blocked, etc. I worry that you don't get a full ACME challenge to happen. Use their staging server.
Increase Traefik's log level (and that of other pods) by setting debug.enabled: true in the JupyterHub Helm chart. That will help Traefik emit more details on the process of acquiring a cert from LE.
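In config.yaml this is (matching the debug section shown further down in this thread):

debug:
  enabled: true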
You can use kubectl logs deploy/autohttps to get the logs of one random pod associated with the deployment. Even more off topic: you can use kubectl logs -l component=autohttps if you have multiple replicas you want combined logs from.
https_pod=$(kubectl --namespace=jhub get pods -o custom-columns=POD:metadata.name | grep autohttps-)
kubectl --namespace=jhub logs pod/$https_pod traefik -f
Totally happy to have a post-setup workaround - but what would it be in this case?
I'm not sure if you mean an official z2jh post-setup workaround, or a personal one. Overall I'm not happy to have one embedded in z2jh without a very clear comprehension of what is done and what is mitigated.
If you want a hack to make this work, look at this code relying on sufficient log levels, and https://github.com/jupyterhub/zero-to-jupyterhub-k8s/pull/2431.
I'll look into this a bit again tomorrow - do your best to try to self-answer as many follow-up questions as you can!
How do I establish that the proxy-public service is properly accepting traffic? In testing so far, it doesn't seem to matter how long I wait to restart the autohttps pod, it always gives a "connection refused".
Is there a way of configuring which ACME server I am using?
How do I establish that the proxy-public service is properly accepting traffic? In testing so far, it doesn't seem to matter how long I wait to restart the autohttps pod, it always gives a "connection refused".
I don't know an easy way, but please inspect its associated k8s Events via kubectl describe service proxy-public and provide that information - it may or may not have an event attached. It depends on GKE-specific details of how its controller for LoadBalancer k8s Services works. I recall there is a descriptive event about it.
Is there a way of configuring which ACME server I am using?
With settings in config.yaml:
proxy:
  service:
    loadBalancerIP: 34.127.108.159
  https:
    enabled: true
    hosts:
      - hub.odsti.org
    letsencrypt:
      contactEmail: matthew.brett@gmail.com
      acmeServer: https://acme-staging-v02.api.letsencrypt.org/directory
debug:
  enabled: true
I get a different error - going straight to 'connection refused' - log:
tp@internal\",\"rule\":\"PathPrefix(`/.well-known/acme-challenge/`)\",\"priority\":2147483647}},\"services\":{\"acme-http\":{},\"noop\":{}},\"serversTransports\":{\"default\":{\"maxIdleConnsPerHost\":200}}},\"tcp\":{},\"tls\":{}}" providerName=internal
time="2022-02-24T15:26:16Z" level=debug msg="Configuration received from provider default.acme: {\"http\":{},\"tls\":{}}" providerName=default.acme
time="2022-02-24T15:26:16Z" level=info msg="Starting provider *acme.ChallengeTLSALPN"
time="2022-02-24T15:26:16Z" level=debug msg="*acme.ChallengeTLSALPN provider configuration: {\"Timeout\":4000000000}"
time="2022-02-24T15:26:16Z" level=debug msg="No default certificate, generating one" tlsStoreName=default
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" routerName=insecure@file serviceName=default middlewareName=pipelining middlewareType=Pipelining entryPointName=http
time="2022-02-24T15:26:16Z" level=debug msg="Creating load-balancer" serviceName=default entryPointName=http routerName=insecure@file
time="2022-02-24T15:26:16Z" level=debug msg="Creating server 0 http://proxy-http:8000/" entryPointName=http routerName=insecure@file serviceName=default serverName=0
time="2022-02-24T15:26:16Z" level=debug msg="child http://proxy-http:8000/ now UP"
time="2022-02-24T15:26:16Z" level=debug msg="Propagating new UP status"
time="2022-02-24T15:26:16Z" level=debug msg="Added outgoing tracing middleware default" routerName=insecure@file middlewareName=tracing middlewareType=TracingForwarder entryPointName=http
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" routerName=insecure@file middlewareName=redirect@file middlewareType=RedirectScheme entryPointName=http
time="2022-02-24T15:26:16Z" level=debug msg="Setting up redirection to https " entryPointName=http routerName=insecure@file middlewareName=redirect@file middlewareType=RedirectScheme
time="2022-02-24T15:26:16Z" level=debug msg="Adding tracing to middleware" entryPointName=http routerName=insecure@file middlewareName=redirect@file
time="2022-02-24T15:26:16Z" level=debug msg="Added outgoing tracing middleware acme-http@internal" entryPointName=http routerName=acme-http@internal middlewareName=tracing middlewareType=TracingForwarder
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" middlewareName=traefik-internal-recovery entryPointName=http middlewareType=Recovery
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" entryPointName=https routerName=default@file serviceName=default middlewareType=Pipelining middlewareName=pipelining
time="2022-02-24T15:26:16Z" level=debug msg="Creating load-balancer" serviceName=default entryPointName=https routerName=default@file
time="2022-02-24T15:26:16Z" level=debug msg="Creating server 0 http://proxy-http:8000/" serviceName=default serverName=0 entryPointName=https routerName=default@file
time="2022-02-24T15:26:16Z" level=debug msg="child http://proxy-http:8000/ now UP"
time="2022-02-24T15:26:16Z" level=debug msg="Propagating new UP status"
time="2022-02-24T15:26:16Z" level=debug msg="Added outgoing tracing middleware default" middlewareName=tracing middlewareType=TracingForwarder entryPointName=https routerName=default@file
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" entryPointName=https routerName=default@file middlewareName=scheme@file middlewareType=Headers
time="2022-02-24T15:26:16Z" level=debug msg="Setting up customHeaders/Cors from {map[X-Scheme:https] map[] false [] [] [] [] [] 0 false [] [] false false map[] false 0 false false false false false false false}" routerName=default@file middlewareName=scheme@file middlewareType=Headers entryPointName=https
time="2022-02-24T15:26:16Z" level=debug msg="Adding tracing to middleware" entryPointName=https routerName=default@file middlewareName=scheme@file
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" entryPointName=https routerName=default@file middlewareName=hsts@file middlewareType=Headers
time="2022-02-24T15:26:16Z" level=debug msg="Setting up secureHeaders from {map[] map[] false [] [] [] [] [] 0 false [] [] false false map[] false 15724800 false false false false false false false}" entryPointName=https routerName=default@file middlewareName=hsts@file middlewareType=Headers
time="2022-02-24T15:26:16Z" level=debug msg="Adding tracing to middleware" middlewareName=hsts@file entryPointName=https routerName=default@file
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" entryPointName=https middlewareName=traefik-internal-recovery middlewareType=Recovery
time="2022-02-24T15:26:16Z" level=warning msg="No domain found in rule PathPrefix(`/`), the TLS options applied for this router will depend on the hostSNI of each request" entryPointName=https routerName=default@file
time="2022-02-24T15:26:16Z" level=debug msg="No default certificate, generating one" tlsStoreName=default
time="2022-02-24T15:26:16Z" level=debug msg="Looking for provided certificate(s) to validate [\"hub.odsti.org\"]..." providerName=default.acme ACME CA="https://acme-staging-v02.api.letsencrypt.org/directory"
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" routerName=insecure@file serviceName=default middlewareName=pipelining middlewareType=Pipelining entryPointName=http
time="2022-02-24T15:26:16Z" level=debug msg="Creating load-balancer" serviceName=default entryPointName=http routerName=insecure@file
time="2022-02-24T15:26:16Z" level=debug msg="Creating server 0 http://proxy-http:8000/" entryPointName=http routerName=insecure@file serviceName=default serverName=0
time="2022-02-24T15:26:16Z" level=debug msg="Domains [\"hub.odsti.org\"] need ACME certificates generation for domains \"hub.odsti.org\"." ACME CA="https://acme-staging-v02.api.letsencrypt.org/directory" providerName=default.acme
time="2022-02-24T15:26:16Z" level=debug msg="Loading ACME certificates [hub.odsti.org]..." providerName=default.acme ACME CA="https://acme-staging-v02.api.letsencrypt.org/directory"
time="2022-02-24T15:26:16Z" level=debug msg="child http://proxy-http:8000/ now UP"
time="2022-02-24T15:26:16Z" level=debug msg="Propagating new UP status"
time="2022-02-24T15:26:16Z" level=debug msg="Added outgoing tracing middleware default" middlewareType=TracingForwarder entryPointName=http routerName=insecure@file middlewareName=tracing
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" middlewareType=RedirectScheme entryPointName=http routerName=insecure@file middlewareName=redirect@file
time="2022-02-24T15:26:16Z" level=debug msg="Setting up redirection to https " entryPointName=http routerName=insecure@file middlewareName=redirect@file middlewareType=RedirectScheme
time="2022-02-24T15:26:16Z" level=debug msg="Adding tracing to middleware" entryPointName=http routerName=insecure@file middlewareName=redirect@file
time="2022-02-24T15:26:16Z" level=debug msg="Added outgoing tracing middleware acme-http@internal" middlewareName=tracing middlewareType=TracingForwarder entryPointName=http routerName=acme-http@internal
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" middlewareName=traefik-internal-recovery middlewareType=Recovery entryPointName=http
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" routerName=default@file serviceName=default middlewareType=Pipelining middlewareName=pipelining entryPointName=https
time="2022-02-24T15:26:16Z" level=debug msg="Creating load-balancer" entryPointName=https routerName=default@file serviceName=default
time="2022-02-24T15:26:16Z" level=debug msg="Creating server 0 http://proxy-http:8000/" serviceName=default serverName=0 entryPointName=https routerName=default@file
time="2022-02-24T15:26:16Z" level=debug msg="child http://proxy-http:8000/ now UP"
time="2022-02-24T15:26:16Z" level=debug msg="Propagating new UP status"
time="2022-02-24T15:26:16Z" level=debug msg="Added outgoing tracing middleware default" middlewareName=tracing middlewareType=TracingForwarder entryPointName=https routerName=default@file
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" entryPointName=https routerName=default@file middlewareName=scheme@file middlewareType=Headers
time="2022-02-24T15:26:16Z" level=debug msg="Setting up customHeaders/Cors from {map[X-Scheme:https] map[] false [] [] [] [] [] 0 false [] [] false false map[] false 0 false false false false false false false}" entryPointName=https routerName=default@file middlewareName=scheme@file middlewareType=Headers
time="2022-02-24T15:26:16Z" level=debug msg="Adding tracing to middleware" entryPointName=https routerName=default@file middlewareName=scheme@file
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" entryPointName=https routerName=default@file middlewareType=Headers middlewareName=hsts@file
time="2022-02-24T15:26:16Z" level=debug msg="Setting up secureHeaders from {map[] map[] false [] [] [] [] [] 0 false [] [] false false map[] false 15724800 false false false false false false false}" middlewareName=hsts@file entryPointName=https routerName=default@file middlewareType=Headers
time="2022-02-24T15:26:16Z" level=debug msg="Adding tracing to middleware" entryPointName=https routerName=default@file middlewareName=hsts@file
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" entryPointName=https middlewareName=traefik-internal-recovery middlewareType=Recovery
time="2022-02-24T15:26:16Z" level=warning msg="No domain found in rule PathPrefix(`/`), the TLS options applied for this router will depend on the hostSNI of each request" routerName=default@file entryPointName=https
time="2022-02-24T15:26:16Z" level=debug msg="Looking for provided certificate(s) to validate [\"hub.odsti.org\"]..." providerName=default.acme ACME CA="https://acme-staging-v02.api.letsencrypt.org/directory"
time="2022-02-24T15:26:16Z" level=debug msg="No ACME certificate generation required for domains [\"hub.odsti.org\"]." ACME CA="https://acme-staging-v02.api.letsencrypt.org/directory" providerName=default.acme
time="2022-02-24T15:26:19Z" level=debug msg="Building ACME client..." providerName=default.acme
time="2022-02-24T15:26:19Z" level=debug msg="https://acme-staging-v02.api.letsencrypt.org/directory" providerName=default.acme
time="2022-02-24T15:26:19Z" level=info msg=Register... providerName=default.acme
time="2022-02-24T15:26:19Z" level=debug msg="legolog: [INFO] acme: Registering account for matthew.brett@gmail.com"
time="2022-02-24T15:26:19Z" level=debug msg="Using HTTP Challenge provider." providerName=default.acme
time="2022-02-24T15:26:19Z" level=debug msg="legolog: [INFO] [hub.odsti.org] acme: Obtaining bundled SAN certificate"
time="2022-02-24T15:26:20Z" level=debug msg="legolog: [INFO] [hub.odsti.org] AuthURL: https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/1763208468"
time="2022-02-24T15:26:20Z" level=debug msg="legolog: [INFO] [hub.odsti.org] acme: Could not find solver for: tls-alpn-01"
time="2022-02-24T15:26:20Z" level=debug msg="legolog: [INFO] [hub.odsti.org] acme: use http-01 solver"
time="2022-02-24T15:26:20Z" level=debug msg="legolog: [INFO] [hub.odsti.org] acme: Trying to solve HTTP-01"
time="2022-02-24T15:26:37Z" level=debug msg="legolog: [INFO] Deactivating auth: https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/1763208468"
time="2022-02-24T15:26:37Z" level=error msg="Unable to obtain ACME certificate for domains \"hub.odsti.org\" : unable to generate a certificate for the domains [hub.odsti.org]: error: one or more domains had a problem:\n[hub.odsti.org] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://hub.odsti.org/.well-known/acme-challenge/s3AOq8SFHdAE2Z3AOhkIjV4gxQmZ2pdJSejLIRrAFJk: Connection refused\n" providerName=default.acme
On the run recorded above, with debug and the staging server:
$ kubectl describe pod $https_pod > autohttps_describe.txt
$ kubectl describe service proxy-public > proxy_public_service.txt
$ kubectl get events > all_events.txt
These files are all pushed to https://github.com/matthew-brett/littlest-jhub-k8s
Restarting the autohttps pod 12 minutes after the cluster comes up gives the same final error of "Connection refused". This seems like the initial attempt was doomed to fail, because the proxy-public service wasn't set up fast enough for the autohttps pod to successfully start up.
7m44s Normal EnsuringLoadBalancer service/proxy-public Ensuring load balancer
7m16s Normal Started pod/autohttps-5588fb5c97-v9px5 Started container traefik
7m8s Normal EnsuredLoadBalancer service/proxy-public Ensured load balancer
If you get a recurring error after a restart made when "Ensured load balancer" has already happened - at which point I presume you would be reachable from the internet, unless there are other issues - then it could in theory be related to a failed initial startup, if state was stashed from the first start.
Traefik isn't storing the TLS cert it acquires in a k8s Secret, but in a local file. So, we have made a solution to let that file be synced to a k8s Secret manually to keep state there. Perhaps such saved state is problematic?
Do you have such a secret created? It should be named proxy-public-tls-acme. Also, this isn't managed by the Helm chart, so it isn't cleared as part of deleting a Helm release. If you have such a secret, it would be nice if you could provide its content to be analyzed further.
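A sketch of how to dump it, if it exists (namespace as used elsewhere in this thread):

kubectl --namespace=jhub get secret proxy-public-tls-acme -o yaml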
No - I haven't seen that secret:
$ kubectl get secrets
NAME TYPE DATA AGE
autohttps-token-ztblb kubernetes.io/service-account-token 3 34m
default-token-kkg7f kubernetes.io/service-account-token 3 34m
hub Opaque 4 34m
hub-token-wcqqn kubernetes.io/service-account-token 3 34m
sh.helm.release.v1.jhub.v1 helm.sh/release.v1 1 34m
user-scheduler-token-9wdm6 kubernetes.io/service-account-token 3 34m
Okay, then you don't have state that would interfere with things. Your initial failure was very likely because your proxy-public k8s Service wasn't "ready" as a LoadBalancer type - which requires setup by GCP.
Perhaps something blocks HTTP traffic or similar, or doesn't route traffic correctly. But you said this was an intermittent failure, I recall - in what way, if so?
Intermittent in the sense that just after the upgrade to the Helm chart yesterday, with the Traefik update, the same cluster setup did acquire a certificate, in two teardown / rebuild cycles. It failed again in every test today, with the same configuration and the same Helm chart.
Maybe you have a GCP firewall rule set up to block traffic...
I suggest that you verify that a k8s Service of kind LoadBalancer with a specific IP is actually able to reach a dummy pod doing something silly - like echoing back web requests.
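A minimal sketch of such a smoke test - the resource names and demo image are illustrative, and it uses an ephemeral external IP rather than the reserved one:

# Run a trivial web server and expose it via a LoadBalancer Service
kubectl create deployment echo-test --image=nginxdemos/hello
kubectl expose deployment echo-test --type=LoadBalancer --port=80
# Wait for an EXTERNAL-IP to be assigned, then check it responds from outside the cluster
kubectl get svc echo-test -w
curl http://<EXTERNAL-IP>/
# Clean up afterwards
kubectl delete svc,deployment echo-test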
in two teardown / rebuild cycles.
How deep a teardown - which of these?
- helm delete ...
- helm delete ... + kubectl delete namespace ...
- Deleting the GKE cluster
- Deleting the GCP project (includes firewall rules etc).
Teardown meaning deleting the GKE cluster. How would a GCP firewall rule come about - I guess it would have to come from something other than the gcloud
commands and the Helm chart from the default install?
I'm not confident on what to look for to ensure the proxy-public service functions fully, but try to learn that by inspecting kubectl get svc proxy-public -o yaml, looking in the status section, and at associated events via kubectl describe svc proxy-public. The goal: to figure out if there is information given to you describing whether it fully functions or not. Also note that k8s Events don't live forever (10 minutes?), so you must check before they time out.
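For example (a sketch - the jsonpath expression is just one way to pull out the status field mentioned above):

kubectl get svc proxy-public -o jsonpath='{.status.loadBalancer.ingress}'
kubectl describe svc proxy-public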
Also, clarify if your GKE cluster was setup with a network policy controller like calico or not, and if so, what version of calico.
When you create a k8s Service resource with type LoadBalancer (proxy-public), GKE automatically does things you can observe in GCP - such as creating firewall rules, binding itself to the IP you have requested, etc.
I don't know - I have no clue what is wrong - but I'm confident you have a failure to connect between the public Let's Encrypt servers and your autohttps pod running Traefik, or Traefik itself refused network connectivity. Which of these, I can't tell, so you have to check each step along the way.
Do your best to investigate this further on your own - I can't invest more time into this but I have such a hard time not answering questions while I want to keep track of other notifications on github =/
Yes, I understand. I posted here because I assumed that, because I had the absolute default setup, it would be relevant for others. Or at least, if I don't have the absolutely default setup, I do not understand why not :)
kubectl get svc proxy-public -o yaml
shows:
status:
  loadBalancer:
    ingress:
    - ip: 34.127.108.159
Anyway - it sounds as if it would be better to wait to see if someone else runs across this - and they can contribute here.
In the meantime, I think I will have to give up, and back off to a Littlest JupyterHub setup.
And of course - I would predict that anyone cloning my repo and running those commands will get the same effect.
I reproduced it. It seems to me that the success rate is almost 100% when you are pulling a new image in the autohttps pod, assuming proxy-public has a functional IP etc already.
I could reproduce it for Traefik versions 2.4-2.6. I figure this is about the CNI (what handles networking for containers) in GKE having become slower or similar. Not sure. I'll try to figure out if it's the CNI, or perhaps if it is Traefik that is slow to become ready.
At the moment I'm rate-limited by LE's staging server.
That's great that you could reproduce it - kind of! And sorry for taking your focus - it's kind of you to persist.
Hi guys, I solved this issue by deleting the autohttps pod, forcing a restart of the service. SSL is now enabled for my JupyterHub, where it threw an invalid SSL cert error before.
This post shows the approach: https://discourse.jupyter.org/t/trouble-getting-https-letsencrypt-working-with-0-9-0-beta-4/3583/5
For completeness:
Find the JupyterHub namespace (you defined your namespace when you created the JupyterHub instance, if you did it with Helm):
kubectl get namespace
kubectl get pods --namespace <your-namespace>
Now identify the pod whose name contains "autohttps" and copy the full name to your clipboard.
kubectl delete pods --namespace <your-namespace> <autohttps-podname>
After deletion, Kubernetes will automatically spin up a new autohttps pod. You can verify this by listing the pods again:
kubectl get pods --namespace <your-namespace>
The new autohttps pod should now have a different name than your original one.
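Equivalently, as a single command (assuming the chart's component=autohttps label mentioned earlier in this thread):

kubectl delete pod --namespace <your-namespace> -l component=autohttps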
This is tested and confirmed to work with JupyterHub Helm chart version 1.1.1 and the Kubernetes installation instructions from the official docs as of 2022-02-24.
EDIT: This was deployed and tested on GCP/GKE.
I think I've nailed it, as I could make sure it didn't occur by introducing a delay between when the k8s Pod had been scheduled on a node and received an IP, and when Traefik started - either by pulling a new image on a node, or by tweaking the startup command to first run sleep 10 before starting Traefik as usual.
I've proposed a new feature for Traefik in https://github.com/traefik/traefik/issues/8803. But for now, a workaround could be to redeploy by setting new image tags to force restarts of the autohttps pod. Example config:
proxy:
  traefik:
    image:
      # tag modified to trigger a restart of the autohttps pod and induce a
      # delay while downloading the image, which ensures networking gets set up
      # in time, so that the requested ACME challenge, where the Pod will
      # receive inbound network traffic, can succeed.
      tag: 2.6.0 # default is 2.6.1
Another option is to edit the autohttps deployment like this:
kubectl edit deploy autohttps

# ...
      containers:
        - image: traefik:v2.6.1
+         command: ["sh", "-c", "sleep 10 && /entrypoint.sh traefik"]
          imagePullPolicy: IfNotPresent
# ...
This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:
https://discourse.jupyter.org/t/trouble-getting-https-letsencrypt-working-with-0-9-0-beta-4/3583/12
Thanks for really good suggestions, @consideRatio!
What do you think about programming in a restart of the autohttps pod after deployment instead? While it might not be a good solution for Traefik users in general, it could be a decent solution for JupyterHub users. It looks like it could be programmed with Terraform or a similar tool.
It would avoid both disadvantages you mentioned in the Traefik GitHub issue, wouldn't it?
Does pod deletion alternatively introduce potential problems/errors that I'm not aware of?
Please excuse all the questions here - I think you're more well versed in Kubernetes and Traefik than I am!
What do you think about programming in a restart of the autohttps pod after deployment instead?
Such a workaround is problematic for other reasons. For example, you can be rate-limited by Let's Encrypt if you repeatedly fail - say, for example, that you had forgotten to configure the domain correctly. You don't want that to happen by mistake.
I think the only healthy choice is not to re-attempt, but perhaps also to wait a bit by default before making an attempt; since this is quite a rare thing you need to do, waiting a while isn't very problematic. And if this wait can be removed by configuration, it's even less problematic.
That sounds reasonable - a wait by default, that can be removed with a non-default configuration setting.
FWIW, we've had our prod cluster down for 24 hrs now largely due to this "race condition" on GKE, and this is the 3rd time in six months it's cost me more than 6 hrs. I should learn and remember how to hack around this, and of course I'll doc it this time, but yikes - very hard to debug even once you figure out how to turn debug logs on!
@snickell yeah, it is a very complicated interaction to ensure is fully functional. I've invested a lot of time to debug and fix issues: in z2jh, which had a readiness probe; in Traefik, which had race conditions in its software; and now we have a race condition between the CNI of the k8s cluster that sets up the networking and the container startup.
I see it as a regression in GKE's CNI that is quite problematic, as not being able to count on inbound traffic being supported as soon as the pod starts is a big downside. I think the best solution for z2jh is to have a configurable delay for Traefik, so that we don't risk making multiple attempts against LE and getting rate-limited for a long time, while also giving us a good chance to start up properly right away.
@consideRatio - I see that the Traefik issue is closed now, in favor of other workarounds.
I assume the initContainers workaround suggested by Martin Jakobsson is the right one?
deployment:
  initContainers:
    - name: startup-delay
      image: busybox:1.31.1
      command: ['sh', '-c', 'sleep 5']
What is the cleanest way of applying the workaround?
@matthew-brett yepp I suggest anyone having trouble with this goes for an initContainer workaround.
proxy:
  traefik:
    extraInitContainers:
      # This startup delay can help the k8s container network interface set up
      # network routing of traffic to the pod, which is essential for the ACME
      # challenge submitted by Traefik on startup to acquire a TLS certificate.
      #
      # Sleeping for 7 seconds has been consistently sufficient to avoid issues
      # in GKE when this workaround was explored initially for GKE.
      #
      - name: startup-delay
        image: busybox:stable
        command: ["sh", "-c", "sleep 10"]
If this problem remains, it would be great to have this documented as well. I currently think of this as a bug in GKE's setup though, and hope it's fixed over time. Please note that it's relevant to know whether it's fixed or not, so feel free to comment if you who read this have an update about it - and make sure to mention the GKE version and the Helm chart version if you experience an intermittent failure to acquire certs.
@consideRatio - thanks for this - as ever. I believe the extraInitContainers
fix above requires a very recent (devel) Helm chart - I've applied it.
Do you have any ideas on where this should be reported to the GKE folks?
Do you have any ideas on where this should be reported to the GKE folks?
@matthew-brett actually no - it would be super helpful to have a report about this to track. It may very well be some non-GKE-specific component that leads to this; I'm not sure where, though.
If anyone knows where to report this, or has found a report already about it, please link it!
@matthew-brett I think the crux is that kube-proxy
is too slow to update iptables on nodes.
See https://cloud.google.com/kubernetes-engine/docs/concepts/network-overview#kube-proxy where it's said that:
kube-proxy, which is not an in-line proxy, but an egress-based load-balancing controller, watches the Kubernetes API server and continually maps the ClusterIP to healthy Pods by adding and removing destination NAT (DNAT) rules to the node's iptables subsystem. When a container running in a Pod sends traffic to a Service's ClusterIP, the node selects a Pod at random and routes the traffic to that Pod.
Perhaps something was changed to make kube-proxy slower, either by GKE configuration or setup, or by kube-proxy itself.
kube-proxy
is documented here: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/
@matthew-brett hmmm actually this may be it.
I did kubectl get ds -o yaml kube-proxy
and observed the following.
- command:
  - /bin/sh
  - -c
  - kube-proxy --cluster-cidr=10.72.0.0/14 --oom-score-adj=-998 --v=2 --feature-gates=DynamicKubeletConfig=false,InTreePluginAWSUnregister=true,InTreePluginAzureDiskUnregister=true,InTreePluginOpenStackUnregister=true,InTreePluginvSphereUnregister=true,PodDeletionCost=true
    --iptables-sync-period=1m --iptables-min-sync-period=10s --ipvs-sync-period=1m
    --ipvs-min-sync-period=10s --detect-local-mode=NodeCIDR 1>>/var/log/kube-proxy.log
    2>&1
--iptables-sync-period=1m (Default: 30s)
--iptables-min-sync-period=10s (Default: 1s)
--ipvs-sync-period=1m (Default: 30s)
--ipvs-min-sync-period=10s (no default documented)
I think these settings, if tweaked, could also fix the problem. I'd love to know more if they do, but I can't invest time to look into it atm.
@consideRatio thanks so much for your debugging here - I still had this issue with a JupyterHub for Kubernetes deployment on GKE today. Same issue (looking at the Traefik pod logs, we just weren't able to connect to the ACME challenge, so it would never issue a certificate). I spent about 9 hours debugging it even after adding your fix in; it turns out something changed recently such that 7 seconds isn't enough. I bumped it up to 20 seconds and the issues stopped.
Thank you for following up about this @Hansenq!!
If you have time and energy left, I'm curious about:
Yup, that's my workaround! The issue I was facing was this (from the traefik pod):
time="2024-01-03T04:34:38Z" level=debug msg="Using HTTP Challenge provider." providerName=default.acme
time="2024-01-03T04:34:38Z" level=debug msg="legolog: [INFO] [jupyterhub-test.lightski.com] acme: Obtaining bundled SAN certificate"
time="2024-01-03T04:34:38Z" level=debug msg="legolog: [INFO] [jupyterhub-test.lightski.com] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/300212319836"
time="2024-01-03T04:34:38Z" level=debug msg="legolog: [INFO] [jupyterhub-test.lightski.com] acme: Could not find solver for: tls-alpn-01"
time="2024-01-03T04:34:38Z" level=debug msg="legolog: [INFO] [jupyterhub-test.lightski.com] acme: use http-01 solver"
time="2024-01-03T04:34:38Z" level=debug msg="legolog: [INFO] [jupyterhub-test.lightski.com] acme: Trying to solve HTTP-01"
time="2024-01-03T04:34:42Z" level=debug msg="legolog: [INFO] Deactivating auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/300212319836"
time="2024-01-03T04:34:42Z" level=error msg="Unable to obtain ACME certificate for domains \"jupyterhub-test.lightski.com\"" ACME CA="https://acme-v02.api.letsencrypt.org/directory" routerName=default@file rule="PathPrefix(`/`)" error="unable to generate a certificate for the domains [jupyterhub-test.lightski.com]: error: one or more domains had a problem:\n[j
upyterhub-test.lightski.com] acme: error: 400 :: urn:ietf:params:acme:error:connection :: 34.168.177.244: Fetching http://jupyterhub-test.lightski.com/.well-known/acme-challenge/aRNWnJaar79ufFNeAbcpyMTTCfW2FDlBXdaSEOSbGhE: Connection refused\n" providerName=default.acme
time="2024-01-03T04:35:00Z" level=debug msg="Unable to split host and port: address jupyterhub-test.lightski.com: missing port in address. Fallback to request host." providerName=acme
time="2024-01-03T04:35:00Z" level=debug msg="Retrieving the ACME challenge for jupyterhub-test.lightski.com (token \"aRNWnJaar79ufFNeAbcpyMTTCfW2FDlBXdaSEOSbGhE\")..." providerName=acme
time="2024-01-03T04:35:00Z" level=error msg="Error getting challenge for token retrying in 723.13307ms" providerName=acme
time="2024-01-03T04:35:00Z" level=error msg="Error getting challenge for token retrying in 792.620258ms" providerName=acme
time="2024-01-03T04:35:01Z" level=error msg="Error getting challenge for token retrying in 590.054227ms" providerName=acme
time="2024-01-03T04:35:02Z" level=error msg="Error getting challenge for token retrying in 1.394355836s" providerName=acme
time="2024-01-03T04:35:03Z" level=debug msg="Unable to split host and port: address jupyterhub-test.lightski.com: missing port in address. Fallback to request host." providerName=acme
time="2024-01-03T04:35:03Z" level=debug msg="Retrieving the ACME challenge for jupyterhub-test.lightski.com (token \"aRNWnJaar79ufFNeAbcpyMTTCfW2FDlBXdaSEOSbGhE\")..." providerName=acme
To be completely fair, I tried about 10 different ways to get it to work (including getting rid of my custom baseUrl
and hub_routespec
which I thought were causing problems, trying different domains, etc.). I'm 90% sure it's the 20 seconds that got it to work.
The other potential alternative is that traefik was failing with a host with a subdomain (e.g. jupyter-test.lightski.com
), and as soon as I added a host without a subdomain (e.g. myothersubdomain.com
) it started to work. I'm pretty sure it's not the latter since that...just doesn't seem right, but I was a bit tired from all the debugging when it finally started to work.
Client Version: v1.28.4, Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3, Server Version: v1.28.3-gke.1203001
Not sure I have that
[22:22:44] CS/lightski(main)%> kubectl get ds --namespace=jupyterhub-prod
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
continuous-image-puller 2 2 2 2 2 <none> 10h
hook-image-puller 2 2 2 2 2 <none> 87m
[22:22:51] CS/lightski(main)%> kubectl get ds kube-proxy --namespace=jupyterhub-prod
Error from server (NotFound): daemonsets.apps "kube-proxy" not found
Thank you for taking the time to follow this up so thoroughly!!
You can find the kube-proxy daemonset in the kube-system namespace (it's not used with dataplane v2, but you should see it there, as dataplane v2 was disabled), so:
kubectl get ds -o yaml kube-proxy -n kube-system
oh no, same to you, for a 1.5 year old issue! You were right; here's the relevant section of the yaml (let me know if you want the whole thing):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
  creationTimestamp: "2024-01-02T19:16:04Z"
  generation: 1
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: kube-proxy
  name: kube-proxy
  namespace: kube-system
  resourceVersion: "926"
  uid: 40286d34-e3d3-4cff-a8c4-a68c802de62d
spec:
  revisionHistoryLimit: 10
  template:
    spec:
      containers:
      - command:
        - /bin/sh
        - -c
        - kube-proxy --cluster-cidr=10.124.0.0/14 --oom-score-adj=-998 --v=2 --feature-gates=InTreePluginAWSUnregister=true,InTreePluginAzureDiskUnregister=true,InTreePluginvSphereUnregister=true,DisableKubeletCloudCredentialProviders=true,UnauthenticatedHTTP2DOSMitigation=true
          --iptables-sync-period=1m --iptables-min-sync-period=10s --ipvs-sync-period=1m
          --ipvs-min-sync-period=10s --detect-local-mode=NodeCIDR 1>>/var/log/kube-proxy.log
          2>&1
        env:
        - name: KUBERNETES_SERVICE_HOST
          value: 10.138.0.9
        image: gke.gcr.io/kube-proxy-amd64:v1.28.3-gke.1203001
        imagePullPolicy: IfNotPresent
        name: kube-proxy
My JupyterHub / Kubernetes cluster is failing to acquire LetsEncrypt certificates for HTTPS.
I am working with an absolutely default z2jh / GKE setup generated from copy-paste of the current install instructions at:
https://zero-to-jupyterhub.readthedocs.io/en/latest/kubernetes/google/step-zero-gcp.html
then
https://zero-to-jupyterhub.readthedocs.io/en/latest/jupyterhub/installation.html
followed by absolute default LetsEncrypt setup at:
https://tljh.jupyter.org/en/latest/howto/admin/https.html
The exact scripts are in https://github.com/matthew-brett/littlest-jhub-k8s. The README shows the steps to set up the cluster, basically just running scripts with the default instructions copy-pasted.
config.yaml is:
I set the DNS A record about 24 hours ago, and nslookup hub.odsti.org does correctly give me 34.127.108.159, and has done for more than 12 hours.
I have not modified the default GKE setup in any way - gcloud container clusters describe jhub-cluster --zone=us-west1-b shows me:
I have no ingress controller - because I have the default setup, and I have no readinessProbe configured.
I'm using the latest Helm chart at time of writing: 1.1.3-n324.hb93a9b5f, but have also tried 1.2.0 with the same result.
What I am seeing is errors in the autohttps pod log.
gives:
If I restart the autohttps pod:
Then get the log for the new pod:
I get:
kubectl --namespace=jhub get secrets shows no proxy-public-tls-acme secret - see this thread.
Once or twice, these exact same commands did give a certificate, and HTTPS access, and did not give errors in the log - so this appears to be a race condition of some sort, but I am lost as to how to debug further.
There is some discussion of this problem on Gitter, starting at https://gitter.im/jupyterhub/jupyterhub?at=621654ca0909252318f3d7dc.