My guess is that your proxy-public service didn't pass onward networking traffic fast enough compared to how fast the autohttps pod started.
proxy:
  service:
    loadBalancerIP: 34.127.108.159
I can imagine a dirty hack that would for example add an init-container to the autohttps pod that slept for 10 seconds on first installation of the Helm chart. It would be a hack that wouldn't be relevant if you for example had a k8s Service that could react quicker.
I would prefer the idea of documenting that if something goes wrong for the autohttps pod, you should ensure that it is reachable at the time it starts - and you may need to restart it if it wasn't reachable at that point in time.
I would recommend that you inspect the k8s events for the autohttps pod, which you can see by doing kubectl describe pod <autohttps pod name>, as well as the proxy-public service (kubectl describe service proxy-public). Maybe also kubectl get events to see them all. If you can see, for example, that the autohttps pod starts before an event where the proxy-public service is successfully coupled with the requested IP, then that's a clear signal that you had a race condition there.
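For reference, the same inspection commands gathered in one place (a sketch - the jhub namespace is taken from elsewhere in this thread):

kubectl --namespace=jhub describe pod <autohttps pod name>
kubectl --namespace=jhub describe service proxy-public
kubectl --namespace=jhub get events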
and you may need to restart it if it wasn't reachable at that point in time.
Is this something other than the restart that resulted in the invariable "connection refused" error above?
Totally happy to have a post-setup workaround - but what would it be in this case?
you should ensure that it can be reachable at the time it starts - and you may need to restart it if it wasn't reachable at that point in time.
In your case I mean that you should restart the autohttps
pod after you know that the proxy-public service is properly accepting traffic.
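A minimal sketch of such a restart, assuming the autohttps deployment name and jhub namespace used elsewhere in this thread - recreating the pod makes Traefik retry the ACME challenge on startup:

kubectl --namespace=jhub rollout restart deployment/autohttps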
Whenever you work against Let's Encrypt, don't work against its production server directly - you can be blocked, etc. I worry that you don't get a full ACME challenge to happen. Use their staging server.
Increase Traefik's log level (and that of other pods) by setting debug.enabled: true in the JupyterHub Helm chart. That will help Traefik emit more details on the process of acquiring a cert from LE.
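In config.yaml this is (matching the debug section shown further down in this thread):

debug:
  enabled: true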
You can use kubectl logs deploy/autohttps to get the logs of one random pod associated with the deployment. Even more off topic: you can use kubectl logs -l component=autohttps if you have multiple replicas you want combined logs from.
https_pod=$(kubectl --namespace=jhub get pods -o custom-columns=POD:metadata.name | grep autohttps-)
kubectl --namespace=jhub logs pod/$https_pod traefik -f
Totally happy to have a post-setup workaround - but what would it be in this case?
I'm not sure if you mean an official z2jh post-setup workaround, or a personal one. Overall I'm not happy to have one embedded in z2jh without a very clear comprehension of what is done and what is mitigated.
If you want a hack to make this work, look at this code relying on sufficient log levels, and https://github.com/jupyterhub/zero-to-jupyterhub-k8s/pull/2431.
I'll look into this a bit again tomorrow - do your best to try to self-answer as many follow-up questions as you can!
How do I establish that the proxy-public service is properly accepting traffic? In testing so far, it doesn't seem to matter how long I wait to restart the autohttps pod, it always gives a "connection refused".
Is there a way of configuring which ACME server I am using?
How do I establish that the proxy-public service is properly accepting traffic? In testing so far, it doesn't seem to matter how long I wait to restart the autohttps pod, it always gives a "connection refused".
I don't know an easy way, but please inspect its associated k8s Events via kubectl describe service proxy-public and provide that information - it may or may not have an event attached. It depends on GKE-specific details of how its controller for LoadBalancer k8s Services works. I recall there is a descriptive event about it.
Is there a way of configuring which ACME server I am using?
With settings in config.yaml:
proxy:
  service:
    loadBalancerIP: 34.127.108.159
  https:
    enabled: true
    hosts:
      - hub.odsti.org
    letsencrypt:
      contactEmail: matthew.brett@gmail.com
      acmeServer: https://acme-staging-v02.api.letsencrypt.org/directory
debug:
  enabled: true
I get a different error - going straight to 'connection refused' - log:
tp@internal\",\"rule\":\"PathPrefix(`/.well-known/acme-challenge/`)\",\"priority\":2147483647}},\"services\":{\"acme-http\":{},\"noop\":{}},\"serversTransports\":{\"default\":{\"maxIdleConnsPerHost\":200}}},\"tcp\":{},\"tls\":{}}" providerName=internal
time="2022-02-24T15:26:16Z" level=debug msg="Configuration received from provider default.acme: {\"http\":{},\"tls\":{}}" providerName=default.acme
time="2022-02-24T15:26:16Z" level=info msg="Starting provider *acme.ChallengeTLSALPN"
time="2022-02-24T15:26:16Z" level=debug msg="*acme.ChallengeTLSALPN provider configuration: {\"Timeout\":4000000000}"
time="2022-02-24T15:26:16Z" level=debug msg="No default certificate, generating one" tlsStoreName=default
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" routerName=insecure@file serviceName=default middlewareName=pipelining middlewareType=Pipelining entryPointName=http
time="2022-02-24T15:26:16Z" level=debug msg="Creating load-balancer" serviceName=default entryPointName=http routerName=insecure@file
time="2022-02-24T15:26:16Z" level=debug msg="Creating server 0 http://proxy-http:8000/" entryPointName=http routerName=insecure@file serviceName=default serverName=0
time="2022-02-24T15:26:16Z" level=debug msg="child http://proxy-http:8000/ now UP"
time="2022-02-24T15:26:16Z" level=debug msg="Propagating new UP status"
time="2022-02-24T15:26:16Z" level=debug msg="Added outgoing tracing middleware default" routerName=insecure@file middlewareName=tracing middlewareType=TracingForwarder entryPointName=http
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" routerName=insecure@file middlewareName=redirect@file middlewareType=RedirectScheme entryPointName=http
time="2022-02-24T15:26:16Z" level=debug msg="Setting up redirection to https " entryPointName=http routerName=insecure@file middlewareName=redirect@file middlewareType=RedirectScheme
time="2022-02-24T15:26:16Z" level=debug msg="Adding tracing to middleware" entryPointName=http routerName=insecure@file middlewareName=redirect@file
time="2022-02-24T15:26:16Z" level=debug msg="Added outgoing tracing middleware acme-http@internal" entryPointName=http routerName=acme-http@internal middlewareName=tracing middlewareType=TracingForwarder
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" middlewareName=traefik-internal-recovery entryPointName=http middlewareType=Recovery
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" entryPointName=https routerName=default@file serviceName=default middlewareType=Pipelining middlewareName=pipelining
time="2022-02-24T15:26:16Z" level=debug msg="Creating load-balancer" serviceName=default entryPointName=https routerName=default@file
time="2022-02-24T15:26:16Z" level=debug msg="Creating server 0 http://proxy-http:8000/" serviceName=default serverName=0 entryPointName=https routerName=default@file
time="2022-02-24T15:26:16Z" level=debug msg="child http://proxy-http:8000/ now UP"
time="2022-02-24T15:26:16Z" level=debug msg="Propagating new UP status"
time="2022-02-24T15:26:16Z" level=debug msg="Added outgoing tracing middleware default" middlewareName=tracing middlewareType=TracingForwarder entryPointName=https routerName=default@file
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" entryPointName=https routerName=default@file middlewareName=scheme@file middlewareType=Headers
time="2022-02-24T15:26:16Z" level=debug msg="Setting up customHeaders/Cors from {map[X-Scheme:https] map[] false [] [] [] [] [] 0 false [] [] false false map[] false 0 false false false false false false false}" routerName=default@file middlewareName=scheme@file middlewareType=Headers entryPointName=https
time="2022-02-24T15:26:16Z" level=debug msg="Adding tracing to middleware" entryPointName=https routerName=default@file middlewareName=scheme@file
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" entryPointName=https routerName=default@file middlewareName=hsts@file middlewareType=Headers
time="2022-02-24T15:26:16Z" level=debug msg="Setting up secureHeaders from {map[] map[] false [] [] [] [] [] 0 false [] [] false false map[] false 15724800 false false false false false false false}" entryPointName=https routerName=default@file middlewareName=hsts@file middlewareType=Headers
time="2022-02-24T15:26:16Z" level=debug msg="Adding tracing to middleware" middlewareName=hsts@file entryPointName=https routerName=default@file
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" entryPointName=https middlewareName=traefik-internal-recovery middlewareType=Recovery
time="2022-02-24T15:26:16Z" level=warning msg="No domain found in rule PathPrefix(`/`), the TLS options applied for this router will depend on the hostSNI of each request" entryPointName=https routerName=default@file
time="2022-02-24T15:26:16Z" level=debug msg="No default certificate, generating one" tlsStoreName=default
time="2022-02-24T15:26:16Z" level=debug msg="Looking for provided certificate(s) to validate [\"hub.odsti.org\"]..." providerName=default.acme ACME CA="https://acme-staging-v02.api.letsencrypt.org/directory"
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" routerName=insecure@file serviceName=default middlewareName=pipelining middlewareType=Pipelining entryPointName=http
time="2022-02-24T15:26:16Z" level=debug msg="Creating load-balancer" serviceName=default entryPointName=http routerName=insecure@file
time="2022-02-24T15:26:16Z" level=debug msg="Creating server 0 http://proxy-http:8000/" entryPointName=http routerName=insecure@file serviceName=default serverName=0
time="2022-02-24T15:26:16Z" level=debug msg="Domains [\"hub.odsti.org\"] need ACME certificates generation for domains \"hub.odsti.org\"." ACME CA="https://acme-staging-v02.api.letsencrypt.org/directory" providerName=default.acme
time="2022-02-24T15:26:16Z" level=debug msg="Loading ACME certificates [hub.odsti.org]..." providerName=default.acme ACME CA="https://acme-staging-v02.api.letsencrypt.org/directory"
time="2022-02-24T15:26:16Z" level=debug msg="child http://proxy-http:8000/ now UP"
time="2022-02-24T15:26:16Z" level=debug msg="Propagating new UP status"
time="2022-02-24T15:26:16Z" level=debug msg="Added outgoing tracing middleware default" middlewareType=TracingForwarder entryPointName=http routerName=insecure@file middlewareName=tracing
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" middlewareType=RedirectScheme entryPointName=http routerName=insecure@file middlewareName=redirect@file
time="2022-02-24T15:26:16Z" level=debug msg="Setting up redirection to https " entryPointName=http routerName=insecure@file middlewareName=redirect@file middlewareType=RedirectScheme
time="2022-02-24T15:26:16Z" level=debug msg="Adding tracing to middleware" entryPointName=http routerName=insecure@file middlewareName=redirect@file
time="2022-02-24T15:26:16Z" level=debug msg="Added outgoing tracing middleware acme-http@internal" middlewareName=tracing middlewareType=TracingForwarder entryPointName=http routerName=acme-http@internal
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" middlewareName=traefik-internal-recovery middlewareType=Recovery entryPointName=http
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" routerName=default@file serviceName=default middlewareType=Pipelining middlewareName=pipelining entryPointName=https
time="2022-02-24T15:26:16Z" level=debug msg="Creating load-balancer" entryPointName=https routerName=default@file serviceName=default
time="2022-02-24T15:26:16Z" level=debug msg="Creating server 0 http://proxy-http:8000/" serviceName=default serverName=0 entryPointName=https routerName=default@file
time="2022-02-24T15:26:16Z" level=debug msg="child http://proxy-http:8000/ now UP"
time="2022-02-24T15:26:16Z" level=debug msg="Propagating new UP status"
time="2022-02-24T15:26:16Z" level=debug msg="Added outgoing tracing middleware default" middlewareName=tracing middlewareType=TracingForwarder entryPointName=https routerName=default@file
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" entryPointName=https routerName=default@file middlewareName=scheme@file middlewareType=Headers
time="2022-02-24T15:26:16Z" level=debug msg="Setting up customHeaders/Cors from {map[X-Scheme:https] map[] false [] [] [] [] [] 0 false [] [] false false map[] false 0 false false false false false false false}" entryPointName=https routerName=default@file middlewareName=scheme@file middlewareType=Headers
time="2022-02-24T15:26:16Z" level=debug msg="Adding tracing to middleware" entryPointName=https routerName=default@file middlewareName=scheme@file
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" entryPointName=https routerName=default@file middlewareType=Headers middlewareName=hsts@file
time="2022-02-24T15:26:16Z" level=debug msg="Setting up secureHeaders from {map[] map[] false [] [] [] [] [] 0 false [] [] false false map[] false 15724800 false false false false false false false}" middlewareName=hsts@file entryPointName=https routerName=default@file middlewareType=Headers
time="2022-02-24T15:26:16Z" level=debug msg="Adding tracing to middleware" entryPointName=https routerName=default@file middlewareName=hsts@file
time="2022-02-24T15:26:16Z" level=debug msg="Creating middleware" entryPointName=https middlewareName=traefik-internal-recovery middlewareType=Recovery
time="2022-02-24T15:26:16Z" level=warning msg="No domain found in rule PathPrefix(`/`), the TLS options applied for this router will depend on the hostSNI of each request" routerName=default@file entryPointName=https
time="2022-02-24T15:26:16Z" level=debug msg="Looking for provided certificate(s) to validate [\"hub.odsti.org\"]..." providerName=default.acme ACME CA="https://acme-staging-v02.api.letsencrypt.org/directory"
time="2022-02-24T15:26:16Z" level=debug msg="No ACME certificate generation required for domains [\"hub.odsti.org\"]." ACME CA="https://acme-staging-v02.api.letsencrypt.org/directory" providerName=default.acme
time="2022-02-24T15:26:19Z" level=debug msg="Building ACME client..." providerName=default.acme
time="2022-02-24T15:26:19Z" level=debug msg="https://acme-staging-v02.api.letsencrypt.org/directory" providerName=default.acme
time="2022-02-24T15:26:19Z" level=info msg=Register... providerName=default.acme
time="2022-02-24T15:26:19Z" level=debug msg="legolog: [INFO] acme: Registering account for matthew.brett@gmail.com"
time="2022-02-24T15:26:19Z" level=debug msg="Using HTTP Challenge provider." providerName=default.acme
time="2022-02-24T15:26:19Z" level=debug msg="legolog: [INFO] [hub.odsti.org] acme: Obtaining bundled SAN certificate"
time="2022-02-24T15:26:20Z" level=debug msg="legolog: [INFO] [hub.odsti.org] AuthURL: https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/1763208468"
time="2022-02-24T15:26:20Z" level=debug msg="legolog: [INFO] [hub.odsti.org] acme: Could not find solver for: tls-alpn-01"
time="2022-02-24T15:26:20Z" level=debug msg="legolog: [INFO] [hub.odsti.org] acme: use http-01 solver"
time="2022-02-24T15:26:20Z" level=debug msg="legolog: [INFO] [hub.odsti.org] acme: Trying to solve HTTP-01"
time="2022-02-24T15:26:37Z" level=debug msg="legolog: [INFO] Deactivating auth: https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/1763208468"
time="2022-02-24T15:26:37Z" level=error msg="Unable to obtain ACME certificate for domains \"hub.odsti.org\" : unable to generate a certificate for the domains [hub.odsti.org]: error: one or more domains had a problem:\n[hub.odsti.org] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://hub.odsti.org/.well-known/acme-challenge/s3AOq8SFHdAE2Z3AOhkIjV4gxQmZ2pdJSejLIRrAFJk: Connection refused\n" providerName=default.acme
On the run recorded above, with debug and the staging server:
$ kubectl describe pod $https_pod > autohttps_describe.txt
$ kubectl describe service proxy-public > proxy_public_service.txt
$ kubectl get events > all_events.txt
These files are all pushed to https://github.com/matthew-brett/littlest-jhub-k8s
Restarting the autohttps pod 12 minutes after the cluster comes up gives the same final error of "Connection refused". This seems like the initial attempt was doomed to fail, because the proxy-public service wasn't set up fast enough for the autohttps pod to successfully start up.
7m44s Normal EnsuringLoadBalancer service/proxy-public Ensuring load balancer
7m16s Normal Started pod/autohttps-5588fb5c97-v9px5 Started container traefik
7m8s Normal EnsuredLoadBalancer service/proxy-public Ensured load balancer
If you get a recurring error after a restart made when "Ensured load balancer" has already happened - at which point I presume you would be reachable from the internet, unless there are other issues - then it could in theory be related to a failed initial startup, if state was stashed from the first start.
Traefik isn't storing the TLS cert it acquires in a k8s Secret, but in a local file. So, we have made a solution to let that file be synced to a k8s Secret manually to keep state there. Perhaps such saved state is problematic?
Do you have such a secret created? It should be named proxy-public-tls-acme. Also, this isn't managed by the Helm chart, so it isn't cleared as part of deleting a Helm release. If you have such a secret, it would be nice if you could provide its content to be analyzed further.
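A sketch of how to dump it, if it exists (namespace as used elsewhere in this thread):

kubectl --namespace=jhub get secret proxy-public-tls-acme -o yaml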
No - I haven't seen that secret:
$ kubectl get secrets
NAME TYPE DATA AGE
autohttps-token-ztblb kubernetes.io/service-account-token 3 34m
default-token-kkg7f kubernetes.io/service-account-token 3 34m
hub Opaque 4 34m
hub-token-wcqqn kubernetes.io/service-account-token 3 34m
sh.helm.release.v1.jhub.v1 helm.sh/release.v1 1 34m
user-scheduler-token-9wdm6 kubernetes.io/service-account-token 3 34m
Okay, then you don't have state that would interfere with things. Your initial failure was very likely because your proxy-public k8s Service wasn't "ready" as a LoadBalancer type - which requires setup by GCP.
Perhaps something blocks HTTP traffic or similar, or doesn't route traffic correctly. But you said this was an intermittent failure, I recall - in what way, if so?
Intermittent in the sense that just after the upgrade to the Helm chart yesterday, with the Traefik update, the same cluster setup did acquire a certificate, in two teardown / rebuild cycles. It failed again in every test today, with the same configuration and the same Helm chart.
Maybe you have a GCP firewall rule set up to block traffic...
I suggest that you verify that a k8s Service of kind LoadBalancer with a specific IP is actually able to reach a dummy pod doing something silly - like echoing back web requests.
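A minimal sketch of such a smoke test - the resource names and demo image are illustrative, and it uses an ephemeral external IP rather than the reserved one:

# Run a trivial web server and expose it via a LoadBalancer Service
kubectl create deployment echo-test --image=nginxdemos/hello
kubectl expose deployment echo-test --type=LoadBalancer --port=80
# Wait for an EXTERNAL-IP to be assigned, then check it responds from outside the cluster
kubectl get svc echo-test -w
curl http://<EXTERNAL-IP>/
# Clean up afterwards
kubectl delete svc,deployment echo-test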
in two teardown / rebuild cycles.
How deep a teardown - which of these?
- helm delete ...
- helm delete ... + kubectl delete namespace ...
- Deleting the GKE cluster
- Deleting the GCP project (includes firewall rules etc).
Teardown meaning deleting the GKE cluster. How would a GCP firewall rule come about - I guess it would have to come from something other than the gcloud
commands and the Helm chart from the default install?
I'm not confident on what to look for to ensure the proxy-public service functions fully, but try to learn that by inspecting kubectl get svc proxy-public -o yaml, looking in the status section, and at associated events via kubectl describe svc proxy-public. The goal: to figure out if there is information given to you describing whether it fully functions or not. Also note that k8s Events don't live forever (10 minutes?), so you must check before they time out.
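For example (a sketch - the jsonpath expression is just one way to pull out the status field mentioned above):

kubectl get svc proxy-public -o jsonpath='{.status.loadBalancer.ingress}'
kubectl describe svc proxy-public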
Also, clarify if your GKE cluster was setup with a network policy controller like calico or not, and if so, what version of calico.
When you create a k8s Service resource with type LoadBalancer (proxy-public), GKE automatically does things you can observe in GCP - such as creating firewall rules, binding itself to the IP you have requested, etc.
I don't know - I have no clue what is wrong - but I'm confident you have a failure to connect between the public Let's Encrypt servers and your autohttps pod running Traefik, or Traefik itself refused network connectivity. Which of these, I can't tell, so you have to check each step along the way.
Do your best to investigate this further on your own - I can't invest more time into this but I have such a hard time not answering questions while I want to keep track of other notifications on github =/
Yes, I understand. I posted here because I assumed that, because I had the absolute default setup, it would be relevant for others. Or at least, if I don't have the absolutely default setup, I do not understand why not :)
kubectl get svc proxy-public -o yaml
shows:
status:
  loadBalancer:
    ingress:
    - ip: 34.127.108.159
Anyway - it sounds as if it would be better to wait to see if someone else runs across this - and they can contribute here.
In the meantime, I think I will have to give up, and back off to a Littlest JupyterHub setup.
And of course - I would predict that anyone cloning my repo and running those commands will get the same effect.
I reproduced it. It seems to me that the success rate is almost 100% when you are pulling a new image in the autohttps pod, assuming proxy-public has a functional IP etc already.
I could reproduce it for Traefik versions 2.4-2.6. I figure this is about the CNI (what handles networking for containers) in GKE having become slower or similar. Not sure. I'll try to figure out if it's the CNI, or perhaps if it is Traefik that is slow to become ready.
At the moment I'm rate-limited by LE's staging server.
That's great that you could reproduce it - kind of! And sorry for taking your focus - it's kind of you to persist.
Hi guys, I solved this issue by deleting the autohttps pod, forcing a restart of the service. SSL is now enabled for my JupyterHub, where it threw an invalid SSL cert error before.
This post shows the approach: https://discourse.jupyter.org/t/trouble-getting-https-letsencrypt-working-with-0-9-0-beta-4/3583/5
For completeness:
Find the JupyterHub namespace (you defined your namespace when you created the JupyterHub instance, if you did it with Helm):
kubectl get namespace
kubectl get pods --namespace <your-namespace>
Now identify the pod whose name contains "autohttps" and copy the full name to your clipboard.
kubectl delete pods --namespace <your-namespace> <autohttps-podname>
After deletion, Kubernetes will automatically spin up a new autohttps pod. You can verify this by listing the pods again:
kubectl get pods --namespace <your-namespace>
The new autohttps pod should now have a different name than your original one.
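Equivalently, as a single command (assuming the chart's component=autohttps label mentioned earlier in this thread):

kubectl delete pod --namespace <your-namespace> -l component=autohttps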
This is tested and confirmed to work with JupyterHub Helm chart version 1.1.1 and the Kubernetes installation instructions from the official docs as of 2022-02-24.
EDIT: This was deployed and tested on GCP/GKE.
I think I've nailed it, as I could make sure it didn't occur by introducing a delay between when the k8s Pod had been scheduled on a node and received an IP, and when Traefik started - either by pulling a new image on a node, or by tweaking the startup command to first run sleep 10 before starting Traefik as usual.
I've proposed a new feature for Traefik in https://github.com/traefik/traefik/issues/8803. But for now, a workaround could be to redeploy by setting new image tags to force restarts of the autohttps pod. Example config:
proxy:
  traefik:
    image:
      # tag modified to trigger a restart of the autohttps pod and induce a
      # delay while downloading the image, which ensures networking gets set up
      # in time, so that the requested ACME challenge, where the Pod will
      # receive inbound network traffic, can succeed.
      tag: 2.6.0 # default is 2.6.1
Another option is to edit the autohttps deployment like this:
kubectl edit deploy autohttps

# ...
      containers:
        - image: traefik:v2.6.1
+         command: ["sh", "-c", "sleep 10 && /entrypoint.sh traefik"]
          imagePullPolicy: IfNotPresent
# ...
This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:
https://discourse.jupyter.org/t/trouble-getting-https-letsencrypt-working-with-0-9-0-beta-4/3583/12
Thanks for really good suggestions, @consideRatio!
What do you think about programming in a restart of the autohttps pod after deployment instead? While it might not be a good solution for Traefik users in general, it could be a decent solution for JupyterHub users. It looks like it could be programmed with Terraform or a similar tool.
It would avoid both disadvantages you mentioned in the Traefik GitHub issue, wouldn't it?
Does pod deletion alternatively introduce potential problems/errors that I'm not aware of?
Please excuse all the questions here - I think you're more well versed in Kubernetes and Traefik than I am!
What do you think about programming in a restart of the autohttps pod after deployment instead?
Such a workaround is problematic for other reasons. For example, you can be rate-limited by Let's Encrypt if you repeatedly fail - say, for example, that you had forgotten to configure the domain correctly. You don't want that to happen by mistake.
I think the only healthy choice is not to re-attempt, but perhaps also to wait a bit by default before making an attempt; since this is quite a rare thing you need to do, waiting a while isn't very problematic. And if this wait can be removed by configuration, it's even less problematic.
That sounds reasonable - a wait by default, that can be removed with a non-default configuration setting.
FWIW, we've had our prod cluster down for 24 hrs now largely due to this "race condition" on GKE, and this is the 3rd time in six months it's cost me more than 6 hrs. I should learn and remember how to hack around this, and of course I'll doc it this time, but yikes - very hard to debug even once you figure out how to turn debug logs on!
@snickell yeah, it is a very complicated interaction to ensure is fully functional. I've invested a lot of time to debug and fix issues: in z2jh, which had a readiness probe; in Traefik, which had race conditions in its software; and now we have a race condition between the CNI of the k8s cluster that sets up the networking and the container startup.
I see it as a regression in GKE's CNI that is quite problematic, as not being able to count on inbound traffic being supported as soon as the pod starts is a big downside. I think the best solution for z2jh is to have a configurable delay for Traefik, so that we don't risk making multiple attempts against LE and getting rate-limited for a long time, while also giving us a good chance to start up properly right away.
@consideRatio - I see that the Traefik issue is closed now, in favor of other workarounds.
I assume the initContainers workaround suggested by Martin Jakobsson is the right one?
deployment:
  initContainers:
    - name: startup-delay
      image: busybox:1.31.1
      command: ['sh', '-c', 'sleep 5']
What is the cleanest way of applying the workaround?
@matthew-brett yepp I suggest anyone having trouble with this goes for an initContainer workaround.
proxy:
  traefik:
    extraInitContainers:
      # This startup delay can help the k8s container network interface set up
      # network routing of traffic to the pod, which is essential for the ACME
      # challenge submitted by Traefik on startup to acquire a TLS certificate.
      #
      # Sleeping for 7 seconds has been consistently sufficient to avoid issues
      # in GKE when this workaround was explored initially for GKE.
      #
      - name: startup-delay
        image: busybox:stable
        command: ["sh", "-c", "sleep 10"]
If this problem remains, it would be great to have this documented as well. I currently think of this as a bug in GKE's setup though, and hope it's fixed over time. Please note that it's relevant to know whether it's fixed or not, so feel free to comment if you who read this have an update about it - and make sure to mention the GKE version and the Helm chart version if you experience an intermittent failure to acquire certs.
@consideRatio - thanks for this - as ever. I believe the extraInitContainers
fix above requires a very recent (devel) Helm chart - I've applied it.
Do you have any ideas on where this should be reported to the GKE folks?
Do you have any ideas on where this should be reported to the GKE folks?
@matthew-brett actually no - it would be super helpful to have a report about this to track. It may very well be some non-GKE-specific component that leads to this; I'm not sure where, though.
If anyone knows where to report this, or has found a report already about it, please link it!
@matthew-brett I think the crux is that kube-proxy
is too slow to update iptables on nodes.
See https://cloud.google.com/kubernetes-engine/docs/concepts/network-overview#kube-proxy where it's said that:
kube-proxy, which is not an in-line proxy, but an egress-based load-balancing controller, watches the Kubernetes API server and continually maps the ClusterIP to healthy Pods by adding and removing destination NAT (DNAT) rules to the node's iptables subsystem. When a container running in a Pod sends traffic to a Service's ClusterIP, the node selects a Pod at random and routes the traffic to that Pod.
Perhaps something was changed to make kube-proxy slower, either by GKE configuration or setup, or by kube-proxy itself.
kube-proxy
is documented here: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/
@matthew-brett hmmm actually this may be it.
I did kubectl get ds -o yaml kube-proxy
and observed the following.
- command:
  - /bin/sh
  - -c
  - kube-proxy --cluster-cidr=10.72.0.0/14 --oom-score-adj=-998 --v=2 --feature-gates=DynamicKubeletConfig=false,InTreePluginAWSUnregister=true,InTreePluginAzureDiskUnregister=true,InTreePluginOpenStackUnregister=true,InTreePluginvSphereUnregister=true,PodDeletionCost=true
    --iptables-sync-period=1m --iptables-min-sync-period=10s --ipvs-sync-period=1m
    --ipvs-min-sync-period=10s --detect-local-mode=NodeCIDR 1>>/var/log/kube-proxy.log
    2>&1
--iptables-sync-period=1m (Default: 30s)
--iptables-min-sync-period=10s (Default: 1s)
--ipvs-sync-period=1m (Default: 30s)
--ipvs-min-sync-period=10s (no default documented)
I think these settings, if tweaked, could also fix the problem. I'd love to know more if they do, but I can't invest time to look into it atm.
@consideRatio thanks so much for your debugging here - I still had this issue with a JupyterHub for Kubernetes deployment on GKE today. Same issue (looking at the Traefik pod logs, we just weren't able to connect to the ACME challenge, so it would never issue a certificate). I spent about 9 hours debugging it even after adding your fix in; it turns out something changed recently such that 7 seconds isn't enough. I bumped it up to 20 seconds and the issues stopped.
Thank you for following up about this @Hansenq!!
If you have time and energy left, I'm curious about:
Yup, that's my workaround! The issue I was facing was this (from the traefik pod):
time="2024-01-03T04:34:38Z" level=debug msg="Using HTTP Challenge provider." providerName=default.acme
time="2024-01-03T04:34:38Z" level=debug msg="legolog: [INFO] [jupyterhub-test.lightski.com] acme: Obtaining bundled SAN certificate"
time="2024-01-03T04:34:38Z" level=debug msg="legolog: [INFO] [jupyterhub-test.lightski.com] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/300212319836"
time="2024-01-03T04:34:38Z" level=debug msg="legolog: [INFO] [jupyterhub-test.lightski.com] acme: Could not find solver for: tls-alpn-01"
time="2024-01-03T04:34:38Z" level=debug msg="legolog: [INFO] [jupyterhub-test.lightski.com] acme: use http-01 solver"
time="2024-01-03T04:34:38Z" level=debug msg="legolog: [INFO] [jupyterhub-test.lightski.com] acme: Trying to solve HTTP-01"
time="2024-01-03T04:34:42Z" level=debug msg="legolog: [INFO] Deactivating auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/300212319836"
time="2024-01-03T04:34:42Z" level=error msg="Unable to obtain ACME certificate for domains \"jupyterhub-test.lightski.com\"" ACME CA="https://acme-v02.api.letsencrypt.org/directory" routerName=default@file rule="PathPrefix(`/`)" error="unable to generate a certificate for the domains [jupyterhub-test.lightski.com]: error: one or more domains had a problem:\n[j
upyterhub-test.lightski.com] acme: error: 400 :: urn:ietf:params:acme:error:connection :: 34.168.177.244: Fetching http://jupyterhub-test.lightski.com/.well-known/acme-challenge/aRNWnJaar79ufFNeAbcpyMTTCfW2FDlBXdaSEOSbGhE: Connection refused\n" providerName=default.acme
time="2024-01-03T04:35:00Z" level=debug msg="Unable to split host and port: address jupyterhub-test.lightski.com: missing port in address. Fallback to request host." providerName=acme
time="2024-01-03T04:35:00Z" level=debug msg="Retrieving the ACME challenge for jupyterhub-test.lightski.com (token \"aRNWnJaar79ufFNeAbcpyMTTCfW2FDlBXdaSEOSbGhE\")..." providerName=acme
time="2024-01-03T04:35:00Z" level=error msg="Error getting challenge for token retrying in 723.13307ms" providerName=acme
time="2024-01-03T04:35:00Z" level=error msg="Error getting challenge for token retrying in 792.620258ms" providerName=acme
time="2024-01-03T04:35:01Z" level=error msg="Error getting challenge for token retrying in 590.054227ms" providerName=acme
time="2024-01-03T04:35:02Z" level=error msg="Error getting challenge for token retrying in 1.394355836s" providerName=acme
time="2024-01-03T04:35:03Z" level=debug msg="Unable to split host and port: address jupyterhub-test.lightski.com: missing port in address. Fallback to request host." providerName=acme
time="2024-01-03T04:35:03Z" level=debug msg="Retrieving the ACME challenge for jupyterhub-test.lightski.com (token \"aRNWnJaar79ufFNeAbcpyMTTCfW2FDlBXdaSEOSbGhE\")..." providerName=acme
To be completely fair, I tried about 10 different ways to get it to work (including getting rid of my custom baseUrl
and hub_routespec
which I thought were causing problems, trying different domains, etc.). I'm 90% sure it's the 20 seconds that got it to work.
The other potential alternative is that traefik was failing with a host with a subdomain (e.g. jupyter-test.lightski.com
), and as soon as I added a host without a subdomain (e.g. myothersubdomain.com
) it started to work. I'm pretty sure it's not the latter since that...just doesn't seem right, but I was a bit tired from all the debugging when it finally started to work.
Client Version: v1.28.4, Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3, Server Version: v1.28.3-gke.1203001
Not sure I have that
[22:22:44] CS/lightski(main)%> kubectl get ds --namespace=jupyterhub-prod
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
continuous-image-puller 2 2 2 2 2 <none> 10h
hook-image-puller 2 2 2 2 2 <none> 87m
[22:22:51] CS/lightski(main)%> kubectl get ds kube-proxy --namespace=jupyterhub-prod
Error from server (NotFound): daemonsets.apps "kube-proxy" not found
Thank you for taking the time to follow this up so thoroughly!!
You can find the kube-proxy daemonset in the kube-system namespace (it's not used with dataplane v2, but you should see it there, as dataplane v2 was disabled), so:
kubectl get ds -o yaml kube-proxy -n kube-system
oh no, same to you, for a 1.5 year old issue! You were right; here's the relevant section of the yaml (let me know if you want the whole thing):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
  creationTimestamp: "2024-01-02T19:16:04Z"
  generation: 1
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: kube-proxy
  name: kube-proxy
  namespace: kube-system
  resourceVersion: "926"
  uid: 40286d34-e3d3-4cff-a8c4-a68c802de62d
spec:
  revisionHistoryLimit: 10
  template:
    spec:
      containers:
      - command:
        - /bin/sh
        - -c
        - kube-proxy --cluster-cidr=10.124.0.0/14 --oom-score-adj=-998 --v=2 --feature-gates=InTreePluginAWSUnregister=true,InTreePluginAzureDiskUnregister=true,InTreePluginvSphereUnregister=true,DisableKubeletCloudCredentialProviders=true,UnauthenticatedHTTP2DOSMitigation=true
          --iptables-sync-period=1m --iptables-min-sync-period=10s --ipvs-sync-period=1m
          --ipvs-min-sync-period=10s --detect-local-mode=NodeCIDR 1>>/var/log/kube-proxy.log
          2>&1
        env:
        - name: KUBERNETES_SERVICE_HOST
          value: 10.138.0.9
        image: gke.gcr.io/kube-proxy-amd64:v1.28.3-gke.1203001
        imagePullPolicy: IfNotPresent
        name: kube-proxy
My JupyterHub / Kubernetes cluster is failing to acquire LetsEncrypt certificates for HTTPS.
I am working with an absolutely default z2jh / GKE setup generated from copy-paste of the current install instructions at:
https://zero-to-jupyterhub.readthedocs.io/en/latest/kubernetes/google/step-zero-gcp.html
then
https://zero-to-jupyterhub.readthedocs.io/en/latest/jupyterhub/installation.html
followed by absolute default LetsEncrypt setup at:
https://tljh.jupyter.org/en/latest/howto/admin/https.html
The exact scripts are in https://github.com/matthew-brett/littlest-jhub-k8s. The README shows the steps to set up the cluster, basically just running scripts with the default instructions copy-pasted.
config.yaml is:
I set the DNS A record about 24 hours ago, and nslookup hub.odsti.org does correctly give me 34.127.108.159, and has done for more than 12 hours.
I have not modified the default GKE setup in any way - gcloud container clusters describe jhub-cluster --zone=us-west1-b shows me:
I have no ingress controller - because I have the default setup, and I have no readinessProbe configured.
I'm using the latest Helm chart at time of writing: 1.1.3-n324.hb93a9b5f, but have also tried 1.2.0 with the same result.
What I am seeing is errors in the autohttps pod log.
gives:
If I restart the autohttps pod:
Then get the log for the new pod:
I get:
kubectl --namespace=jhub get secrets shows no proxy-public-tls-acme secret - see this thread.
Once or twice, these exact same commands did give a certificate, and HTTPS access, and did not give errors in the log - so this appears to be a race condition of some sort, but I am lost as to how to debug further.
There is some discussion of this problem on Gitter, starting at https://gitter.im/jupyterhub/jupyterhub?at=621654ca0909252318f3d7dc.