rancher / rke2

https://docs.rke2.io/
Apache License 2.0

No endpoints available for service "rke2-ingress-nginx-controller-admission" when trying to use with Rancher #3958

Closed: curtisy1 closed this issue 1 year ago

curtisy1 commented 1 year ago

Environmental Info: RKE2 Version:

rke2 version v1.24.10+rke2r1 (1ccdce2571291649b9414af1f269f645c3fe4002) go version go1.19.5 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

Linux Ubuntu-2204-jammy-amd64-base 5.15.0-60-generic # 66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

1 server, 3 agents

Describe the bug:

Trying to set up Rancher together with RKE2 does not work for me anymore. This happens on a clean install on a Hetzner dedicated root server, where I can 100% reproduce this behaviour (clean install meaning there are no other dependencies installed except for the minimal Ubuntu packages).

EDIT: I should probably mention that it works just fine with k3s and traefik, so this seems to be rke2 related.

Steps To Reproduce:

Here's a gist of the setup script I'm currently using. I'm sure it could be improved, but it used to work just fine before. It basically boils down to:
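Sketched from memory, so the exact flags may differ slightly from the gist; the hostname and password are placeholders:

curl -sfL https://get.rke2.io | sh -
systemctl enable --now rke2-server.service
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml

helm repo add jetstack https://charts.jetstack.io
helm install cert-manager jetstack/cert-manager \
    --namespace cert-manager --create-namespace \
    --set installCRDs=true

helm repo add rancher-stable https://releases.rancher.com/server-charts/stable
helm install rancher rancher-stable/rancher \
    --namespace cattle-system --create-namespace \
    --set hostname="[Server IP].sslip.io" \
    --set bootstrapPassword="$ServerPassword"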

Expected behavior:

I get a nice and shiny Rancher UI I can use, be it via my own subdomain or a sample DNS entry using sslip.io.

Actual behavior:

The install fails at the last step.

Additional context / logs:

INSTALLATION FAILED: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": failed to call webhook: Post "https://rke2-ingress-nginx-controller-admission.kube-system.svc:443/networking/v1/ingresses?timeout=10s": no endpoints available for service "rke2-ingress-nginx-controller-admission"

brandond commented 1 year ago

Have you looked at the pod status or error logs to see why there are no endpoints? It's hard to help without knowing what is actually going on with the pods.
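Something along these lines should show it (the admission service name comes from your error message; substitute the actual controller pod name that kubectl lists):

kubectl -n kube-system get pods -l app.kubernetes.io/name=rke2-ingress-nginx -o wide
kubectl -n kube-system describe pod <ingress-nginx-controller pod>
kubectl -n kube-system logs <ingress-nginx-controller pod>
kubectl -n kube-system get endpoints rke2-ingress-nginx-controller-admission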

curtisy1 commented 1 year ago

Gotcha! I haven't looked at the logs yet, but I'll try to reproduce again and grab some logs later today on my personal machine when I'm home, since the other box is working fine with k3s now (never change a running system).

curtisy1 commented 1 year ago

Alright, sorry for the wait. Here's the output from kubectl -n kube-system logs rke2-ingress-nginx-controller-4zc26

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       nginx-1.4.1-hardened2
  Build:         git-452bd444e
  Repository:    https://github.com/rancher/ingress-nginx.git
  nginx version: nginx/1.19.10

-------------------------------------------------------------------------------

W0307 20:53:36.822501       7 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0307 20:53:36.823278       7 main.go:209] "Creating API client" host="https://10.43.0.1:443"
I0307 20:53:36.849764       7 main.go:253] "Running in Kubernetes cluster" major="1" minor="24" git="v1.24.10+rke2r1" state="clean" commit="5c1d2d4295f9b4eb12bfbf6429fdf989f2ca8a02" platform="linux/amd64"
I0307 20:53:36.959132       7 main.go:104] "SSL fake certificate created" file="/etc/ingress-controller/ssl/default-fake-certificate.pem"
I0307 20:53:36.977878       7 ssl.go:533] "loading tls certificate" path="/usr/local/certificates/cert" key="/usr/local/certificates/key"
I0307 20:53:37.020201       7 nginx.go:260] "Starting NGINX Ingress controller"
I0307 20:53:37.057882       7 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"rke2-ingress-nginx-controller", UID:"931c99f4-d21f-4aad-ae02-1c3ba69dea81", APIVersion:"v1", ResourceVersion:"1122", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap kube-system/rke2-ingress-nginx-controller
I0307 20:53:38.222033       7 nginx.go:303] "Starting NGINX process"
I0307 20:53:38.222210       7 leaderelection.go:248] attempting to acquire leader lease kube-system/ingress-controller-leader...
I0307 20:53:38.223991       7 nginx.go:323] "Starting validation webhook" address=":8443" certPath="/usr/local/certificates/cert" keyPath="/usr/local/certificates/key"
I0307 20:53:38.230060       7 controller.go:168] "Configuration changes detected, backend reload required"
I0307 20:53:38.241260       7 leaderelection.go:258] successfully acquired lease kube-system/ingress-controller-leader
I0307 20:53:38.241934       7 status.go:84] "New leader elected" identity="rke2-ingress-nginx-controller-4zc26"
I0307 20:53:38.265937       7 status.go:214] "POD is not ready" pod="kube-system/rke2-ingress-nginx-controller-4zc26" node="main"
I0307 20:53:38.367269       7 controller.go:185] "Backend successfully reloaded"
I0307 20:53:38.367617       7 controller.go:196] "Initial sync, sleeping for 1 second"
I0307 20:53:38.367784       7 event.go:285] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"rke2-ingress-nginx-controller-4zc26", UID:"95532d48-8ee0-4291-833c-d91ee4f7e9fe", APIVersion:"v1", ResourceVersion:"1155", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration

Oddly enough, the controller seems to start up just fine. However, there's another interesting error that correlates with this one, judging by what

helm install rancher rancher-stable/rancher \
    --namespace cattle-system \
    --set hostname="[Server IP].sslip.io" \
    --set bootstrapPassword="$ServerPassword"

throws at me

E0307 21:53:21.472133   91149 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

I suspect this has something to do with the ingress, since Helm seems to hit it during startup? The next thing I get after this error message is the good old

Error: INSTALLATION FAILED: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": failed to call webhook: Post "https://rke2-ingress-nginx-controller-admission.kube-system.svc:443/networking/v1/ingresses?timeout=10s": no endpoints available for service "rke2-ingress-nginx-controller-admission"

If you need any other log files, feel free to ask

brandond commented 1 year ago

The logs look fine, but for some reason there are no endpoints for the webhook service. Can you get the output of kubectl get pod -A -o wide and kubectl get service -A -o wide?

curtisy1 commented 1 year ago

Sure thing! Here's the output of kubectl get pod -A -o wide

NAMESPACE      NAME                                                    READY   STATUS      RESTARTS   AGE    IP               NODE                NOMINATED NODE   READINESS GATES
cert-manager   cert-manager-85945b75d4-z568w                           1/1     Running     0          99s    10.42.0.4        ubuntu-4gb-fsn1-1   <none>           <none>
cert-manager   cert-manager-cainjector-7f694c4c58-4ftsq                1/1     Running     0          99s    10.42.0.7        ubuntu-4gb-fsn1-1   <none>           <none>
cert-manager   cert-manager-webhook-7cd8c769bb-ddtjs                   1/1     Running     0          99s    10.42.0.6        ubuntu-4gb-fsn1-1   <none>           <none>
kube-system    cloud-controller-manager-ubuntu-4gb-fsn1-1              1/1     Running     0          112s   123.201.116.44   ubuntu-4gb-fsn1-1   <none>           <none>
kube-system    etcd-ubuntu-4gb-fsn1-1                                  1/1     Running     0          111s   123.201.116.44   ubuntu-4gb-fsn1-1   <none>           <none>
kube-system    helm-install-rke2-canal-9q4rk                           0/1     Completed   0          100s   123.201.116.44   ubuntu-4gb-fsn1-1   <none>           <none>
kube-system    helm-install-rke2-coredns-bb6qv                         0/1     Completed   0          100s   123.201.116.44   ubuntu-4gb-fsn1-1   <none>           <none>
kube-system    helm-install-rke2-ingress-nginx-m2jmk                   0/1     Completed   0          100s   10.42.0.3        ubuntu-4gb-fsn1-1   <none>           <none>
kube-system    helm-install-rke2-metrics-server-jwwbx                  0/1     Completed   0          100s   10.42.0.8        ubuntu-4gb-fsn1-1   <none>           <none>
kube-system    kube-apiserver-ubuntu-4gb-fsn1-1                        1/1     Running     0          112s   123.201.116.44   ubuntu-4gb-fsn1-1   <none>           <none>
kube-system    kube-controller-manager-ubuntu-4gb-fsn1-1               1/1     Running     0          105s   123.201.116.44   ubuntu-4gb-fsn1-1   <none>           <none>
kube-system    kube-proxy-ubuntu-4gb-fsn1-1                            1/1     Running     0          108s   123.201.116.44   ubuntu-4gb-fsn1-1   <none>           <none>
kube-system    kube-scheduler-ubuntu-4gb-fsn1-1                        1/1     Running     0          105s   123.201.116.44   ubuntu-4gb-fsn1-1   <none>           <none>
kube-system    rke2-canal-65fhj                                        2/2     Running     0          89s    123.201.116.44   ubuntu-4gb-fsn1-1   <none>           <none>
kube-system    rke2-coredns-rke2-coredns-58fd75f64b-4ftgk              1/1     Running     0          90s    10.42.0.5        ubuntu-4gb-fsn1-1   <none>           <none>
kube-system    rke2-coredns-rke2-coredns-autoscaler-768bfc5985-p9q6d   1/1     Running     0          90s    10.42.0.9        ubuntu-4gb-fsn1-1   <none>           <none>
kube-system    rke2-ingress-nginx-controller-fx5vc                     1/1     Running     0          48s    10.42.0.12       ubuntu-4gb-fsn1-1   <none>           <none>
kube-system    rke2-metrics-server-74f878b999-w92rj                    1/1     Running     0          58s    10.42.0.10       ubuntu-4gb-fsn1-1   <none>           <none>

And here's the output of kubectl get service -A -o wide

NAMESPACE       NAME                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE     SELECTOR
cattle-system   rancher                                   ClusterIP   10.43.125.234   <none>        80/TCP,443/TCP   75s     app=rancher
cert-manager    cert-manager                              ClusterIP   10.43.232.23    <none>        9402/TCP         2m22s   app.kubernetes.io/component=controller,app.kubernetes.io/instance=cert-manager,app.kubernetes.io/name=cert-manager
cert-manager    cert-manager-webhook                      ClusterIP   10.43.27.97     <none>        443/TCP          2m22s   app.kubernetes.io/component=webhook,app.kubernetes.io/instance=cert-manager,app.kubernetes.io/name=webhook
default         kubernetes                                ClusterIP   10.43.0.1       <none>        443/TCP          2m31s   <none>
kube-system     rke2-coredns-rke2-coredns                 ClusterIP   10.43.0.10      <none>        53/UDP,53/TCP    2m5s    app.kubernetes.io/instance=rke2-coredns,app.kubernetes.io/name=rke2-coredns,k8s-app=kube-dns
kube-system     rke2-ingress-nginx-controller-admission   ClusterIP   10.43.208.50    <none>        443/TCP          83s     app.kubernetes.io/component=controller,app.kubernetes.io/instance=rke2-ingress-nginx,app.kubernetes.io/name=rke2-ingress-nginx
kube-system     rke2-metrics-server                       ClusterIP   10.43.76.197    <none>        443/TCP          93s     app=rke2-metrics-server,release=rke2-metrics-server

brandond commented 1 year ago

Are you perhaps just trying to install Rancher before all the components are done starting up? Everything looks fine now. If you try to install Rancher before nginx is done starting, yeah you'll get errors because the webhook has been added but nginx isn't ready yet.

curtisy1 commented 1 year ago

That... seems to be it. At least if I manually wait for a few seconds I can install without any errors. Thanks for clearing that up!

I'm guessing what's still confusing me is how this used to work without any wait on my end before. But since I can't pinpoint the release I used anymore, and it might very well have been due to overall machine slowness at the time, I'll close this as solved since it clears up my confusion.

Thanks again for helping me out with this!

e-minguez commented 1 year ago

Just in case, if you want to ensure the nginx controller has already been deployed, you can use:

while ! kubectl rollout status daemonset -n kube-system rke2-ingress-nginx-controller --timeout=60s; do sleep 2 ; done
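Or, equivalently, wait on the controller pods directly using the labels from the service output above (assuming the chart's labels haven't changed):

# block until the ingress-nginx controller pods report Ready (the pods must already exist)
kubectl -n kube-system wait pod \
    --for=condition=Ready \
    -l app.kubernetes.io/name=rke2-ingress-nginx \
    --timeout=300s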