rancher / webhook

Rancher webhook for Kubernetes
Apache License 2.0
22 stars 62 forks source link

Startup probe failed: HTTP probe failed with statuscode: 500 #291

Open robcharlwood opened 11 months ago

robcharlwood commented 11 months ago

Hi

We are seeing a problem with the latest version of the rancher-webhook (0.3.5) when running alongside the latest rancher (2.7.6). In both the Rancher HA cluster and imported K3S and GKE downstream clusters, the webhook pod has a warning about startup probe checks failing with status code 500.

Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  15s               default-scheduler  Successfully assigned cattle-system/rancher-webhook-998454b77-nvch5 to <redacted>
  Normal   Pulled     14s               kubelet            Container image "rancher/rancher-webhook:v0.3.5" already present on machine
  Normal   Created    14s               kubelet            Created container rancher-webhook
  Normal   Started    14s               kubelet            Started container rancher-webhook
  Warning  Unhealthy  5s (x2 over 10s)  kubelet            Startup probe failed: HTTP probe failed with statuscode: 500

If left for long enough, it eventually starts failing with a liveness probe error:

Events:
  Type     Reason     Age                 From     Message
  ----     ------     ----                ----     -------
  Warning  Unhealthy  41m (x52 over 19h)  kubelet  Liveness probe failed: Get "https://XXX.XXX.XXX.XXX:9443/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

This is only ever generated as a warning and the pod itself never becomes unhealthy. The pod itself also does not give any useful logs:

time="2023-09-13T10:22:52Z" level=info msg="Rancher-webhook version v0.3.5 (2e89c65) is starting"
time="2023-09-13T10:22:52Z" level=info msg="Active TLS secret cattle-system/cattle-webhook-tls (ver=5511970) (count 1): map[listener.cattle.io/cn-rancher-webhook.cattle-system.svc:rancher-webhook.cattle-system.svc listener.cattle.io/fingerprint:SHA1=XXXXXXXXXXXXXXXXXXXXXXXXXXXX]"
time="2023-09-13T10:22:52Z" level=info msg="Listening on :9443"
time="2023-09-13T10:22:52Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRole controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting management.cattle.io/v3, Kind=Cluster controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting management.cattle.io/v3, Kind=ClusterRoleTemplateBinding controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting management.cattle.io/v3, Kind=GlobalRole controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting /v1, Kind=Secret controller"
time="2023-09-13T10:22:52Z" level=info msg="Sleeping for 15 seconds then applying webhook config"
time="2023-09-13T10:22:52Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=RoleBinding controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting management.cattle.io/v3, Kind=PodSecurityAdmissionConfigurationTemplate controller"
time="2023-09-13T10:22:52Z" level=info msg="Starting provisioning.cattle.io/v1, Kind=Cluster controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting management.cattle.io/v3, Kind=ProjectRoleTemplateBinding controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting apiregistration.k8s.io/v1, Kind=APIService controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting apiextensions.k8s.io/v1, Kind=CustomResourceDefinition controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=Role controller"
time="2023-09-13T10:22:53Z" level=info msg="Starting management.cattle.io/v3, Kind=RoleTemplate controller"
time="2023-09-13T10:22:53Z" level=info msg="Updating TLS secret for cattle-system/cattle-webhook-tls (count: 1): map[listener.cattle.io/cn-rancher-webhook.cattle-system.svc:rancher-webhook.cattle-system.svc listener.cattle.io/fingerprint:SHA1=XXXXXXXXXXXXXXXXXXXXXXXXXXXX]"

This rancher is deployed in the following manner:

Any help or advice on this issue would be appreciated.

Many thanks!

KevinJoiner commented 11 months ago

I don't have any immediate solutions to your problem, but it looks like the root cause is that kube-apiserver cannot communicate with the container running on the cluster.

To verify the problem is not with the webhook, you can validate the webhook configuration was created successfully kubectl get validatingwebhookconfigurations rancher.cattle.io -o yaml

Your other solution, which is available on Rancher:v2.7-head but has not been released yet would be to have the webhook run on port 443 https://github.com/rancher/rancher/issues/41142#issuecomment-1670123499

robcharlwood commented 11 months ago

@KevinJoiner Thanks! I will check this and get back to you.

robcharlwood commented 10 months ago

@KevinJoiner - So I ran the suggested command and YAML was returned successfully. I can't see anything problematic in the output. Is there anything specific I should be looking for?

KevinJoiner commented 10 months ago

@robcharlwood No, if the resource exists and the webhook is not logging any errors we can have higher confidence that the problem is with the connection between the kube-apiserver and the rancher-webhook pod.

  1. I would double checks the steps for adding the firewall rule to make sure it is correctly configured since the symptoms seem to match https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters#api_request_that_triggers_admission_webhook_timing_out

  2. You can try to edit the deployment of the Webhook and remove the startupProbe and livelinessProbe and see if things start to work. I don't expect this to fix the problem since other requests will most likely time out when you try to create a RoleTemplate, but if it does work, we might have a bug on our side.

robcharlwood commented 10 months ago

@KevinJoiner Thanks! I will investigate and report back!

danipanz commented 8 months ago

We are experiencing the same issue.

Firewall rules allow any communication between nodes (trusted)

danipanz commented 8 months ago

Adding some extra info:

For the moment being we will try and remove the startupProbe and livelinessProbe

Vox1984 commented 7 months ago

I have very same issue:

Rancher UI: 2.7.9
RKE version: v1.5.1   
K8s: v1.25.16

kubectl describe pod -n cattle-system rancher-webhook-7879bb6c5-vb7ss

Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  85s (x28071 over 2d1h)  kubelet  Startup probe failed: HTTP probe failed with statuscode: 500

Problem started after upgrading from previous version.

rahadiangg commented 2 months ago

I have same issue

Rancher chart: rancher-2.8.1
Rancher webhook chart: rancher-webhook-103.0.1+up0.4.2
Kubernetes: v1.27.13

When try to hit the /healthz endpoint, I got this log message:

[-]Config Applied failed: reason withheld
healthz check failed

I'm struggle with reason withheld error because can't find out what the root cause.