k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Random webhook failures with EOF #5835

Closed · ansarhun closed this issue 2 years ago

ansarhun commented 2 years ago

Environmental Info: K3s Version:

k3s version v1.23.8+k3s1 (53f2d4e7) go version go1.17.5

Node(s) CPU architecture, OS, and Version:

Linux k8s-controller-1 5.15.0-39-generic #42-Ubuntu SMP Thu Jun 9 23:42:32 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

1 server, 3 agent

Describe the bug:

When FluxCD tries to do a dry-run on resources that have a webhook configured (e.g. cert-manager), the dry-run fails with an EOF error. On the webhook side it does not look like a proper HTTPS/TLS connection: first record does not look like a TLS handshake
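
For reference, the failing call can be approximated outside of FluxCD with a server-side dry-run against any resource type that has a webhook registered. This is only an illustrative sketch, assuming cert-manager is installed and a ClusterIssuer named selfsigned exists (both names are placeholders):

cat <<'MANIFEST' | kubectl apply --dry-run=server -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: dry-run-test
spec:
  secretName: dry-run-test-tls
  dnsNames:
    - example.internal
  issuerRef:
    name: selfsigned       # placeholder issuer name
    kind: ClusterIssuer
MANIFEST

When the bug triggers, the apply fails with the same failed calling webhook "webhook.cert-manager.io" ... EOF error shown in the logs below.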

Steps To Reproduce:


**Expected behavior:**
Dry run succeeds

**Actual behavior:**
Dry run sometimes fails with the same configuration

**Additional context / logs:**
Dry run error:

dry-run failed, reason: InternalError, error: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": failed to call webhook: Post "https://kube-prometheus-stack-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": EOF


k3s logs from journalctl

k3s[682]: W0710 08:03:32.023261 682 dispatcher.go:195] Failed calling webhook, failing closed webhook.cert-manager.io: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.network.svc:443/mutate?timeout=10s": EOF


cert-manager webhook pod logs

I0710 08:03:32.022971 1 logs.go:59] http: TLS handshake error from 10.42.0.0:50900: tls: first record does not look like a TLS handshake



**Backporting**
- [ ] Needs backporting to older releases

consideRatio commented 2 years ago

Note that I found this issue by searching for first record does not look like a TLS handshake, which is what I observe when debugging #5633. So it may be worth trying the workarounds suggested there.

One person resolved their issue by passing --egress-selector-mode=agent, while I had success with --egress-selector-mode=disabled, and observed the need to set that if using a version newer than v1.22.10+k3s1, v1.23.7+k3s1, or v1.24.1+k3s1.
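
For anyone wanting to try this, a minimal sketch of applying the workaround on a server node (assuming the default config file location and a systemd-managed install):

echo 'egress-selector-mode: disabled' | sudo tee -a /etc/rancher/k3s/config.yaml
sudo systemctl restart k3s
# or pass it directly on the command line: k3s server --egress-selector-mode=disabled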

ansarhun commented 2 years ago

@consideRatio thank you! I will try the egress option to see if that helps

brandond commented 2 years ago

Try updating to a newer release. There were a few issues discovered with the egress-selector support which have since been addressed.

consideRatio commented 2 years ago

@brandond I can reproduce the intermittent issues in the latest three minor releases.

Here are three intermittent failures including logs for the latest versions of k3s. The logs are available in a later "step" under journalctl -xe.

EDIT: the same issue is observed with the RC1 for 1.24.3, 1.23.9, and 1.22.12.

ansarhun commented 2 years ago

Same result here. After upgrading to the latest stable version (v1.23.8+k3s2) the issue still occurs. For the egress-selector-mode option, agent mode did not work for me either, but disabled seems to work, or at least I haven't noticed any new errors in the last 12h.

brandond commented 2 years ago

We're about to release RCs for today's upstream patch releases, which will include some updates to the egress-selector code and remotedialer library. I'm curious if this will address the issues you're seeing.

marshallford commented 2 years ago

Looking forward to the RCs, I'm having the same issue as well on 1.24.2+k3s1 and 1.24.2+k3s2. Observing both the first record does not look like a TLS handshake message from the k3s logs and the error dialing backend: EOF message from kubectl. Just my luck that I've been testing a move from Calico to Cilium alongside an nftables config change around the time this issue occurred...

brandond commented 2 years ago

If anyone can provide instructions to reproduce it on demand, that'd be great. I've tried with Calico+Docker, which was reported to cause issues, but haven't seen any problems yet.

marshallford commented 2 years ago

Upgrading to 1.24.3+k3s1 didn't fix my issue, however setting egress-selector-mode to disabled did seem to work.

consideRatio commented 2 years ago

I suggest this is marked as a duplicate of https://github.com/k3s-io/k3s/issues/5633 where more discussion has been made.

ansarhun commented 2 years ago

With the current workaround I did not notice any more errors, and I was unable to create a small project where I can reproduce the issue reliably.

riuvshyn commented 2 years ago

Having exactly the same issue, but for me it is more constant than random failures. I was trying to upgrade k3s from the current version v1.21.11+k3s1, where I don't have such issues, to v1.22.12+k3s1 and v1.23.9+k3s1; on both 1.22.x and 1.23.x I am experiencing issues with mutating/validating webhooks.

As an example, I am running Rancher on that k3s cluster, and when I try to create a new cluster with Rancher I constantly get:

Internal error occurred: failed calling webhook "rancher.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation?timeout=10s": EOF

I didn't try to set egress-selector-mode yet.

UPDATE:

I've also tried upgrading to v1.24.3+k3s1 and hit the same issue with webhooks. However, setting egress-selector-mode: disabled seems to fix it; once I set it to disabled I no longer see any failed requests to webhooks.

brandond commented 2 years ago

@riuvshyn can you provide more information on your configuration? Are you using the default flannel CNI, or have you replaced it with something else?

onedr0p commented 2 years ago

For me, I'm using Calico.

riuvshyn commented 2 years ago

@brandond sure, I am using aws eni cni. my k3s config file:

egress-selector-mode: disabled

node-name: $${LOCAL_HOSTNAME}
flannel-backend: none
disable-cloud-controller: true
node-ip: $${LOCAL_IPV4}
node-external-ip: $${LOCAL_IPV4}
kube-apiserver-arg:
- "service-account-key-file=/etc/rancher/service-account.pub"
- "service-account-issuer=${service_account_issuer}"
- "service-account-signing-key-file=/etc/rancher/service-account.key"
- "enable-admission-plugins=NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota,NodeRestriction,Priority,TaintNodesByCondition,PersistentVolumeClaimResize,PodSecurityPolicy"
- "audit-log-path=/var/log/k8s-audit/audit.log"
- "audit-policy-file=/etc/rancher/k3s/audit_policy.yaml"
- "audit-log-maxage=90"
- "audit-log-maxbackup=10"
- "audit-log-maxsize=500"
kube-controller-arg:
- "service-account-private-key-file=/etc/rancher/service-account.key"
kubelet-arg:
- "cloud-provider=external"
- "provider-id=$${CLOUD_CONTROLLER_PROVIDER_ID}"

tls-san:
- "${k3s_tls_san}"
no-deploy:
- "traefik"
- "metrics-server"
- "servicelb"

consideRatio commented 2 years ago

My understanding is that this kind of failure (assumed to be the same as #5633) has been seen on a wide variety of k3s configurations. My overview is that we have seen reports of this occurring for:

blackandred commented 2 years ago

Also impacted by this issue on K3s with Calico, Flannel and Wireguard. Version: v1.24.3+k3s1

$ kubectl exec -it argocd-repo-server-695669547d-n2scn -- bash
Defaulted container "argocd-repo-server" out of: argocd-repo-server, copyutil (init)
Error from server: error dialing backend: EOF

$ kubectl exec -it argocd-repo-server-695669547d-n2scn -- bash
Defaulted container "argocd-repo-server" out of: argocd-repo-server, copyutil (init)
Error from server: error dialing backend: EOF

$ kubectl exec -it argocd-repo-server-695669547d-n2scn -- bash
Defaulted container "argocd-repo-server" out of: argocd-repo-server, copyutil (init)
argocd@argocd-repo-server-695669547d-n2scn:~$

brandond commented 2 years ago

@blackandred can you start k3s with the --debug flag, and see if you get anything interesting in the logs when those exec commands fail? I continue to be unable to reproduce this on demand on recent releases.
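
For anyone gathering these logs, one way to do it (a sketch assuming a systemd install and the default config path; the flag can also be added to the ExecStart line of the k3s unit instead):

echo 'debug: true' | sudo tee -a /etc/rancher/k3s/config.yaml
sudo systemctl restart k3s
# re-run the failing exec/port-forward commands and watch for tunnel or handshake errors
journalctl -u k3s -f | grep -E 'Tunnel server|TLS handshake|error dialing backend'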

kbpersonal commented 2 years ago

I can reproduce the issue on v1.25.0+k3s1 using Alpine Linux v3.16, containerd://1.6.8-k3s1 for the container runtime, and Calico as the CNI. I need to present two demos tomorrow, so I don't want to mess with my cluster right now, but once that's done I'll try with --debug and do A/B tests with the different egress-selector-mode options.

james-mchugh commented 2 years ago

I have run into the same issue on K3S 1.23.10+k3s1 and 1.22.15+k3s1. We frequently run into it when utilizing port-forwards via the Kubernetes API, but we have also seen it when running execs. We are using a vanilla K3S instance with no custom configs, except for using the Docker container runtime.
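
In case it helps with reproduction, here is a sketch of a loop that exercises the port-forward path repeatedly (pod name and ports are placeholders; any "error dialing backend: EOF" failures show up on stderr):

for i in $(seq 1 100); do
  kubectl port-forward pod/<pod-name> 8080:80 >/dev/null &
  PF_PID=$!
  sleep 1
  kill "$PF_PID" 2>/dev/null
  wait "$PF_PID" 2>/dev/null
done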

Here is the command we use to start the K3S server via systemd:

k3s server --docker

We have no additional custom configurations. This is the command used to install K3S within the VM.

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.23.10+k3s1" sh -s - --docker

This is on an Ubuntu 20.04 LTS VM.

Here are the debug logs from the latest encounter of the error:

TP/1.1 CONNECT request for //172.16.67.5:10250 from 127.0.0.1:35478"
Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server egress proxy dialing 127.0.0.1:10250 directly"
Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server handing HTTP/1.1 CONNECT request for //172.16.67.5:10250 from 127.0.0.1:35482"
Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server egress proxy dialing 127.0.0.1:10250 directly"
Oct 04 11:13:02 ubuntu k3s[723111]: I1004 11:13:02.644540  723111 log.go:184] http: TLS handshake error from 127.0.0.1:49908: tls: first record does not look like a TLS handshake
Oct 04 11:13:02 ubuntu k3s[723111]: E1004 11:13:02.644893  723111 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"error dialing backend: EOF"}: error dialing backend: EOF
Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server handing HTTP/1.1 CONNECT request for //172.16.67.5:10250 from 127.0.0.1:35492"
Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server egress proxy dialing 127.0.0.1:10250 directly"
Oct 04 11:13:03 ubuntu k3s[723111]: time="2022-10-04T11:13:03-04:00" level=debug msg="Tunnel server handing HTTP/1.1 CONNECT request for //172.16.67.5:10250 from 127.0.0.1:55392"
Oct 04 11:13:03 ubuntu k3s[723111]: time="2022-10-04T11:13:03-04:00" level=debug msg="Tunnel server egress proxy dialing 127.0.0.1:10250 directly"
Oct 04 11:13:04 ubuntu k3s[723111]: time="2022-10-04T11:13:04-04:00" level=debug msg="Tunnel server handing HTTP/1.1 CONNECT request for //172.16.67.5:10250 from 127.0.0.1:55406"
Oct 04 11:13:04 ubuntu k3s[723111]: time="2022-10-04T11:13:04-04:00" level=debug msg="Tunnel server egress proxy dialing 127.0.0.1:10250 directly"

Setting egress-selector-mode: disabled in the k3s config seemed to resolve it.

I don't have a 100% guaranteed method of reproducing it, but for me, the easiest way is to execute 100 or so exec commands in a row. Usually, most of them will pass, but a few will produce the EOF errors.

for i in $(seq 1 100); do kubectl exec <pod-name> -- ls -la > /dev/null; done

brandond commented 2 years ago

Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server handing HTTP/1.1 CONNECT request for //172.16.67.5:10250 from 127.0.0.1:35482"
Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server egress proxy dialing 127.0.0.1:10250 directly"
Oct 04 11:13:02 ubuntu k3s[723111]: I1004 11:13:02.644540  723111 log.go:184] http: TLS handshake error from 127.0.0.1:49908: tls: first record does not look like a TLS handshake
Oct 04 11:13:02 ubuntu k3s[723111]: E1004 11:13:02.644893  723111 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"error dialing backend: EOF"}: error dialing backend: EOF

Is this a single-node cluster, and 172.16.67.5 is the local node? Do you see the same behavior when using the embedded containerd instead of docker?

james-mchugh commented 2 years ago

This is a single node cluster, and yes, 172.16.67.5 is the local node. I do see the same issue when using the embedded containerd. I know of two other users between my team and teams we've interfaced with who have seen similar issues. Here are the details:

User 1:

User 2:

brandond commented 2 years ago

What kind of CPU/memory/disk resources do these nodes have? What pod are you execing into; is it one of the packaged ones or something that you've deployed yourself? Are you running the kubectl command locally, or on another host?

I've tried load-testing both logs and exec requests, and haven't been able to trigger it on any of my dev hosts, or on an EC2 instance - so anything that might help me hit it on demand would be great.

I'm testing by doing this, which I believe should be pretty similar to what you're doing:

while true; do kubectl exec -n kube-system local-path-provisioner-5b5579c644-sphmc -- ls -la >/dev/null; done

I can leave that running for hours and not see any issue :(

james-mchugh commented 2 years ago

I've run into the issue on a machine with the following specs:

My Machine:

I've run into the issue when running kubectl commands on the host machine and within the Guest VM where Kubernetes is running. I have run into it with pods I have deployed and pods in the kube-system namespace such as the local-path-provisioner as you used in your example.

I don't have all of the details on the other users, but here is what I have

User 1:

User 2:

brandond commented 2 years ago

Thanks for the info!

james-mchugh commented 2 years ago

I tested this on my other computer where the host OS runs Arch Linux. I have not seen the error on this machine. Here are the specs:

LarsBingBong commented 2 years ago

We're also seeing this on Ubuntu 20.04 Kernel v5.13.0-37 from the HWE line. Are we potentially looking into this being a K3s<>Ubuntu combo issue?

onedr0p commented 2 years ago

> We're also seeing this on Ubuntu 20.04 Kernel v5.13.0-37 from the HWE line. Are we potentially looking into this being a K3s<>Ubuntu combo issue?

That is doubtful, I am seeing the issue using Fedora Server as well.

brandond commented 2 years ago

I have been able to reproduce this on a node with more cores than my usual dev host and testing instances have, which suggests there may be some sort of race condition. It looks like something is occasionally consuming the first byte of the proxied TLS handshake, which of course breaks things. It feels a lot like https://github.com/golang/go/issues/27408 which had some fixes merged a while back but is still open?

LarsBingBong commented 2 years ago

Oh, that's some deep-level nerdy stuff. However, that issue is from the wayback machine, AKA the year 2018. Of course it's hard to say, but what are the chances of that getting solved? Still, what go#27408 points to sounds quite interesting.

james-mchugh commented 2 years ago

Agreed, it is an interesting issue to keep an eye on, but there must have been a change on the K3S side (or at least in one of its dependencies) that uncovered it. If that Golang issue were the only culprit, I would expect to run into the same issue in K3S <= 1.21.x, since that Golang bug was discovered back in Go 1.11.

brandond commented 2 years ago

Yes, the k3s code that uses this was introduced in https://github.com/k3s-io/k3s/pull/5382 as part of an attempt to reduce the number of patches to upstream Kubernetes that we carry. That bit is not a mystery.

james-mchugh commented 2 years ago

Oh I see, that makes sense. Thank you for clarifying, and sorry for the misunderstanding

brandond commented 2 years ago

A fix for this should make it into this month's release cycle. Thanks for your patience everyone, and providing me with enough info to finally reproduce it reliably.

ShylajaDevadiga commented 2 years ago

Validated using k3s version v1.25.3-rc3+k3s1

Config.yaml

write-kubeconfig-mode: 644
cluster-init: true
kubelet-arg:
  - port=65535

Steps to reproduce with kubelet port overridden using kubelet-arg to use 65535

  1. Install k3s single-node cluster on a host with many cores (seemed to take 32 to reproduce reliably for me).

  2. Run the following command; it should run essentially forever until stopped, with no errors: POD=$(kubectl get pod -n kube-system -l app=local-path-provisioner -o jsonpath='{.items[].metadata.name}'); while true; do kubectl exec -n kube-system $POD -- ls -la > /dev/null || break; done

  3. Run k3s with 1 server and 1 agent with --kubelet-arg=port=65535 and port 65535 not open in the firewall or security group between the nodes; note that kubectl logs and kubectl exec for a pod on the agent node work properly (a quick port check is shown after the failing output below).

Issue: Unable to exec or get logs from pods on agent node

ubuntu@ip-172-31-15-222:~$ kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP          NODE               NOMINATED NODE   READINESS GATES
othertest-deploy-5mmpg   1/1     Running   0          8s    10.42.0.9   ip-172-31-15-222   <none>           <none>
othertest-deploy-mmt6z   1/1     Running   0          8s    10.42.1.3   ip-172-31-24-165   <none>           <none>

ubuntu@ip-172-31-15-222:~$ kubectl logs othertest-deploy-mmt6z 
Error from server: Get "https://172.31.24.165:65535/containerLogs/default/othertest-deploy-mmt6z/othertest": EOF

ubuntu@ip-172-31-15-222:~$ POD=$(kubectl get pod -n kube-system -l app=local-path-provisioner -o jsonpath='{.items[].metadata.name}'); while true; do kubectl exec -n kube-system $POD -- ls -la > /dev/null || break; done
Error from server: error dialing backend: EOF
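
As a sanity check of the step-3 setup (confirming the kubelet port really is blocked between the nodes, so any working exec/logs must be going through the tunnel), something like the following can be run from the server node, using the agent IP and port from the example above:

nc -zv -w 3 172.31.24.165 65535 || echo "kubelet port is blocked as expected"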

Validation: kubectl logs and kubectl exec for pod on agent node work properly.

ubuntu@ip-172-31-9-182:~$ kubectl logs othertest-deploy-28lwn
2022/10/21 18:00:13 [notice] 1#1: using the "epoll" event method
2022/10/21 18:00:13 [notice] 1#1: nginx/1.21.4
2022/10/21 18:00:13 [notice] 1#1: built by gcc 10.2.1 20210110 (Debian 10.2.1-6) 
2022/10/21 18:00:13 [notice] 1#1: OS: Linux 5.15.0-1019-aws
2022/10/21 18:00:13 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2022/10/21 18:00:13 [notice] 1#1: start worker processes
2022/10/21 18:00:13 [notice] 1#1: start worker process 11
2022/10/21 18:00:13 [notice] 1#1: start worker process 12

ubuntu@ip-172-31-9-182:~$ POD=$(kubectl get pod -n kube-system -l app=local-path-provisioner -o jsonpath='{.items[].metadata.name}'); while true; do kubectl exec -n kube-system $POD -- ls -la > /dev/null || break; done

sourcehawk commented 1 year ago

I'm still seeing this bug in version 1.25.2 of k3s. It happens randomly, every 2-3 runs or so, in a deployment pipeline of mine that deploys a Helm chart managing a TLS certificate.

Error: UPGRADE FAILED: cannot patch "production-wildcard-certificate" with kind Certificate:
   Internal error occurred:
     failed calling webhook "webhook.cert-manager.io":
       failed to call webhook:
         Post "https://cert-manager-webhook.cert-manager.svc:443/validate?timeout=10s": EOF

brandond commented 1 year ago

@hauks96 v1.25.2 is almost a year old at this point; it is certainly from before any of the above-linked PRs that fixed the issue were merged. Please update to a newer release, or at least the latest 1.25 patch release.