kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Too many endpoints in the kubernetes service #12627

Closed fvasco closed 2 years ago

fvasco commented 2 years ago

/kind bug

1. What kops version are you running?

Version 1.21.2 (git-f86388fb1ec8872b0ca1819cf98f84d18f7263a4)

2. What Kubernetes version are you running?

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:38:50Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:04:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using? AWS

Hello, we have a connectivity issue with our pods.

We currently see too many IPs in the kubernetes endpoints; many of them refer to terminated masters.

$ kubectl describe service kubernetes -n default
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                100.64.0.1
IPs:               100.64.0.1
Port:              https  443/TCP
TargetPort:        443/TCP
Endpoints:         172.31.20.192:443,172.31.23.72:443,172.31.31.210:443 + 6 more...
Session Affinity:  ClientIP
Events:            <none>
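
A quick way to cross-check the mismatch (a sketch, not part of the original report) is to compare the endpoint IPs with the InternalIP of each current control-plane node; the label selector below assumes the default master label used by kOps at this version:

# List the IPs behind the default kubernetes service.
kubectl get endpoints kubernetes -n default -o jsonpath='{.subsets[*].addresses[*].ip}'; echo
# List the InternalIP of every current control-plane node for comparison.
kubectl get nodes -l node-role.kubernetes.io/master \
  -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'; echo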

The iptables NAT table lists all of them:

# iptables -L -v -n -t nat
Chain KUBE-SVC-NPX46M4PTMTKRN6Y (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    2   120 KUBE-SEP-XXDHYK5XOL7C2QMK  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/kubernetes:https */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-XXDHYK5XOL7C2QMK side: source mask: 255.255.255.255
    0     0 KUBE-SEP-T2U4L34UORPF3KEV  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/kubernetes:https */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-T2U4L34UORPF3KEV side: source mask: 255.255.255.255
    0     0 KUBE-SEP-H7IE5EIZNU7MRD2J  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/kubernetes:https */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-H7IE5EIZNU7MRD2J side: source mask: 255.255.255.255
    0     0 KUBE-SEP-IKGK4FJJWDT2DVXT  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/kubernetes:https */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-IKGK4FJJWDT2DVXT side: source mask: 255.255.255.255
    0     0 KUBE-SEP-EQDDZXJQDFIUJYBY  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/kubernetes:https */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-EQDDZXJQDFIUJYBY side: source mask: 255.255.255.255
    0     0 KUBE-SEP-ROWRN2NBYRGLX5XA  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/kubernetes:https */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-ROWRN2NBYRGLX5XA side: source mask: 255.255.255.255
    0     0 KUBE-SEP-KWMDMS2VX7LWBYQL  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/kubernetes:https */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-KWMDMS2VX7LWBYQL side: source mask: 255.255.255.255
    0     0 KUBE-SEP-T77HFFV4XKKCYE7Z  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/kubernetes:https */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-T77HFFV4XKKCYE7Z side: source mask: 255.255.255.255
    0     0 KUBE-SEP-ZWM4VW5XILR6J3V3  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/kubernetes:https */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-ZWM4VW5XILR6J3V3 side: source mask: 255.255.255.255

The first IP is currently unreachable, and we see errors when a pod is starting: Get https://100.64.0.1:443/api?timeout=32s: dial tcp 100.64.0.1:443: i/o timeout
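
As a hedged diagnostic (the three IPs below are the ones visible in the output above; the remaining six are truncated in the original), each endpoint can be probed directly from a node: any HTTP response means the backend is reachable, while a timeout indicates a stale entry pointing at a terminated master.

for ip in 172.31.20.192 172.31.23.72 172.31.31.210; do
  if timeout 3 curl -sk -o /dev/null "https://$ip:443/healthz"; then
    echo "$ip reachable"
  else
    echo "$ip unreachable (likely a stale endpoint)"
  fi
done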

Is it possible to optimize the service and shrink the list to only the available masters?

7. Please provide your cluster manifest.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
spec:
  additionalPolicies:
    node: |
      [
      { "Action": "sts:AssumeRole", "Effect": "Allow", "Resource": "*" },
      { "Action": "ec2:AssociateAddress", "Effect": "Allow", "Resource": "*" },
      { "Action": "ec2:AttachVolume", "Effect": "Allow", "Resource": "*" },
      { "Action": "ec2:DetachVolume", "Effect": "Allow", "Resource": "*" },
      { "Action": "ec2:ModifyInstanceAttribute", "Effect": "Allow", "Resource": "*" }
      ]
  api:
    dns: {}
  authorization:
    alwaysAllow: {}
  certManager:
    enabled: true
  channel: stable
  cloudProvider: aws
  configBase: s3://kops-xxx/xxx
  containerRuntime: containerd
  dnsZone: xxx
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeDNS:
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    volumeStatsAggPeriod: 0s
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.21.5
  masterInternalName: k8s.internal.xxx
  masterPublicName: k8s.xxx
  metricsServer:
    enabled: true
  networkCIDR: 172.31.0.0/16
  networkID: xxx
  networking:
    amazonvpc: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  subnets:
    ...
  topology:
    dns:
      type: Public
    masters: public
    nodes: public

Thank you in advance for any help, Francesco

erismaster commented 2 years ago

Also seeing this after upgrading a 1.21 cluster to 1.22. Our 1.21 cluster has the expected number of endpoints, but the 1.22 cluster does indeed have additional endpoints. We are experiencing the same issue with timeouts, which causes extremely slow pod startup due to network timeouts while setting up the sandbox.

erismaster commented 2 years ago

@fvasco your issue above got me on the right track to fix my problem. I had been able to reproduce the timeouts but hadn't quite figured out why yet, as everything looked okay on the cluster and infra.

I had 5 control plane nodes listed in etcd, which is where the default kubernetes service populates its endpoints from.

During my upgrade from 1.21 to 1.22, the etcd upgrade was stuck going to 3.5 and I terminated all 3 masters and let them come back (effectively performing the etcd restore process). This allowed etcd to proceed but then I was hit with the random network failures.

Following the docs to find and delete the additional master leases in etcd resolved the issue for me.

fvasco commented 2 years ago

Happy to hear that, @erismaster. Can you share some useful links or steps to delete the stale masters on our cluster? This would greatly improve our situation.

Thank you again, Francesco

fvasco commented 2 years ago

I can confirm that we detected this issue after a failed upgrade from kops 1.21.2 to 1.22.1; a full rollback did not fix it for us.

If I remember correctly, our endpoint list has contained extra IPs for a while, but this was not an issue, or at least not a noticeable problem.

This issue is pretty hard to pin down because the failures are fairly random: changing the CNI, availability zone, or other components may or may not help for a given node, and working nodes can start failing over time.

fvasco commented 2 years ago

The guide https://kops.sigs.k8s.io/operations/etcd_administration/ does not work for us; many of the pod commands don't work.

We logged into the right container using

CONTAINER=$(kubectl get pods -n kube-system | grep etcd-manager-main | head -n 1 | awk '{print $1}')
kubectl exec -it -n kube-system $CONTAINER bash

Then we deleted the stale IPs using commands like:

# DIRNAME=/opt/etcd-v3.4.13-linux-amd64/etcd
# export ETCDCTL_API=3
# alias etcdctl='$DIRNAME/etcdctl --cacert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt --cert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.crt --key=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.key --endpoints=https://127.0.0.1:4001'
# etcdctl get --prefix /registry/masterleases
# etcdctl del /registry/masterleases/172.31.20.192
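
As a quick sanity check after the deletions (a sketch, not part of the original commands), the endpoint list should shrink back to only the live masters once the apiserver endpoint reconciler runs again:

kubectl get endpoints kubernetes -n default -o wide
kubectl describe service kubernetes -n default | grep Endpoints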

We are keeping an eye on this page to understand whether an upgrade to kops 1.22 is safe for us.

olemarkus commented 2 years ago

I suspect you are hitting https://github.com/kubernetes/kubernetes/issues/86812 This problem is also mentioned here: https://kops.sigs.k8s.io/operations/troubleshoot/#api-server-hangs-after-etcd-restore

We know this often happens when recovering from a backup, but we have not seen it when upgrading etcd, even though upgrading etcd is not too different from a backup restore.

I am afraid there is not much we can do from the kOps side here.

fvasco commented 2 years ago

Is it possible to remove the master IP proactively during a cluster rolling update? A terminated master's IP could be removed from the pool immediately after the node is removed, without waiting for the lease to time out.
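
A hypothetical sketch of what that proactive step could look like, reusing the etcdctl alias from the earlier comment (TERMINATED_IP is a placeholder, set here to the stale IP reported above):

# Hypothetical: immediately after a control-plane node is terminated, delete its
# master lease so the endpoint reconciler stops advertising it.
TERMINATED_IP=172.31.20.192
etcdctl del "/registry/masterleases/${TERMINATED_IP}"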

olemarkus commented 2 years ago

I don't think that is exactly what's happening here. Upgrading etcd is something along the lines of "take a backup, create a new cluster from the backup, and when all nodes are upgraded, hot-swap the etcd cluster". So the backup probably still has the old leases, the new apiservers write new leases, and the old leases are never cleaned up.

So kOps could add something to kops-controller that cleans up old leases from etcd. But I really do not understand why they are still considered valid... The mentioned bug has been around for a long time without any real fix.

fvasco commented 2 years ago

We found these entries in etcd

# etcdctl get --prefix /registry/masterleases                                                                             
/registry/masterleases/172.31.20.192
k8s

v1      Endpoints)

▒"*28lBz

172.31.20.192▒▒"
/registry/masterleases/172.31.23.72
k8s

v1      Endpoints)

▒"*28�uBz

172.31.23.72▒▒"
/registry/masterleases/172.31.31.210
k8s

v1      Endpoints)

▒"*28Bz

172.31.31.210▒▒"

and so on; many of them are invalid. This bug affected our production environment for a week: masters and nodes were unable to communicate, health checks failed, and our services were down. I understand that handling this in kOps would be a workaround; unfortunately, this issue is hard to detect and can hurt the entire cluster, so I hope it can be given appropriate consideration.

olemarkus commented 2 years ago

What kind of lease TTL do you see on the incorrect items? Technically these should have timed out long ago.

fvasco commented 2 years ago

Yes @olemarkus, they should have expired days ago. We have permanently deleted them, so I can no longer check their TTLs.

olemarkus commented 2 years ago

Can you add your input to the upstream issue? Meanwhile, I can try to reproduce this, but it may not be that easy.

fvasco commented 2 years ago

Thank you for your effort @olemarkus! As I said, it is possible the issue was already present from previous upgrades.

However, I have to admit that I am not a Kubernetes champion, so I reached this issue the hard way (by digging into a Linux networking problem). In our experience, and in the absence of a better proposal, an informational message in kops validate cluster would be very helpful: a cluster with more kubernetes endpoints than configured masters isn't really valid.

So, I can imagine the rule as:

IF kubernetes endpoint count > configured master count PRINT: The "kubernetes" service contains too many endpoints; this can cause connection issues to the kubernetes service host:port. If the problem persists, see permalink
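
A minimal sketch of that proposed check (this is not an existing kops validate rule; the label selector is an assumption based on the default kOps master label):

# Count the kubernetes service endpoints and the current control-plane nodes.
endpoints=$(kubectl get endpoints kubernetes -n default \
  -o jsonpath='{.subsets[*].addresses[*].ip}' | wc -w)
masters=$(kubectl get nodes -l node-role.kubernetes.io/master --no-headers | wc -l)
if [ "$endpoints" -gt "$masters" ]; then
  echo "The \"kubernetes\" service contains too many endpoints ($endpoints endpoints, $masters masters);"
  echo "this can cause connection issues to the kubernetes service host:port."
fi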

Moreover, I suggest changing the first sentence of "API Server hangs after etcd restore" to: "After resizing an etcd cluster, restoring a backup, or updating kOps".

Finally, if this issue can occur during kOps manual operations in general, a tool like kops toolbox cleanup-etcd could greatly improve the user experience.
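
A hypothetical sketch of what such a cleanup could do (kops toolbox cleanup-etcd does not exist; this reuses the etcdctl alias and the kubectl label selector assumed earlier): delete any master lease whose IP does not belong to a current control-plane node.

# Hypothetical cleanup: keep leases of live masters, delete the rest.
valid_ips=$(kubectl get nodes -l node-role.kubernetes.io/master \
  -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}')
for key in $(etcdctl get --prefix --keys-only /registry/masterleases/); do
  ip=${key##*/}                     # the lease key ends with the master's IP
  case " $valid_ips " in
    *" $ip "*) echo "keeping lease for live master $ip" ;;
    *)         echo "deleting stale lease $key"; etcdctl del "$key" ;;
  esac
done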

voriol commented 2 years ago

I have the same issue: when I upgraded a cluster from 1.21 to 1.22, my kubernetes endpoint had more servers than configured masters.

I wanted to apply the troubleshooting steps to delete the master leases, but my etcd-manager-main pod has different certificates configured:

# ls /rootfs/etc/kubernetes/pki/                  
etcd-manager-events  etcd-manager-main

and the etcdctl connection fails with these certificates:

# ./etcdctl --cacert=/rootfs/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt --cert=/rootfs/etc/kubernetes/pki/etcd-manager-main/etcd-manager-client-etcd-a.crt --key=/rootfs/etc/kubernetes/pki/etcd-manager-main/etcd-manager-client-etcd-a.key --endpoints=https://127.0.0.1:4001 del --prefix /registry/masterleases/
{"level":"warn","ts":"2021-11-05T07:05:01.755Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002e0a80/#initially=[https://127.0.0.1:4001]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: authentication handshake failed: remote error: tls: bad certificate\""}

Finally I found the kube-apiserver certificates on the control-plane machine, and the troubleshooting steps worked perfectly.
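
For reference, a sketch of the same lease listing run from the control-plane host rather than the pod; it assumes etcdctl is available on the host and that the host-side certificate paths mirror the /rootfs/... paths shown earlier in the thread (both are assumptions, not confirmed in this comment):

./etcdctl \
  --cacert=/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt \
  --cert=/etc/kubernetes/pki/kube-apiserver/etcd-client.crt \
  --key=/etc/kubernetes/pki/kube-apiserver/etcd-client.key \
  --endpoints=https://127.0.0.1:4001 \
  get --prefix --keys-only /registry/masterleases/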

fvasco commented 2 years ago

Hi, @olemarkus, any news on this issue?

johngmyers commented 2 years ago

Per Office Hours, will cut a new etcd-manager and cherrypick it to 1.22 branch.

fvasco commented 2 years ago

Great news @johngmyers, we hope it will fix it!

Kristjanf-droid commented 2 years ago

Per Office Hours, will cut a new etcd-manager and cherrypick it to 1.22 branch.

Was this fixed in kops v.1.22.2?

voriol commented 2 years ago

In my case it was not. I had to apply the troubleshooting steps, but from the control-plane machine rather than the etcd-manager-main pod, because the kube-apiserver certificates are not present in the pod (or I couldn't find them).

olemarkus commented 2 years ago

Yes, this should have been fixed in 1.22.2, at least for the known causes of this behavior.

The kube-apiserver certificates are not present in the etcd-manager pods, but you can connect using the certificates that should be in /etc/kubernetes/pki.

simonccc commented 2 years ago

I also spent several hours battling this problem yesterday. Very happy to have found this thread, but it took me a while to find the right certificates to use, as I guess the naming has changed.

I went into the etcd-manager-main pod - the certs were there for me in /etc/kubernetes/pki/etcd-manager

This was the right combination of paths and certs for me

alias etcdctl='$DIRNAME/etcdctl --cacert=/etc/kubernetes/pki/etcd-manager/etcd-clients-ca.crt --cert=/etc/kubernetes/pki/etcd-manager/etcd-clients-ca.crt --key=/etc/kubernetes/pki/etcd-manager/etcd-clients-ca.key --endpoints=https://127.0.0.1:4001'
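
With that alias in place, the same lease commands from earlier in the thread should work; <stale-master-ip> below is a placeholder for whichever IP no longer belongs to a live master:

etcdctl get --prefix --keys-only /registry/masterleases/
etcdctl del /registry/masterleases/<stale-master-ip>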

olemarkus commented 2 years ago

If the troubleshooting docs on this are incorrect, can you do a PR to update them with the correct paths?