kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Protokube Gossip causing network congestion #11064

Closed KetchupBomb closed 3 years ago

KetchupBomb commented 3 years ago

1. What kops version are you running? The command kops version will display this information.

$ kops version
Version 1.15.1

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

$ kubectl version --short
Client Version: v1.20.4-dirty
Server Version: v1.12.10

3. What cloud provider are you using?

AWS.

4. What commands did you run? What is the simplest way to reproduce this issue?

Protokube is containerized and attached to each node's host network. Some Docker inspection output is provided below; if more is needed, please ask (some information redacted):

$ sudo docker inspect $(sudo docker ps | grep -i protokube | awk '{print $1}') | jq '.[] | .State, .Config.Cmd'
{
  "Status": "running",
  "Running": true,
  "Paused": false,
  "Restarting": false,
  "OOMKilled": false,
  "Dead": false,
  "Pid": 6048,
  "ExitCode": 0,
  "Error": "",
  "StartedAt": "2021-03-12T06:03:59.498290524Z",
  "FinishedAt": "0001-01-01T00:00:00Z"
}
[
  "/usr/bin/protokube",
  "--channels=s3://k8s/production.k8s.local/addons/bootstrap-channel.yaml",
  "--cloud=aws",
  "--containerized=true",
  "--dns-internal-suffix=internal.production.k8s.local",
  "--dns=gossip",
  "--etcd-backup-store=s3://k8s/production.k8s.local/backups/etcd/main",
  "--etcd-image=k8s.gcr.io/etcd:3.3.13",
  "--initialize-rbac=true",
  "--manage-etcd=true",
  "--master=false",
  "--peer-ca=/srv/kubernetes/ca.crt",
  "--peer-cert=/srv/kubernetes/etcd-peer.pem",
  "--peer-key=/srv/kubernetes/etcd-peer-key.pem",
  "--tls-auth=true",
  "--tls-ca=/srv/kubernetes/ca.crt",
  "--tls-cert=/srv/kubernetes/etcd.pem",
  "--tls-key=/srv/kubernetes/etcd-key.pem",
  "--v=4"
]

This issue started happening recently, after our cluster scaled up to >1,000 nodes. We believe it's a scaling problem, so reproducing it is difficult.

5. What happened after the commands executed?

The container starts. It runs, opening connections to all other nodes, and all other nodes connect to it. Traffic is passed, and masters are discovered and persisted to /etc/hosts.
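
For context, the discovered masters end up as ordinary host entries, roughly of this shape (placeholder addresses, illustrative only):

$ grep api.internal /etc/hosts
203.0.113.10 api.internal.production.k8s.local
203.0.113.11 api.internal.production.k8s.local
203.0.113.12 api.internal.production.k8s.local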

Within <10 minutes, we start to see cluster degradation: TCP receive windows filling up, TCP resets, and data transfers on paths with RTT >10 ms slowing down.
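
For anyone trying to confirm similar symptoms, the receive-window pressure is visible with standard tools; a sketch (not actual output from our cluster):

# Per-connection window and socket-memory details for protokube's connections
$ sudo ss -timp | grep -A1 protokube

# Host-wide counters that climb when receive buffers are overrun
$ netstat -s | grep -Ei 'prune|collapse'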

6. What did you expect to happen?

Protokube starts and performs its function as expected, but nodes in the cluster eventually become degraded, affecting other network traffic.

We do not expect this to happen: it caused network operations with RTT of 80 ms or higher to slow from 30 MB/sec to 300 KB/sec.
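
As a rough bandwidth-delay-product check on those numbers:

# in-flight data ≈ throughput × RTT
#  30 MB/s × 0.08 s ≈ 2.4 MB in flight (healthy)
# 300 KB/s × 0.08 s ≈  24 KB in flight (consistent with a squeezed receive window)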

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

This issue is happening across a few clusters, each consisting of >1k large nodes. Specific numbers redacted.

8. Please run the commands with the most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

# ~60ms RTT
$ ping -c 3 public-dev-test-delete-me.s3-us-west-2.amazonaws.com
PING s3-us-west-2-r-w.amazonaws.com (52.218.229.113) 56(84) bytes of data.
64 bytes from s3-us-west-2-r-w.amazonaws.com (52.218.229.113): icmp_seq=1 ttl=40 time=63.8 ms
64 bytes from s3-us-west-2-r-w.amazonaws.com (52.218.229.113): icmp_seq=2 ttl=40 time=62.6 ms
64 bytes from s3-us-west-2-r-w.amazonaws.com (52.218.229.113): icmp_seq=3 ttl=40 time=62.6 ms

# Protokube is running
$ sudo docker ps | grep -i protokube ; pgrep protokube
d1ad7cbdef2a        protokube:1.15.1                                                             "/usr/bin/protokube …"   4 days ago          Up 4 days                               musing_proskuriakova
6048

# Downloading a file is slow
$ curl https://public-dev-test-delete-me.s3-us-west-2.amazonaws.com/testfile > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
 15 45.6M   15 7277k    0     0   333k      0  0:02:19  0:00:21  0:01:58  347k^C

# Pausing Protokube makes the same download fast again
$ sudo docker pause $(sudo docker ps | grep -i protokube | awk '{print $1}')
$ curl https://public-dev-test-delete-me.s3-us-west-2.amazonaws.com/testfile > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 76.6M  100 76.6M    0     0  32.2M      0  0:00:02  0:00:02 --:--:-- 34.6M

9. Anything else we need to know?

A third-party blog post describes a nearly identical experience to ours.

olemarkus commented 3 years ago

This is a tricky one ...

The kOps version you are using is somewhat old, and newer versions have updated dependencies, e.g. the mesh libraries. Reproducing something with 1,000+ nodes is also rather hard for us maintainers.

There is also https://github.com/kubernetes/kops/issues/7427, which led to https://github.com/kubernetes/kops/pull/7521. The milestone says 1.15, but it was actually released in 1.16.

Also be aware that the 1.16 upgrade had an issue with respect to gossip: https://github.com/kubernetes/kops/issues/8771

KetchupBomb commented 3 years ago

Agreed on both points: we're running an older version of kops, and reproducing with such cluster sizes is not feasible for most. We are in the process of upgrading kops to ~1.18, and we have long-term plans to front the masters with a VIP, since master discovery is our primary use case for Protokube (rather than paying the cost of Gossip across >1k nodes just for master discovery).

We mainly wanted to flag this formally, since we couldn't find an existing open or closed GitHub issue. If anyone has suggestions for immediate mitigation, we'd welcome them. (For now, we're increasing TCP buffer sizes to mitigate, as sketched below.)
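
The knobs we're raising are roughly the following (illustrative values, not a recommendation):

# Raise the hard ceilings and the auto-tuning maximums for TCP buffers
$ sudo sysctl -w net.core.rmem_max=16777216
$ sudo sysctl -w net.core.wmem_max=16777216
$ sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
$ sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"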

olemarkus commented 3 years ago

I'd try to upgrade to 1.16+ and see if the alternative mesh implementation improves things.
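
For reference, the usual flow per minor version (kOps recommends stepping through one minor version at a time) looks roughly like this:

$ kops upgrade cluster --name production.k8s.local --yes
$ kops update cluster --name production.k8s.local --yes
$ kops rolling-update cluster --name production.k8s.local --yes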

How do you plan on implementing VIP for the masters on AWS?

Worth mentioning that upcoming versions of kOps include several other important areas for scaling the control plane, such as dedicated API server nodes and etcd. This is especially useful for larger clusters.

KetchupBomb commented 3 years ago

@olemarkus: I'd try to upgrade to 1.16+ and see if the alternative mesh implementation improves things.

Yep, we're planning on going to 1.18. Optimistically, this will fix the Gossip issues, but even if it doesn't, we're rethinking our master-discovery mechanism.

@olemarkus: How do you plan on implementing VIP for the masters on AWS?

Still evaluating options, and I'm still ramping up on kOps, but I'm interested in pursuing the following options, roughly in order:

  1. A layer 7 or layer 4 ELB in front of auto-scaling groups that enforce master instance counts.
  2. A DNS solution where resolving the name answers with either (a rough Route 53 sketch follows this list):
    • n A records, one for each master, or
    • one A record per query, round-robining between the masters.
  3. Assigning an IP to each master and leveraging ECMP to load balance.
  4. Configuring VRRP/keepalived to implement a proper VIP.
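
For the DNS option, a rough sketch of the "n A records" variant with Route 53 (hypothetical zone ID, domain, and addresses):

$ aws route53 change-resource-record-sets \
    --hosted-zone-id ZEXAMPLE123 \
    --change-batch '{
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "api.internal.production.example.com",
          "Type": "A",
          "TTL": 60,
          "ResourceRecords": [
            {"Value": "203.0.113.10"},
            {"Value": "203.0.113.11"},
            {"Value": "203.0.113.12"}
          ]
        }
      }]
    }'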

I'll come back to this issue and comment with a summary of what we end up doing.

olemarkus commented 3 years ago

Thanks.

Layer 7 load balancing is a bit tricky because of TLS; a layer 4 LB or DNS is what kOps uses if you don't use gossip.
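
For what it's worth, a non-gossip cluster looks roughly like this at creation time (hypothetical domain and state bucket; exact flags depend on the kOps version):

$ kops create cluster \
    --name production.example.com \
    --dns-zone example.com \
    --zones us-west-2a,us-west-2b,us-west-2c \
    --master-count 3 \
    --api-loadbalancer-type public \
    --state s3://k8s-example-state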

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale