rancher / rke2

https://docs.rke2.io/
Apache License 2.0

rke2 deployment gets stuck when deploying 100+ workers in parallel #3207

Closed. vriabyk closed this issue 1 year ago.

vriabyk commented 2 years ago

Environmental Info: RKE2 Version:

rke2 version v1.23.6+rke2r2 (40d712e5081ac87e30e8f328f738130acf2c31f8), go version go1.17.5b7. We also tried rke2 version v1.22.7+rke2r2.

Rancher Version: v2.6.6

rancher2 tf provider version: 1.24.0

Node(s) CPU architecture, OS, and Version:

Custom cluster type, Kubevirt Ubuntu VMs: Linux 5.4.0-122-generic #138-Ubuntu SMP Wed Jun 22 15:00:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

controlplane nodes: cpu = "8" memory = "32G"

etcd nodes: cpu = "4" memory = "16G"

worker nodes: cpu = "6" memory = "32G"

FYI, we don't observe any high load or spikes on the nodes during deployment or afterwards.

Cluster Configuration:

Deploying via the Rancher terraform provider: 3 controlplane nodes, 3 etcd nodes, and 100 workers

Describe the bug:

When we try to deploy an rke2 cluster using Rancher and the rancher2 terraform provider with 100 workers or more, the deployment gets stuck on configuring the worker nodes (screenshot attached). It manages to spawn and configure the etcd and controlplane nodes, and even adds a few workers (seemingly random ones):

# kubectl get nodes -o wide
NAME                                           STATUS   ROLES                  AGE   VERSION          INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
production-island10-overcloud-controlplane-1   Ready    control-plane,master   84m   v1.23.6+rke2r2   192.168.0.48    <none>        Ubuntu 20.04.4 LTS   5.4.0-122-generic   containerd://1.5.11-k3s2
production-island10-overcloud-controlplane-2   Ready    control-plane,master   84m   v1.23.6+rke2r2   192.168.0.49    <none>        Ubuntu 20.04.4 LTS   5.4.0-122-generic   containerd://1.5.11-k3s2
production-island10-overcloud-controlplane-3   Ready    control-plane,master   84m   v1.23.6+rke2r2   192.168.0.50    <none>        Ubuntu 20.04.4 LTS   5.4.0-122-generic   containerd://1.5.11-k3s2
production-island10-overcloud-etcd-1           Ready    etcd                   84m   v1.23.6+rke2r2   192.168.0.16    <none>        Ubuntu 20.04.4 LTS   5.4.0-122-generic   containerd://1.5.11-k3s2
production-island10-overcloud-etcd-2           Ready    etcd                   84m   v1.23.6+rke2r2   192.168.0.17    <none>        Ubuntu 20.04.4 LTS   5.4.0-122-generic   containerd://1.5.11-k3s2
production-island10-overcloud-etcd-3           Ready    etcd                   84m   v1.23.6+rke2r2   192.168.0.18    <none>        Ubuntu 20.04.4 LTS   5.4.0-122-generic   containerd://1.5.11-k3s2
production-island10-overcloud-worker-21        Ready    worker                 81m   v1.23.6+rke2r2   192.168.0.148   <none>        Ubuntu 20.04.4 LTS   5.4.0-122-generic   containerd://1.5.11-k3s2
production-island10-overcloud-worker-24        Ready    worker                 81m   v1.23.6+rke2r2   192.168.0.151   <none>        Ubuntu 20.04.4 LTS   5.4.0-122-generic   containerd://1.5.11-k3s2
production-island10-overcloud-worker-54        Ready    worker                 81m   v1.23.6+rke2r2   192.168.0.181   <none>        Ubuntu 20.04.4 LTS   5.4.0-122-generic   containerd://1.5.11-k3s2
production-island10-overcloud-worker-85        Ready    worker                 81m   v1.23.6+rke2r2   192.168.0.212   <none>        Ubuntu 20.04.4 LTS   5.4.0-122-generic   containerd://1.5.11-k3s2
production-island10-overcloud-worker-9         Ready    worker                 81m   v1.23.6+rke2r2   192.168.0.136   <none>        Ubuntu 20.04.4 LTS   5.4.0-122-generic   containerd://1.5.11-k3s2

But all other worker nodes are stuck in the configuring state in Rancher. If we restart the rke2 server on the node that is defined in the rke2 agent configuration file, it may register several new workers.
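
For reference, the "restart" here is just a restart of the rke2-server systemd unit on that server node (assuming a standard rke2 install):

# on the server node listed in the agents' config file
systemctl restart rke2-server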

For example, if we set 50 worker nodes, the deployment always completes successfully. We also tried adding workers in batches of 50; that also worked fine, and we were able to deploy 300 workers, which is our target. We didn't try 60, 70 or 80 workers, so we don't really know the critical number that can be deployed at one time in parallel.

Steps To Reproduce:

Deploy an rke2 cluster with 100 or more workers in parallel via Rancher. We use the rancher2 tf provider and remote-exec to run curl from each node to trigger system-agent-install.sh, but I don't think this is the problem. You can add workers by any other method; the only requirement is to add that number of workers at one time (or almost at one time), so that the rke2 server has to process them in parallel.
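
For context, the per-node registration is roughly of the following shape; the URL and token are placeholders and the exact flags come from the registration command Rancher generates for the custom cluster, so treat this as an illustration rather than our literal command:

# run on each worker node, e.g. via terraform remote-exec
curl -fL https://rancher.example.com/system-agent-install.sh | sudo sh -s - \
  --server https://rancher.example.com --token <registration-token> --worker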

Expected behavior:

The cluster gets deployed and reaches the Active state in Rancher, with all worker nodes registered. Our target is at least 300 workers deployed in parallel.

Actual behavior:

As mentioned above, the deployment gets stuck while configuring most of the worker nodes.

Additional context / logs:

On the rke2 agent side, there are lots of records like this:

Aug  3 10:05:47 production-island10-overcloud-worker-48 rke2[1790]: time="2022-08-03T10:05:47Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Get \"https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

On the rke2 server side, the logs for that node are:

Aug  3 10:05:20 production-island10-overcloud-controlplane-3 rke2[2020]: time="2022-08-03T10:05:20Z" level=info msg="certificate CN=production-island10-overcloud-worker-48 signed by CN=rke2-server-ca@1659518912: notBefore=2022-08-03 09:28:32 +0000 UTC notAfter=2023-08-03 10:05:20 +0000 UTC"
Aug  3 10:05:40 production-island10-overcloud-controlplane-3 rke2[2020]: time="2022-08-03T10:05:40Z" level=info msg="certificate CN=production-island10-overcloud-worker-48 signed by CN=rke2-server-ca@1659518912: notBefore=2022-08-03 09:28:32 +0000 UTC notAfter=2023-08-03 10:05:40 +0000 UTC"
Aug  3 10:06:00 production-island10-overcloud-controlplane-3 rke2[2020]: time="2022-08-03T10:06:00Z" level=info msg="certificate CN=production-island10-overcloud-worker-48 signed by CN=rke2-server-ca@1659518912: notBefore=2022-08-03 09:28:32 +0000 UTC notAfter=2023-08-03 10:06:00 +0000 UTC"

All rke2 agents try to connect to the first launched rke2 server node (in this case production-island10-overcloud-controlplane-3), and we see a lot of connections to port 9345:

# netstat -anp | grep ':9345' -c
3759

# netstat -anp | grep 192.168.0.175 -c # ip belongs to production-island10-overcloud-worker-48 
37

# netstat -anp | grep 192.168.0.175 # ip belongs to production-island10-overcloud-worker-48 
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36252     TIME_WAIT   -                   
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36104     CLOSE_WAIT  2020/rke2 server    
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36372     TIME_WAIT   -                   
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36280     TIME_WAIT   -                   
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36392     CLOSE_WAIT  2020/rke2 server    
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36340     TIME_WAIT   -                   
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36260     TIME_WAIT   -                   
tcp6       0      0 192.168.0.50:9345       192.168.0.175:35848     CLOSE_WAIT  2020/rke2 server    
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36256     TIME_WAIT   -                   
tcp6       1      0 192.168.0.50:9345       192.168.0.175:36328     CLOSE_WAIT  2020/rke2 server    
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36008     CLOSE_WAIT  2020/rke2 server    
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36072     CLOSE_WAIT  2020/rke2 server    
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36288     TIME_WAIT   -                   
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36316     TIME_WAIT   -                   
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36168     CLOSE_WAIT  2020/rke2 server    
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36344     TIME_WAIT   -                   
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36416     TIME_WAIT   -                   
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36348     TIME_WAIT   -                   
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36296     CLOSE_WAIT  2020/rke2 server    
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36248     TIME_WAIT   -                   
tcp6       0      0 192.168.0.50:9345       192.168.0.175:35976     CLOSE_WAIT  2020/rke2 server    
tcp6       0      0 192.168.0.50:9345       192.168.0.175:57652     ESTABLISHED 2020/rke2 server    
tcp6       1      0 192.168.0.50:9345       192.168.0.175:36232     CLOSE_WAIT  2020/rke2 server    
tcp6       0      0 192.168.0.50:9345       192.168.0.175:35880     CLOSE_WAIT  2020/rke2 server    
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36376     TIME_WAIT   -                   
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36200     CLOSE_WAIT  2020/rke2 server    
tcp6       0      0 192.168.0.50:9345       192.168.0.175:35912     CLOSE_WAIT  2020/rke2 server    
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36324     TIME_WAIT   -                   
tcp6       0      0 192.168.0.50:9345       192.168.0.175:35944     CLOSE_WAIT  2020/rke2 server    
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36292     TIME_WAIT   -                   
tcp6       1      0 192.168.0.50:9345       192.168.0.175:36264     CLOSE_WAIT  2020/rke2 server    
tcp6       1      0 192.168.0.50:9345       192.168.0.175:36136     CLOSE_WAIT  2020/rke2 server    
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36424     ESTABLISHED 2020/rke2 server    
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36388     TIME_WAIT   -                   
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36360     CLOSE_WAIT  2020/rke2 server    
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36356     TIME_WAIT   -                   
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36284     TIME_WAIT   -                   
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36352     TIME_WAIT   -                   
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36040     CLOSE_WAIT  2020/rke2 server    
tcp6       0      0 192.168.0.50:9345       192.168.0.175:36412     TIME_WAIT   -
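
To break those connections down by TCP state, a quick one-liner like this can be used on the server node (standard tooling, nothing rke2-specific):

# netstat -anp | grep ':9345' | awk '{print $6}' | sort | uniq -c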

To me it looks like the rke2 server generates some kind of "registration" certificate for the rke2 agent and keeps re-signing it every 20 seconds (in the rke2 server logs above, the notAfter field advances by 20 seconds each time). My feeling is that when we deploy a huge number of nodes in parallel (like 100 or more), the rke2 server doesn't manage to reply to the rke2 agent requests in time, the certificate request expires, and the client times out.
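
As a cross-check, the validity window of the kubelet serving certificate can be inspected on a worker that did manage to join; assuming the default rke2 data directory, something like:

# openssl x509 -in /var/lib/rancher/rke2/agent/serving-kubelet.crt -noout -subject -dates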

Any help or recommendation will be much appreciated!

brandond commented 2 years ago

Aug 3 10:06:00 production-island10-overcloud-controlplane-3 rke2[2020]: time="2022-08-03T10:06:00Z" level=info msg="certificate CN=production-island10-overcloud-worker-48 signed by CN=rke2-server-ca@1659518912: notBefore=2022-08-03 09:28:32 +0000 UTC notAfter=2023-08-03 10:06:00 +0000 UTC"

This log entry is for the RKE2 server issuing the agent node's kubelet certificates. These certificates are renewed every time the agent starts. Since you see it being issued repeatedly, I suspect that rke2 agent is crashing, restarting, and requesting new certificates, in a loop. What do the logs from the agent node show?

I might also suggest that you not try to add all the nodes in parallel, at the exact same time. There's no good reason to put that much load on the system at once. Stagger the nodes by 15 to 30 seconds at least so that you don't overwhelm the control-plane, datastore, and image registry.
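
A minimal sketch of such staggering, driving the same registration command from a plain shell loop (nodes.txt and JOIN_CMD are placeholders for your own inventory and registration command, not anyone's exact tooling):

# hypothetical wrapper: join workers one at a time with a delay between them
while read -r node; do
  ssh "$node" "$JOIN_CMD"   # the same curl | sh registration command as before
  sleep 30                  # give the control plane time to process each join
done < nodes.txt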

vriabyk commented 2 years ago

I suspect that rke2 agent is crashing, restarting, and requesting new certificates, in a loop. What do the logs from the agent node show?

On the rke2 agent side we see a log like this:

Aug  3 10:05:47 production-island10-overcloud-worker-48 rke2[1790]: time="2022-08-03T10:05:47Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Get \"https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

This record repeats every 10 seconds, and the rke2 agent restarts after some number of attempts, though not after every one. There are no other records in the logs.
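
For anyone reproducing this, the agent-side behaviour can be watched with the usual systemd tooling (assuming the standard rke2-agent unit):

# journalctl -u rke2-agent -f                 # follow the agent log
# systemctl show -p NRestarts rke2-agent      # how many times systemd has restarted it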

I might also suggest that you not try to add all the nodes in parallel, at the exact same time.

It isn't actually adding 100 nodes in parallel: we do a for_each in terraform over the list of nodes and run curl via ssh to execute the system-agent-install.sh script, so there is a gap of a few seconds between each node.

niusmallnan commented 2 years ago

Rancher has some controllers that handle node provisioning; once the number of nodes exceeds the number of controller workers, it takes a long time to work through the queue. Currently this value is 50 and cannot be changed.

https://github.com/rancher/rancher/blob/89fec60628df484c6a136e49ac5b49834df0b51d/pkg/wrangler/context.go#L167-L184

https://github.com/rancher/rancher/blob/release/v2.6/pkg/types/config/context.go

vriabyk commented 2 years ago

Thank you for the reply @niusmallnan. Any chance/plan to make it a configuration option or increase it to a bigger value? What is the reason for hardcoding it?

brandond commented 2 years ago

Just to be clear, @vriabyk said:

We use rancher2 tf provider and remote exec to run curl from each node to trigger system-agent-install.sh

which means that I would not expect you to be subject to the limits of the node provisioning controller - since you're provisioning a custom cluster and creating / installing the nodes yourself. Is that correct @niusmallnan?

Aug 3 10:05:47 production-island10-overcloud-worker-48 rke2[1790]: time="2022-08-03T10:05:47Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Get \"https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

That just sounds like the server is currently overloaded and unable to respond to the client in a timely manner. Can you provide system load figures from the etcd and control-plane nodes while this is going on?
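
A minimal set of commands that would capture those figures while the agents are joining (standard tools; iostat assumes the sysstat package is installed):

# run on each etcd and control-plane node during the join storm
uptime            # load averages
vmstat 5 5        # CPU, memory, and run-queue over ~25 seconds
iostat -x 5 3     # disk latency/utilization, which etcd is sensitive to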

niusmallnan commented 2 years ago

@brandond I think the way @vriabyk is deploying corresponds to Launching Kubernetes on Existing Custom Nodes.

brandond commented 2 years ago

Right, I don't think that's the limit though, since the system agent is successfully checking in with Rancher and creating the node in the management cluster, and the system-agent plan to install rke2 and join the cluster is being created and run successfully. The hosts are getting stuck at the point where the rke2 agents are bootstrapping their kubelet, which is entirely independent of anything on the Rancher side.

vriabyk commented 2 years ago

Hi @brandond, as I mentioned above, we don't observe any high load or spikes on the nodes during deployment or afterwards. We tried adding resources to the controlplane and etcd nodes - no change. With 50 workers it always works fine, but with 100 it adds only a few workers and the others get stuck. As you can see, the number of connections to port 9345 is 3k+, with a lot of TIME_WAIT/CLOSE_WAIT. We checked network connectivity and latency - no issues, no packet loss. All nodes are in one L2 broadcast domain, so we are sure it is good.
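
For anyone re-checking this, a basic latency/packet-loss check towards the first server node can be done with standard tools (192.168.0.50 is taken from the netstat output above; mtr is assumed to be installed):

# ping -c 100 -i 0.2 192.168.0.50 | tail -2
# mtr --report --report-cycles 100 --tcp --port 9345 192.168.0.50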

I will try to test to find the number of workers at which the problems start to appear.

brandond commented 2 years ago

If that's the case then maybe you are in fact waiting on the cluster node controller as @niusmallnan suggested, and the other errors from the agents are just a red herring and not related to the 50-node limitation.

stale[bot] commented 1 year ago

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

Sea-you commented 1 year ago

/reopen