Aug 3 10:06:00 production-island10-overcloud-controlplane-3 rke2[2020]: time="2022-08-03T10:06:00Z" level=info msg="certificate CN=production-island10-overcloud-worker-48 signed by CN=rke2-server-ca@1659518912: notBefore=2022-08-03 09:28:32 +0000 UTC notAfter=2023-08-03 10:06:00 +0000 UTC"
This log entry is for the RKE2 server issuing the agent node's kubelet certificates. These certificates are renewed every time the agent starts. Since you see it being issued repeatedly, I suspect that rke2 agent is crashing, restarting, and requesting new certificates, in a loop. What do the logs from the agent node show?
I might also suggest that you not try to add all the nodes in parallel, at the exact same time. There's no good reason to put that much load on the system at once. Stagger the nodes by 15 to 30 seconds at least so that you don't overwhelm the control-plane, datastore, and image registry.
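For illustration, staggering the joins could look something like this rough shell sketch; the worker IP list, SSH user, and REGISTRATION_CMD below are placeholders for whatever mechanism actually kicks off system-agent-install.sh on each node:

```bash
# Rough sketch: start the agent install on each worker ~20 seconds apart
# instead of all at once. REGISTRATION_CMD stands in for the registration
# command copied from Rancher; worker-ips.txt has one IP per line.
for ip in $(cat worker-ips.txt); do
  ssh -n "ubuntu@${ip}" "${REGISTRATION_CMD}" &
  sleep 20   # stagger each node by 15-30 seconds
done
wait
```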
I suspect that rke2 agent is crashing, restarting, and requesting new certificates, in a loop. What do the logs from the agent node show?
On the rke2 agent side we see a log like this:
Aug 3 10:05:47 production-island10-overcloud-worker-48 rke2[1790]: time="2022-08-03T10:05:47Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Get \"https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
This record is repeated every 10 seconds, and the rke2 agent restarts after some number of attempts, though not after every one. There are no other records in the logs.
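For completeness, the full agent-side log on a stuck worker can be pulled with journalctl; this assumes the standard rke2-agent systemd unit set up by the install script:

```bash
# Dump the recent rke2 agent logs on a stuck worker node.
journalctl -u rke2-agent --since "1 hour ago" --no-pager
```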
I might also suggest that you not try to add all the nodes in parallel, at the exact same time.
It isn't actually adding 100 nodes in parallel; we use for_each in Terraform over the list of nodes and run curl via ssh to execute the system-agent-install.sh script, so there is a gap of a few seconds between each node.
Rancher has controllers that handle node provisioning; once the number of nodes exceeds the number of controller workers, the remaining nodes take a long time to get through the queue. Currently this value is 50 and cannot be changed.
https://github.com/rancher/rancher/blob/release/v2.6/pkg/types/config/context.go
Thank you for the reply, @niusmallnan. Is there any chance/plan to make this a configuration option or increase it to a bigger value? What is the reason for hard-coding it?
Just to be clear, @vriabyk said:
We use rancher2 tf provider and remote exec to run curl from each node to trigger system-agent-install.sh
which means that I would not expect you to be subject to the limits of the node provisioning controller - since you're provisioning a custom cluster and creating / installing the nodes yourself. Is that correct @niusmallnan?
Aug 3 10:05:47 production-island10-overcloud-worker-48 rke2[1790]: time="2022-08-03T10:05:47Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Get \"https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
That just sounds like the server is currently overloaded and unable to respond to the client in a timely manner. Can you provide system load figures from the etcd and control-plane nodes while this is going on?
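One way to capture such load figures on the etcd and control-plane nodes while the workers are joining, as a sketch (iostat requires the sysstat package):

```bash
# Sample load average, CPU/memory pressure, and disk I/O for about a minute
# while the agents are registering; run on each etcd/control-plane node.
uptime
vmstat 5 12        # run queue, memory, and CPU every 5 seconds
iostat -x 5 12     # per-device I/O utilization
```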
@brandond I think the approach @vriabyk is using corresponds to Launching Kubernetes on Existing Custom Nodes.
Right, but I don't think that limit is what's being hit, since the system agent is successfully checking in with Rancher and creating the node in the management cluster, and the system-agent plan to install rke2 and join the cluster is being created and run successfully. The hosts are getting stuck at the point where the rke2 agents are bootstrapping their kubelet, which is independent of anything on the Rancher side.
Hi @brandond, as I mentioned above, we don't observe any high load or spikes on the nodes during deployment or afterwards. We tried adding resources to the controlplane and etcd nodes - no change. With 50 workers it always works fine, but with 100 it adds only a few workers and the others get stuck. As you can see, there are 3k+ connections to port 9345, many of them in TIME_WAIT/CLOSE_WAIT. We checked network connectivity and latency - no issues, no packet loss. All nodes are in one L2 broadcast domain, so we are sure the network is good.
I will try to test to find the number of workers at which the problems start to appear.
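For reference, the connections to port 9345 mentioned above can be broken down by TCP state with ss; a sketch, run on the rke2 server node the agents are joining through:

```bash
# Count connections on the rke2 supervisor port (9345), grouped by TCP state.
ss -tan 'sport = :9345' | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn
```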
If that's the case then maybe you are in fact waiting on the cluster node controller as @niusmallnan suggested, and the other errors from the agents are just a red herring and not related to the 50-node limitation.
This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.
/reopen
Environmental Info:
RKE2 Version:
rke2 version v1.23.6+rke2r2 (40d712e5081ac87e30e8f328f738130acf2c31f8), go version go1.17.5b7. We also tried it on rke2 version v1.22.7+rke2r2.
Rancher Version: v2.6.6
rancher2 tf provider version: 1.24.0
Node(s) CPU architecture, OS, and Version:
Custom cluster type, Kubevirt Ubuntu VMs: Linux 5.4.0-122-generic #138-Ubuntu SMP Wed Jun 22 15:00:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
controlplane nodes: cpu = "8" memory = "32G"
etcd nodes: cpu = "4" memory = "16G"
worker nodes: cpu = "6" memory = "32G"
Just FYI, we don't observe any high load or spikes on the nodes during deployment or afterwards.
Cluster Configuration:
Deploying via the Rancher terraform provider: 3 controlplane nodes, 3 etcd nodes, and 100 workers.
Describe the bug:
When we try to deploy an rke2 cluster using Rancher and the rancher2 terraform provider with 100 workers or more, the deployment gets stuck configuring worker nodes. It manages to spawn and configure the etcd and controlplane nodes and even adds a few workers (seemingly random nodes).
All other worker nodes are stuck in the configuring state in Rancher. If we restart the rke2 server on the node that is defined in the rke2 agent configuration file, it may register several more workers.
For example, if we set 50 worker nodes, the deployment always completes successfully. We also tried adding workers in batches of 50; that also worked fine, and we were able to deploy 300 workers, which is our target. We didn't try deploying 60, 70, or 80 workers, so we don't really know the critical number that can be deployed at one time in parallel.
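A rough sketch of the batch-by-50 approach, assuming worker-ips.txt holds one worker IP per line, REGISTRATION_CMD is the registration command copied from Rancher, and kubectl is pointed at the new cluster:

```bash
# Hypothetical batching sketch: register workers 50 at a time and wait for each
# batch to report Ready before starting the next one. SERVERS is the number of
# etcd/controlplane nodes that are already Ready in the cluster.
BATCH=50
SERVERS=6
mapfile -t IPS < worker-ips.txt
TOTAL=${#IPS[@]}
for ((i = 0; i < TOTAL; i += BATCH)); do
  for ip in "${IPS[@]:i:BATCH}"; do
    ssh -n "ubuntu@${ip}" "${REGISTRATION_CMD}" &
  done
  wait
  expected=$(( (i + BATCH < TOTAL ? i + BATCH : TOTAL) + SERVERS ))
  # Poll until every node registered so far (plus the servers) is Ready.
  until [ "$(kubectl get nodes --no-headers | grep -cw Ready)" -ge "$expected" ]; do
    sleep 30
  done
done
```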
Steps To Reproduce:
Deploy an rke2 cluster with 100 or more workers in parallel via Rancher. We use the rancher2 tf provider and remote-exec to run curl from each node to trigger system-agent-install.sh, but I don't think this is the problem. You can add workers by any other method; the only requirement is to add that number of workers at one time (or almost at one time), so that the rke2 server has to process them in parallel.
Expected behavior:
The cluster gets deployed and reaches the Active state in Rancher, with all worker nodes registered. Our target is at least 300 workers deployed in parallel.
Actual behavior:
As mentioned above, the deployment gets stuck configuring most of the worker nodes.
Additional context / logs:
On the rke2 agent side, there are lots of records like this:
On the rke2 server side, the logs for the node are:
All rke2 agents try to connect to the first launched rke2 server node (in this case production-island10-overcloud-controlplane-3), and we see a lot of connections to port 9345:
To me it looks like the rke2 server generates some kind of "registration" certificate for the rke2 agent that is valid for only 20 seconds (looking at the rke2 server logs above, the notAfter field shifts by 20 seconds every time). My feeling is that when we deploy a huge number of nodes in parallel (100 or more), the rke2 server doesn't manage to reply to the rke2 agent requests in time, the certificate expires, and the client times out.
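For anyone digging into this, the validity window of the certificate the agent actually received can be checked directly on a worker with openssl; the path below assumes the default RKE2 data directory:

```bash
# Print the subject and the notBefore/notAfter dates of the kubelet serving
# certificate issued to this agent (default rke2 data dir assumed).
openssl x509 -in /var/lib/rancher/rke2/agent/serving-kubelet.crt -noout -subject -dates
```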
Any help or recommendation will be much appreciated!