principekiss closed this issue 2 years ago
This morning, running the exact same code (I used git status to verify), 2 worker pools registered successfully and 1 got stuck with the following status:
Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Yesterday, after creating an entirely new Rancher cluster and downstream RKE cluster, I saw the same behavior with the downstream cluster agents described above, and I was able to retrieve logs from the Rancher cluster controller pods with the TRACE log level:
These lines are about the stuck nodes, system1 and general1, of the downstream cluster:
2022/08/22 14:45:36 [TRACE] dialerFactory: Skipping node [system1] for tunneling the cluster connection because nodeConditions are not as expected
2022/08/22 14:45:36 [TRACE] dialerFactory: Skipping node [general1] for tunneling the cluster connection because nodeConditions are not as expected
And in the last line of this screenshot (coming from the Rancher cluster controller logs I pasted above):
When I manually added the stuck worker nodes to the load balancer backend pool, all of them got registered.
To me, this looks like a Rancher bug. Even after all nodes are registered, node scaling does not work because Rancher does not ask Azure to add newly created nodes to the load balancer backend pool, which leaves the added nodes stuck in a "Provisioning" state. I assume the network security groups are correct; I checked the open ports on the nodes.
This happens because Rancher only adds the first worker node to the load balancer backend pool after the cluster is successfully created, to meet the minimum requirement of a cluster with 1 worker node plus the initial etcd + control plane nodes. Once the cluster is active (as soon as the first worker node is registered), it stops adding further worker nodes to the load balancer backend pool, so no scaling is possible when using an external load balancer.
I should be able to use the user addon to create a Service of type LoadBalancer pointing to the ingress controller, have all my nodes registered, and have node scaling work.
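For context, the kind of addon I mean is roughly the Service below. This is only a sketch, not the exact manifest attached later in this issue; the namespace and selector assume the default RKE nginx ingress controller deployment and may differ in practice:

```yaml
# Sketch of a user addon exposing the RKE nginx ingress controller
# through a cloud load balancer. Namespace and selector labels are
# assumptions based on the default RKE ingress deployment.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-lb
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  selector:
    app: ingress-nginx
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443
```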
I believe Kubernetes is responsible for adding nodes to the backend pool, but it can only do that if the Azure cloud provider is enabled properly.
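For what it's worth, in RKE the Azure cloud provider is enabled through the cloud_provider section of the cluster configuration. A rough sketch, with all values as placeholders (see the RKE Azure cloud provider docs for the full set of fields):

```yaml
# Sketch of enabling the Azure cloud provider in an RKE cluster config.
# Every value below is a placeholder.
cloud_provider:
  name: azure
  azureCloudProvider:
    aadClientId: "<service-principal-app-id>"
    aadClientSecret: "<service-principal-secret>"
    subscriptionId: "<azure-subscription-id>"
    tenantId: "<azure-tenant-id>"
    resourceGroup: "<resource-group-of-the-nodes>"
    subnetName: "<subnet-name>"
    vnetName: "<vnet-name>"
    securityGroupName: "<network-security-group-name>"
```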
I'm not an expert in this area, but I was looking up the docs about the Azure cloud provider for RKE https://rancher.com/docs/rke/latest/en/config-options/cloud-providers/azure/#overriding-the-hostname and searching for anything that might be missing from your configuration. I found this in the linked doc:
Since the Azure node name must match the Kubernetes node name, you override the Kubernetes name on the node by setting the hostname_override for each node.
Maybe the problem is that the hostname_override has not been set.
Hi, thanks for your answer! I already looked at this in the doc previously, but that argument is not present in the rancher2 Terraform provider for the node_template resource, and I haven't found any equivalent.
The hostname_override is in the cluster resource: https://registry.terraform.io/providers/rancher/rancher2/latest/docs/resources/cluster#hostname_override. Basically, that value would have been configured in the cluster config file if you were provisioning the RKE cluster with the RKE CLI, and the cluster resource in Terraform is the equivalent of that file.
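In an RKE CLI cluster.yml it would look roughly like this (addresses, user, and hostnames below are just placeholders):

```yaml
# Sketch of per-node hostname_override in an RKE cluster.yml.
# Addresses, user, and hostnames are placeholders.
nodes:
  - address: 10.0.1.10
    internal_address: 10.0.1.10
    hostname_override: worker-general-1
    user: azureuser
    role:
      - worker
  - address: 10.0.1.11
    internal_address: 10.0.1.11
    hostname_override: worker-system-1
    user: azureuser
    role:
      - worker
```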
Indeed, but I use node templates and node pools; I do not define nodes in the rancher2_cluster resource. I use node templates/pools so that I can scale nodes from the Rancher UI.
If I deploy the external load balancer only after all nodes are registered and the cluster is active, it adds all of them to the backend pool, but then scaling does not work because nodes are only added to the backend pool after they are registered. Nodes should be added to the backend pool right after they are created, not after they are registered, because they can only register once they are in the load balancer backend pool.
Rancher Server Setup
Information about the Cluster
User Information
Describe the bug
When creating the downstream RKE cluster in Azure using node pools (1 master pool with etcd + control plane roles and 3 worker pools), the master gets created, then the load balancer is created from the user addon. The master gets registered and added to the load balancer backend pool. Then the 1st worker is also registered, but the load balancer is very likely not active in the virtual network yet, so the worker still has a working gateway, which allows it to register.
Meanwhile, the load balancer finally becomes active, and any new workers don't get the load balancer as their gateway, which breaks their registration. The other worker nodes get stuck in the "Registering" state, and any node (master and/or worker) added through the Rancher UI scaling feature gets stuck in "IP Resolved" until it times out and gets deleted.
So the logical order would be to create the load balancer first, and for Rancher to actually wait for/verify that it is active in the virtual network before it starts adding nodes (masters and/or workers).
That is not being done, which makes me believe there is a logic bug in Rancher itself.
To Reproduce
Result
Only the initial master nodes and the first worker node are registered into the Kubernetes cluster. The other worker nodes get stuck in the "Registering" state, and no additional nodes can be added using the Rancher UI; they get stuck in "IP Resolved".
Expected Result
All nodes are registered and I can scale up nodes through the Rancher UI.
Screenshots
Additional context
The following code uses Terraform to create the downstream RKE1 cluster with 1 master node pool (control plane + etcd) and 3 worker pools (system, kafka, and general), with an external load balancer created from a user addon job:
Addon used to expose the ingress controller using a cloud load balancer:
Getting nodes and pods with the Rancher CLI:
Provisioning Log for the cluster
/etc/resolv.conf
All nodes have the same DNS config.
rancher-agent container logs
Rancher agent logs of stuck nodes in "Registering" state.