EDIT:
So the issue was that when a VM is part of an Availability Set and another VM in the same Availability Set is in the backend pool of a public Load Balancer, all VMs in that Availability Set start using the LB public IP for outbound connections. The problem is that the VMs that are not part of the LB backend pool then cannot reach the internet, because they cannot use the LB for outbound traffic without being in the backend pool.
So we have to use a NAT Gateway attached to the subnet. That way, all VMs that are part of the Availability Set used by the LB but not part of the LB backend pool can still reach the internet through the NAT Gateway.
Normally, Rancher should first verify that the LB is active once it has been created (at the end of cluster creation), and only then add the nodes to it. But the issue here is also that all nodes are created before the cluster itself, and at that moment no LB exists. First, all nodes (masters and workers) are created, then Rancher starts installing components on the masters, and only at the end does it create the LB. So when the LB is created, the nodes already exist and have already started registering, installing components, etc., and the first registered nodes then end up inside the backend pool.
What happens is that the initial nodes manage to register because they use Azure's default outbound access to reach the internet (all private nodes on Azure have outbound internet access by default, as long as no node in the same Availability Set is inside a public LB backend pool).
But as soon as the LB is active, Rancher starts adding nodes to the backend pool, and as soon as the first node from that Availability Set is added, all the other nodes start using the LB public IP without being in the backend pool themselves. Rancher only adds registered nodes to the backend pool, and to register they need internet access to fetch their configuration, install components, and so on.
This means Rancher should actually wait for the LB to be active, then FIRST add the nodes to the LB backend pool, and only once they are part of it start installing Docker, pulling their configuration from the Rancher server, etc. Otherwise, the VMs try to reach the internet through the LB public IP but cannot use it for outbound traffic without being in the backend pool.
So if we add a NAT Gateway to that subnet, the nodes can use it until they have been added to the LB backend pool, at which point they use the LB for outbound traffic instead of the NAT Gateway.
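For reference, a minimal sketch of what attaching such a NAT Gateway to the node subnet could look like with the azurerm Terraform provider (not part of my original configuration; the `azurerm_resource_group.nodes` and `azurerm_subnet.nodes` references are placeholders for whatever already exists in your setup):

```hcl
# Sketch only: assumes an existing azurerm_resource_group.nodes and
# azurerm_subnet.nodes defined elsewhere in the configuration.

resource "azurerm_public_ip" "nat" {
  name                = "rke-nodes-nat-ip"
  location            = azurerm_resource_group.nodes.location
  resource_group_name = azurerm_resource_group.nodes.name
  allocation_method   = "Static"
  sku                 = "Standard"
}

resource "azurerm_nat_gateway" "nodes" {
  name                = "rke-nodes-nat-gw"
  location            = azurerm_resource_group.nodes.location
  resource_group_name = azurerm_resource_group.nodes.name
  sku_name            = "Standard"
}

resource "azurerm_nat_gateway_public_ip_association" "nodes" {
  nat_gateway_id       = azurerm_nat_gateway.nodes.id
  public_ip_address_id = azurerm_public_ip.nat.id
}

# Once the NAT Gateway is associated with the subnet, nodes that are not
# (yet) in the LB backend pool use it for outbound traffic instead of the
# LB public IP.
resource "azurerm_subnet_nat_gateway_association" "nodes" {
  subnet_id      = azurerm_subnet.nodes.id
  nat_gateway_id = azurerm_nat_gateway.nodes.id
}
```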
Rancher Server Setup
Information about the Cluster
User Information
Describe the bug
When creating the downstream RKE cluster in Azure using node pools (1 master pool with the etcd and control plane roles and 3 worker pools), the master gets created, then the load balancer is created from the user addon. The master gets registered, and the 1st and sometimes the 2nd worker also gets registered, because it is very likely that the load balancer is not yet active in the virtual network, so those workers still have a working gateway and are able to register.
Meanwhile, the load balancer finally becomes active, and any new workers don't get the load balancer as their gateway, which breaks their registration. The other worker nodes get stuck in the "Registering" state, and any worker node added through the Rancher UI scaling feature gets stuck in "IP Resolved" until it times out and gets deleted.
So the logic should be that the load balancer is created first, and that Rancher actually waits for and verifies that it is active in the virtual network before it starts adding nodes.
That is not being done, which makes me believe there is a logic bug in Rancher itself.
To Reproduce
Result
Only the initial master nodes and the first worker node are registered into the Kubernetes cluster. The other worker nodes get stuck in the "Registering" state, and no additional nodes can be added using the Rancher UI; they get stuck in "IP Resolved".
Expected Result
All nodes are registered and I can scale up nodes through the Rancher UI.
Screenshots
Additional context
The following code uses Terraform to create the downstream RKE1 cluster with 1 master node pool (control plane + etcd), 3 worker pools (system, kafka, and general), and a user addon to create an external load balancer:
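(The full configuration is not reproduced here. Below is a minimal, illustrative sketch of the kind of rancher2 provider resources involved; the resource names, image, VM size, quantities, and Azure values are placeholders, not my actual configuration, and credentials plus the kafka/general worker pools are omitted for brevity.)

```hcl
resource "rancher2_cluster" "downstream" {
  name = "azure-rke"

  rke_config {
    network {
      plugin = "canal"
    }
    # User addon that exposes the ingress controller through a cloud
    # load balancer (see the addon sketch below).
    addons = local.ingress_lb_addon
  }
}

resource "rancher2_node_template" "nodes" {
  name = "azure-private-nodes"

  azure_config {
    # client_id / client_secret / subscription_id omitted in this sketch
    location         = "westeurope"
    resource_group   = "rke-nodes"
    size             = "Standard_D4s_v3"
    image            = "canonical:UbuntuServer:18.04-LTS:latest"
    vnet             = "rke-vnet"
    subnet           = "rke-nodes-subnet"
    availability_set = "rke-nodes-as"
    no_public_ip     = true
    use_private_ip   = true
    managed_disks    = true
  }
}

resource "rancher2_node_pool" "master" {
  cluster_id       = rancher2_cluster.downstream.id
  name             = "master"
  hostname_prefix  = "master-"
  node_template_id = rancher2_node_template.nodes.id
  quantity         = 1
  control_plane    = true
  etcd             = true
  worker           = false
}

resource "rancher2_node_pool" "worker_system" {
  cluster_id       = rancher2_cluster.downstream.id
  name             = "worker-system"
  hostname_prefix  = "worker-system-"
  node_template_id = rancher2_node_template.nodes.id
  quantity         = 3
  worker           = true
}
```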
Addon used to expose the ingress controller using a cloud load balancer:
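(Again, not the original addon; a hedged sketch of what such a user addon could look like, expressed as the local value referenced from `rke_config.addons` in the sketch above. The namespace and selector assume RKE's default nginx ingress deployment, and the cluster needs the Azure cloud provider configured for the Service to actually create a cloud load balancer.)

```hcl
locals {
  # Kubernetes manifest applied by Rancher as a user addon. With the Azure
  # cloud provider enabled, the LoadBalancer Service is turned into a public
  # load balancer in front of the ingress controller. Namespace and labels
  # are assumptions; adjust them to match the actual ingress deployment.
  ingress_lb_addon = <<-EOT
    apiVersion: v1
    kind: Service
    metadata:
      name: ingress-nginx-lb
      namespace: ingress-nginx
    spec:
      type: LoadBalancer
      selector:
        app: ingress-nginx
      ports:
        - name: http
          port: 80
          targetPort: 80
        - name: https
          port: 443
          targetPort: 443
  EOT
}
```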
Information about nodes, pods, and services with the Rancher CLI
Provisioning Log for the cluster
DNS configuration
All nodes have the same DNS config.
Rancher agent container logs
Rancher agent logs of a node stuck in the "Registering" state.