I guess this is another code path related to not using Availability Sets? We should probably consider that as a mitigation. Anything that tries to look up IDs will fail with our current setup, and it's hard to track down all the places individually.
It is related, but it's not the root cause. The root cause is the cache used in the controller-manager. I provided more details in kubernetes-sigs/cloud-provider-azure#363.
I believe this could cause delays in a customer scenario where a node is added after the cluster is provisioned, delaying LB provisioning because the cache doesn't know about the node.
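To make the failure mode concrete, here is a deliberately simplified, hypothetical sketch (the type and function names are made up for illustration and are not the actual cloud-provider-azure code): the controller-manager holds a cached view of which VMs exist, a node created after the last refresh stays invisible until the cache expires, and every reconcile that looks the node up fails and is retried with backoff.

package main

import (
	"errors"
	"fmt"
	"time"
)

// nodeCache is a toy stand-in for the provider's cached view of VMs.
type nodeCache struct {
	vms       map[string]bool
	refreshed time.Time
	ttl       time.Duration
}

// has only "rediscovers" a node once the cache TTL has expired, mimicking
// how a node added after the last refresh stays invisible until then.
func (c *nodeCache) has(node string) bool {
	if time.Since(c.refreshed) > c.ttl {
		c.vms[node] = true
		c.refreshed = time.Now()
	}
	return c.vms[node]
}

// ensureHostInPool is a made-up name for "add this node to the LB backend pool".
func ensureHostInPool(c *nodeCache, node string) error {
	if !c.has(node) {
		return errors.New("instance not found in cache")
	}
	return nil
}

func main() {
	c := &nodeCache{vms: map[string]bool{}, refreshed: time.Now(), ttl: 1500 * time.Millisecond}
	for {
		if err := ensureHostInPool(c, "capi-quickstart-md-0-xxxxx"); err != nil {
			fmt.Println("reconcile failed, retrying:", err) // the real controller backs off for minutes
			time.Sleep(time.Second)
			continue
		}
		fmt.Println("reconcile succeeded once the cache refreshed")
		return
	}
}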
I ran into this again trying to set up a single control plane test for Windows. This appears to be an issue only in the VMAS scenario. There is a VMSS test that uses a single node: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/cb486c3d7b26dc84ad5156fa97773ccb97578ebe/test/e2e/azure_test.go#L260-L261
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
FYI, I hit this issue today.
The only changes I made from the default capi-quickstart.yaml were allocate-node-cidrs: "true" and version: v1.19.9, and then I installed an Ingress Controller via our docs.
Let me know if you'd like for me to provide the full yaml.
The public IP is available now and the ingress works; however, as you can see from the logs, it took ~10 minutes.
I0511 17:15:50.069909 1 range_allocator.go:373] Set node capi-quickstart-md-0-9hhjr PodCIDR to [192.168.2.0/24]
E0511 17:15:53.063745 1 node_lifecycle_controller.go:149] error checking if node capi-quickstart-md-0-grzqf exists: not a vmss instance
E0511 17:15:53.063832 1 node_lifecycle_controller.go:149] error checking if node capi-quickstart-md-0-9hhjr exists: not a vmss instance
W0511 17:15:53.645025 1 node_lifecycle_controller.go:1044] Missing timestamp for Node capi-quickstart-md-0-9hhjr. Assuming now as a timestamp.
I0511 17:15:53.645236 1 event.go:291] "Event occurred" object="capi-quickstart-md-0-9hhjr" kind="Node" apiVersion="v1" type="Normal" reason="RegisteredNode" message="Node capi-quickstart-md-0-9hhjr event: Registered Node capi-quickstart-md-0-9hhjr in Controller"
E0511 17:15:58.064296 1 node_lifecycle_controller.go:149] error checking if node capi-quickstart-md-0-9hhjr exists: not a vmss instance
E0511 17:15:58.064596 1 node_lifecycle_controller.go:149] error checking if node capi-quickstart-md-0-grzqf exists: not a vmss instance
I0511 17:16:02.595945 1 route_controller.go:213] Created route for node capi-quickstart-md-0-grzqf 192.168.1.0/24 with hint 5f107155-a08a-44ac-8cb7-ead0da2e3a50 after 18.211955581s
I0511 17:16:02.595993 1 route_controller.go:303] Patching node status capi-quickstart-md-0-grzqf with true previous condition was:nil
....
I0511 17:23:59.859609 1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: not a vmss instance"
I0511 17:26:39.859458 1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
E0511 17:26:40.039313 1 azure_vmss.go:1229] EnsureHostInPool(ingress-basic/nginx-ingress-ingress-nginx-controller): backendPoolID(/subscriptions/df8428d4-bc25-4601-b458-1c8533ceec0b/resourceGroups/capi-quickstart/providers/Microsoft.Network/loadBalancers/capi-quickstart/backendAddressPools/capi-quickstart) - failed to ensure host in pool: "not a vmss instance"
E0511 17:26:40.039342 1 azure_vmss.go:1229] EnsureHostInPool(ingress-basic/nginx-ingress-ingress-nginx-controller): backendPoolID(/subscriptions/df8428d4-bc25-4601-b458-1c8533ceec0b/resourceGroups/capi-quickstart/providers/Microsoft.Network/loadBalancers/capi-quickstart/backendAddressPools/capi-quickstart) - failed to ensure host in pool: "not a vmss instance"
E0511 17:26:40.039367 1 azure_vmss.go:1229] EnsureHostInPool(ingress-basic/nginx-ingress-ingress-nginx-controller): backendPoolID(/subscriptions/df8428d4-bc25-4601-b458-1c8533ceec0b/resourceGroups/capi-quickstart/providers/Microsoft.Network/loadBalancers/capi-quickstart/backendAddressPools/capi-quickstart) - failed to ensure host in pool: "not a vmss instance"
E0511 17:26:40.039383 1 azure_loadbalancer.go:162] reconcileLoadBalancer(ingress-basic/nginx-ingress-ingress-nginx-controller) failed: not a vmss instance
E0511 17:26:40.039435 1 controller.go:275] error processing service ingress-basic/nginx-ingress-ingress-nginx-controller (will retry): failed to ensure load balancer: not a vmss instance
I0511 17:26:40.039740 1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: not a vmss instance"
I0511 17:31:40.040301 1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I0511 17:31:52.158543 1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Normal" reason="EnsuredLoadBalancer" message="Ensured load balancer"
Oh, I also tried installing the Flannel CNI, but I don't think that should have impacted it.
@lastcoolnameleft I think if you use the external cloud provider there are fixes available for this issue (see above in thread where #1216 is linked).
Rather than using the default template, you'd use --flavor external-cloud-provider.
As an aside, perhaps we should use the out-of-tree provider by default...
@devigned @lastcoolnameleft unfortunately, the current version of external-cloud-provider we're using in the example template in CAPZ v0.4 doesn't have the fix yet. The PR to bump the version (#1323) and enable the test that validates this behavior was blocked by another regression in cloud-provider, for which a fix is now released. You can work around it for now by editing your template to use version v0.7.4+ of cloud-provider, until we update the reference template.
The in-tree fix will be in k8s 1.22+.
Regarding using out of tree by default: v1.0.0 of the out-of-tree provider just got released, so it might be a good time for that; tracking in #715.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
Mark this issue as fresh with /remove-lifecycle stale
Mark this issue as rotten with /lifecycle rotten
Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
/remove-lifecycle rotten
I think we can close this now. This was fixed in the external cloud provider v0.7.4+ and k8s 1.22+.
/close
@CecileRobertMichon: Closing this issue.
/kind bug
status (as of 6/10/21):
What steps did you take and what happened: When running a workload cluster with a single control plane node, the load balancers take 15 minutes to provision.
Add the following to the "Creating a single control-plane cluster with 1 worker node" e2e test after cluster creation (a hypothetical sketch of this kind of addition is shown below). Run the e2e test and the test will fail.
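The snippet itself isn't reproduced above; as a rough illustration only, a minimal sketch of this kind of check, assuming a controller-runtime client pointed at the workload cluster (workloadClient, the service name, namespace, and timeouts are placeholders, not the actual CAPZ test code), could look like:

package e2e

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// verifyLoadBalancerProvisioning creates a Service of type LoadBalancer in the
// workload cluster and waits for an external IP. With a single control plane
// node on VMAS this takes ~15 minutes, so a shorter timeout makes the test fail.
func verifyLoadBalancerProvisioning(ctx context.Context, workloadClient client.Client) {
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "lb-check", Namespace: "default"},
		Spec: corev1.ServiceSpec{
			Type:     corev1.ServiceTypeLoadBalancer,
			Selector: map[string]string{"app": "lb-check"},
			Ports:    []corev1.ServicePort{{Port: 80}},
		},
	}
	Expect(workloadClient.Create(ctx, svc)).To(Succeed())

	Eventually(func() bool {
		current := &corev1.Service{}
		if err := workloadClient.Get(ctx, client.ObjectKeyFromObject(svc), current); err != nil {
			return false
		}
		return len(current.Status.LoadBalancer.Ingress) > 0
	}, 5*time.Minute, 10*time.Second).Should(BeTrue(), "service should get an external IP")
}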
If you connect to the workload cluster, you will see that the service with the load balancer is there and that it provisions after about 15 minutes. Subsequent services with load balancers will provision quickly. The logs of the controller manager will contain:
What did you expect to happen: The tests should be able to provision a workload cluster and pass an e2e test that creates a load balancer.
Anything else you would like to add:
This is related to https://github.com/kubernetes-sigs/cloud-provider-azure/issues/338
Environment:
Kubernetes version (use kubectl version):
OS (e.g. from /etc/os-release):