kubernetes-sigs / cluster-api-provider-azure

Cluster API implementation for Microsoft Azure
https://capz.sigs.k8s.io/
Apache License 2.0

When running a workload with a single control plane node the load balancers take 15 mins to provision #857

Closed jsturtevant closed 2 years ago

jsturtevant commented 4 years ago

/kind bug

status (as of 6/10/21):

What steps did you take and what happened: When running a workload cluster with a single control plane node, the load balancers take 15 minutes to provision.

Add the following to the Creating a single control-plane cluster with 1 worker node e2e test after cluster creation:

AzureLBSpec(ctx, func() AzureLBSpecInput {
    return AzureLBSpecInput{
        BootstrapClusterProxy: bootstrapClusterProxy,
        Namespace:             namespace,
        ClusterName:           clusterName,
        SkipCleanup:           skipCleanup,
    }
})

Run the e2e test and the test will fail:

./scripts/ci-e2e.sh

## commented out

Workload cluster creation
[1] /home/jstur/projects/cluster-api-provider-azure/test/e2e/azure_test.go:36
[1]   Creating a single control-plane cluster
[1]   /home/jstur/projects/cluster-api-provider-azure/test/e2e/azure_test.go:71
[1]     With 1 worker node [It]
[1]     /home/jstur/projects/cluster-api-provider-azure/test/e2e/azure_test.go:72
[1]
[1]     Timed out after 180.000s.
[1]     Service default/ingress-nginx-ilb failed to get an IP for LoadBalancer.Ingress
[1]     Expected
[1]         <bool>: false
[1]     to be true
[1]
[1]     /home/jstur/projects/cluster-api-provider-azure/test/e2e/helpers.go:97

If you connect to the workload cluster you will see that the service with the load balancer is there, and after 15 minutes it will provision. Subsequent services with load balancers will provision quickly. The controller manager logs will contain:

E0802 23:30:42.090814       1 azure_vmss.go:1116] EnsureHostInPool(default/ingress-nginx-ilb): backendPoolID(/subscriptions/b9d9436a-0c07-4fe8-b779-2c1030bd7997/resourceGroups/capz-e2e-72fll1/providers/Microsoft.Network/loadBalancers/capz-e2e-72fll1-internal/backendAddressPools/capz-e2e-72fll1) - failed to ensure host in pool: "not a vmss instance"

What did you expect to happen: The tests should be able to provision a workload cluster and pass an e2e test that creates a load balancer.
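For reference, the failing assertion at test/e2e/helpers.go:97 is waiting for the Service to be assigned a LoadBalancer ingress IP. A minimal client-go sketch of that kind of check is shown below; it is not the actual CAPZ helper, and the function name, polling interval, and timeout are illustrative only:

// Sketch: poll a Service until it gets a LoadBalancer ingress IP.
// This is an illustration, not the CAPZ e2e helper code.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func waitForLoadBalancerIP(ctx context.Context, cs kubernetes.Interface, namespace, name string, timeout time.Duration) (string, error) {
	var ip string
	err := wait.PollImmediate(10*time.Second, timeout, func() (bool, error) {
		svc, err := cs.CoreV1().Services(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			// Treat transient API errors as "not ready yet" and keep polling.
			return false, nil
		}
		for _, ingress := range svc.Status.LoadBalancer.Ingress {
			if ingress.IP != "" {
				ip = ingress.IP
				return true, nil
			}
		}
		return false, nil
	})
	return ip, err
}

func main() {
	// Build a client from the local kubeconfig pointing at the workload cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	ip, err := waitForLoadBalancerIP(context.Background(), cs, "default", "ingress-nginx-ilb", 3*time.Minute)
	if err != nil {
		panic(err)
	}
	fmt.Println("LoadBalancer IP:", ip)
}

Given the cloud-provider cache behavior discussed later in the thread, a check like this times out at 180s even though the Service eventually gets an IP.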

Anything else you would like to add:

This is related to https://github.com/kubernetes-sigs/cloud-provider-azure/issues/338

Environment:

alexeldeib commented 4 years ago

I guess this is another code path related to not using Availability Sets? We should probably consider that as a mitigation. Anything that tries to look up IDs will fail with our current setup; it's hard to track down all the places individually.

jsturtevant commented 4 years ago

It is related, but it is not the root cause. The root cause is the cache used in the controller-manager. I provided more details in kubernetes-sigs/cloud-provider-azure#363.

I believe this could cause delays in a customer scenario where a node is added after the cluster is provisioned: LB provisioning is held up because the cache doesn't know about the node.
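To illustrate why the delay lines up with a cache TTL, here is a simplified sketch; it is not the actual cloud-provider-azure code, and the type and field names are made up:

// Sketch of a TTL-based node cache. A node created after the last refresh
// is invisible to lookups until the TTL expires, which is what produces the
// long "not a vmss instance" / node-not-found window described above.
package main

import (
	"fmt"
	"sync"
	"time"
)

type nodeCache struct {
	mu        sync.Mutex
	ttl       time.Duration
	refreshed time.Time
	nodes     map[string]bool
	// listNodes stands in for the expensive Azure API call that repopulates the cache.
	listNodes func() map[string]bool
}

func (c *nodeCache) getNode(name string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	// Refresh only when the TTL has expired; otherwise serve stale data.
	if c.nodes == nil || time.Since(c.refreshed) > c.ttl {
		c.nodes = c.listNodes()
		c.refreshed = time.Now()
	}
	if !c.nodes[name] {
		return fmt.Errorf("node %q not found in cached node list", name)
	}
	return nil
}

func main() {
	cloud := map[string]bool{"control-plane-0": true}
	cache := &nodeCache{
		ttl: 15 * time.Minute,
		listNodes: func() map[string]bool {
			snapshot := map[string]bool{}
			for k := range cloud {
				snapshot[k] = true
			}
			return snapshot
		},
	}

	_ = cache.getNode("control-plane-0") // first lookup populates the cache
	cloud["worker-0"] = true             // a node joins after the refresh

	// Every lookup for the new node fails until the 15 minute TTL elapses.
	fmt.Println(cache.getNode("worker-0"))
}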

jsturtevant commented 3 years ago

I ran into this again while trying to set up a single control plane test for Windows. This appears to be an issue only in the VMAS scenario. There is a VMSS test that uses a single node: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/cb486c3d7b26dc84ad5156fa97773ccb97578ebe/test/e2e/azure_test.go#L260-L261

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

CecileRobertMichon commented 3 years ago

/remove-lifecycle stale

lastcoolnameleft commented 3 years ago

FYI, I hit this issue today.

The only changes I made from the default capi-quickstart.yaml were allocate-node-cidrs: "true" and version: v1.19.9, and then I installed an Ingress Controller via our docs.

Let me know if you'd like for me to provide the full yaml.

The public IP is available now and the ingress works; however, as you can see from the logs, it took ~10 minutes.

I0511 17:15:50.069909       1 range_allocator.go:373] Set node capi-quickstart-md-0-9hhjr PodCIDR to [192.168.2.0/24]
E0511 17:15:53.063745       1 node_lifecycle_controller.go:149] error checking if node capi-quickstart-md-0-grzqf exists: not a vmss instance
E0511 17:15:53.063832       1 node_lifecycle_controller.go:149] error checking if node capi-quickstart-md-0-9hhjr exists: not a vmss instance
W0511 17:15:53.645025       1 node_lifecycle_controller.go:1044] Missing timestamp for Node capi-quickstart-md-0-9hhjr. Assuming now as a timestamp.
I0511 17:15:53.645236       1 event.go:291] "Event occurred" object="capi-quickstart-md-0-9hhjr" kind="Node" apiVersion="v1" type="Normal" reason="RegisteredNode" message="Node capi-quickstart-md-0-9hhjr event: Registered Node capi-quickstart-md-0-9hhjr in Controller"
E0511 17:15:58.064296       1 node_lifecycle_controller.go:149] error checking if node capi-quickstart-md-0-9hhjr exists: not a vmss instance
E0511 17:15:58.064596       1 node_lifecycle_controller.go:149] error checking if node capi-quickstart-md-0-grzqf exists: not a vmss instance
I0511 17:16:02.595945       1 route_controller.go:213] Created route for node capi-quickstart-md-0-grzqf 192.168.1.0/24 with hint 5f107155-a08a-44ac-8cb7-ead0da2e3a50 after 18.211955581s
I0511 17:16:02.595993       1 route_controller.go:303] Patching node status capi-quickstart-md-0-grzqf with true previous condition was:nil
....

I0511 17:23:59.859609       1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: not a vmss instance"
I0511 17:26:39.859458       1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
E0511 17:26:40.039313       1 azure_vmss.go:1229] EnsureHostInPool(ingress-basic/nginx-ingress-ingress-nginx-controller): backendPoolID(/subscriptions/df8428d4-bc25-4601-b458-1c8533ceec0b/resourceGroups/capi-quickstart/providers/Microsoft.Network/loadBalancers/capi-quickstart/backendAddressPools/capi-quickstart) - failed to ensure host in pool: "not a vmss instance"
E0511 17:26:40.039342       1 azure_vmss.go:1229] EnsureHostInPool(ingress-basic/nginx-ingress-ingress-nginx-controller): backendPoolID(/subscriptions/df8428d4-bc25-4601-b458-1c8533ceec0b/resourceGroups/capi-quickstart/providers/Microsoft.Network/loadBalancers/capi-quickstart/backendAddressPools/capi-quickstart) - failed to ensure host in pool: "not a vmss instance"
E0511 17:26:40.039367       1 azure_vmss.go:1229] EnsureHostInPool(ingress-basic/nginx-ingress-ingress-nginx-controller): backendPoolID(/subscriptions/df8428d4-bc25-4601-b458-1c8533ceec0b/resourceGroups/capi-quickstart/providers/Microsoft.Network/loadBalancers/capi-quickstart/backendAddressPools/capi-quickstart) - failed to ensure host in pool: "not a vmss instance"
E0511 17:26:40.039383       1 azure_loadbalancer.go:162] reconcileLoadBalancer(ingress-basic/nginx-ingress-ingress-nginx-controller) failed: not a vmss instance
E0511 17:26:40.039435       1 controller.go:275] error processing service ingress-basic/nginx-ingress-ingress-nginx-controller (will retry): failed to ensure load balancer: not a vmss instance
I0511 17:26:40.039740       1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: not a vmss instance"
I0511 17:31:40.040301       1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I0511 17:31:52.158543       1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Normal" reason="EnsuredLoadBalancer" message="Ensured load balancer"

lastcoolnameleft commented 3 years ago

Oh, I also tried installing the Flannel CNI, but I don't think that should have impacted it.

devigned commented 3 years ago

@lastcoolnameleft I think if you use the external cloud provider, there are fixes available for this issue (see above in the thread where #1216 is linked).

Rather than using the default template, you'd use --flavor external-cloud-provider.

As an aside, perhaps we should use the out-of-tree provider by default...

CecileRobertMichon commented 3 years ago

@devigned @lastcoolnameleft unfortunately, the current version of external-cloud-provider we're using in the example template in CAPZ v0.4 doesn't have the fix yet. The PR to bump the version (#1323) and enable the test that validates this behavior was blocked by another regression in cloud-provider, for which a fix has now been released. You can work around it for now by editing your template to use cloud-provider version v0.7.4+, until we update the reference template.

The in-tree fix will be in k8s 1.22+.

Regarding using out-of-tree by default: v1.0.0 of the out-of-tree provider just got released, so it might be a good time for that; tracking in #715.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/857#issuecomment-1079961476):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue or PR with `/reopen`
> - Mark this issue or PR as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

shysank commented 2 years ago

/remove-lifecycle rotten

CecileRobertMichon commented 2 years ago

I think we can close this now. This was fixed in the external cloud provider v0.7.4+ and k8s 1.22+.

/close

k8s-ci-robot commented 2 years ago

@CecileRobertMichon: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/857#issuecomment-1081108867):

> I think we can close this now. This was fixed in the external cloud provider v0.7.4+ and k8s 1.22+.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.