kubernetes-sigs / cluster-api-provider-azure

Cluster API implementation for Microsoft Azure
https://capz.sigs.k8s.io/
Apache License 2.0
2 seconds is too short for azure api call #4276

Closed MartinForReal closed 9 months ago

MartinForReal commented 10 months ago

2 seconds is too short for an Azure API call: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/fc261f0df8cfe4f783039d8c182c616d18843685/util/reconciler/defaults.go#L33

NSG, subnet, and route table requests were cancelled because the context expired.

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cloud-provider-azure/4997/pull-cloud-provider-azure-e2e-capz/1724631304445628416/artifacts/clusters/bootstrap/controllers/capz-controller-manager/capz-controller-manager-69487dcc9f-4pzvp/manager.log

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cloud-provider-azure/4968/pull-cloud-provider-azure-e2e-capz/1724664448301404160/artifacts/clusters/bootstrap/controllers/capz-controller-manager/capz-controller-manager-69487dcc9f-z7b5v/manager.log

troy0820 commented 10 months ago

/kind bug support

nojnhuh commented 10 months ago

That timeout applies to an individual reconciliation. For example, if CAPZ starts to create an NSG, it will wait 2s in that same reconciliation for the operation to finish. If the operation doesn't finish in time, CAPZ saves a handle to the ongoing operation and requeues the resource, then checks that same operation using the handle in the next reconciliation. This allows CAPZ to handle Azure operations that take much longer than the default timeout. Do you see any evidence that this isn't behaving that way?

The only context-related errors I see in the logs seem transient since the CAPZ resources do eventually report being ready.

(Also, links to the full runs for my own convenience:)
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cloud-provider-azure/4997/pull-cloud-provider-azure-e2e-capz/1724631304445628416
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cloud-provider-azure/4968/pull-cloud-provider-azure-e2e-capz/1724664448301404160

CecileRobertMichon commented 10 months ago

see also: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/docs/proposals/20210716-async-azure-resource-creation-deletion.md