kubernetes-sigs / cloud-provider-azure

Cloud provider for Azure
https://cloud-provider-azure.sigs.k8s.io/
Apache License 2.0

Limit --concurrent-service-syncs to 1 #249

Closed · feiskyer closed this issue 1 month ago

feiskyer commented 4 years ago

What happened:

Refer to https://github.com/kubernetes/kubernetes/issues/84052#issuecomment-545215481.

After looking at logs provided by @MSSedusch, I think the root cause of the bug is the concurrent-service-syncs flag in your cloud-controller-manager. According to your log, it was set to 10, which means if there were multiple service-related changes queued up for reconciliation, they were processed concurrently.

In your case, since there was no existing Azure Load Balancer when the cluster was created, the two services were making two separate CreateOrUpdateLB requests to the Azure LB client concurrently. They were trying to create a load balancer with the same name but each request only contained one FrontendIPConfiguration. In the end, the latter request would overwrite the load balancer’s existing FrontendIPConfiguration, causing it to only have one FrontendIPConfiguration instead of two.

We should limit `--concurrent-service-syncs` to 1.
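
For illustration, here is a minimal, self-contained Go sketch of that last-write-wins behaviour. It uses only the standard library; the `loadBalancer` type and `CreateOrUpdateLB` method are illustrative stand-ins that mimic the full-replace (PUT) semantics of the Azure call, not the actual Azure SDK surface:

```go
// A toy model: CreateOrUpdateLB replaces the whole load balancer object (PUT semantics),
// so two concurrent syncs that each send only their own frontend IP configuration
// end up with whichever request landed last.
package main

import (
	"fmt"
	"sync"
)

type loadBalancer struct {
	Name                     string
	FrontendIPConfigurations []string
}

// fakeAzure stands in for the Azure LB API: CreateOrUpdateLB stores the object as sent,
// discarding anything the request did not include.
type fakeAzure struct {
	mu  sync.Mutex
	lbs map[string]loadBalancer
}

func (f *fakeAzure) CreateOrUpdateLB(lb loadBalancer) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.lbs[lb.Name] = lb // full replace: frontend configs from the other service are lost
}

func main() {
	azure := &fakeAzure{lbs: map[string]loadBalancer{}}

	// Two service syncs run concurrently (concurrent-service-syncs > 1); each builds the
	// shared "kubernetes" load balancer with only its own frontend IP configuration.
	var wg sync.WaitGroup
	for _, fe := range []string{"frontend-svc-a", "frontend-svc-b"} {
		wg.Add(1)
		go func(frontend string) {
			defer wg.Done()
			azure.CreateOrUpdateLB(loadBalancer{
				Name:                     "kubernetes",
				FrontendIPConfigurations: []string{frontend},
			})
		}(fe)
	}
	wg.Wait()

	// Prints one frontend instead of two: whichever request landed last wins.
	fmt.Printf("final frontends: %v\n", azure.lbs["kubernetes"].FrontendIPConfigurations)
}
```

Running it prints a single frontend for the shared "kubernetes" load balancer, which is exactly the symptom described above.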

What you expected to happen:

How to reproduce it:

Anything else we need to know?:

Environment:

JoelSpeed commented 8 months ago

@feiskyer I realise this is an old issue, but do we think this would still be an issue today if we removed the limit of 1 concurrent reconcile?

It seems to me that, rather than limiting the concurrency, we could actually solve the issue and make sure that shared resources are aware of one another. I know, for example, that the GCP LB implementation has a sense of this: it checks for other services and makes sure that shared resources are only operated on by one thread at a time, while other operations that are known not to interfere across the service boundary can continue in parallel.

If the Azure code were to take into account the multiple frontend IPs of services and the simultaneous-creation issue, could this limit be removed?
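
As a rough sketch of the kind of per-shared-resource locking described above (not the actual GCP or Azure implementation), a keyed mutex lets syncs that touch the same load balancer serialize while unrelated syncs continue in parallel. The `keyedMutex` type and `lbNameForService` helper below are hypothetical:

```go
// A keyed mutex: service syncs that map to the same load balancer serialize on that
// load balancer's lock; syncs for unrelated load balancers proceed in parallel.
package main

import (
	"fmt"
	"sync"
	"time"
)

type keyedMutex struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// lockFor returns the mutex for a given key, creating it on first use.
func (k *keyedMutex) lockFor(key string) *sync.Mutex {
	k.mu.Lock()
	defer k.mu.Unlock()
	if k.locks == nil {
		k.locks = map[string]*sync.Mutex{}
	}
	if _, ok := k.locks[key]; !ok {
		k.locks[key] = &sync.Mutex{}
	}
	return k.locks[key]
}

// lbNameForService is a hypothetical stand-in for whatever maps a Service to the
// load balancer it shares with other Services.
func lbNameForService(svc string) string {
	if svc == "svc-a" || svc == "svc-b" {
		return "kubernetes" // these two share the public load balancer
	}
	return "kubernetes-internal"
}

func syncService(km *keyedMutex, svc string) {
	lock := km.lockFor(lbNameForService(svc))
	lock.Lock() // only syncs touching the same load balancer wait here
	defer lock.Unlock()
	fmt.Printf("reconciling %s\n", svc)
	time.Sleep(10 * time.Millisecond) // pretend to call Azure
}

func main() {
	km := &keyedMutex{}
	var wg sync.WaitGroup
	for _, svc := range []string{"svc-a", "svc-b", "svc-c"} {
		wg.Add(1)
		go func(s string) { defer wg.Done(); syncService(km, s) }(svc)
	}
	wg.Wait()
}
```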

JoelSpeed commented 8 months ago

@feiskyer Our team has been testing the current codebase with the limit on concurrent service syncs removed. So far in our testing we have tried 5, 10, and 20 concurrent load balancer creations, starting from no public load balancer at all, and have not seen any of the services being created incorrectly. The load balancer frontend IPs and rules are all being set correctly.

Is it possible that improvements to the code over the last 5 years mean that pinning this to 1 service sync is no longer required?

bridgetkromhout commented 8 months ago

/reopen

k8s-ci-robot commented 8 months ago

@bridgetkromhout: Reopened this issue.

In response to [this](https://github.com/kubernetes-sigs/cloud-provider-azure/issues/249#issuecomment-1944245038):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

bridgetkromhout commented 8 months ago

@feiskyer Please take a look at https://issues.redhat.com/browse/OCPBUGS-29012?focusedId=24121783&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-24121783 and re-evaluate this - thanks!

feiskyer commented 8 months ago

This is still required, as multiple Azure resources (NSG, LB, public IP, VMSS VM) may be updated for a service, and each of them has an ETag to guard consistency. Hence, even with concurrent service syncs, we would have to make those Azure resource update operations sequential to reduce the number of Azure requests (otherwise, more requests would be generated because of HTTP 412 errors, and that may create more issues around throttling).
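
A hedged sketch of the ETag-guarded (optimistic concurrency) update pattern described above, using only the Go standard library; the `store` type and its If-Match-style `Update` are illustrative, not the Azure SDK. Two syncs that read the same ETag before writing will conflict: the loser gets a 412-style error and has to re-read and resend, which is the extra request volume mentioned.

```go
// A toy model of ETag-guarded updates: Update only succeeds when the caller presents
// the latest ETag (If-Match semantics). Two syncs that both read before either writes
// will conflict: the second write fails with a 412-style error.
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errPreconditionFailed = errors.New("412 precondition failed: etag mismatch")

type resource struct {
	ETag int
	Body string
}

type store struct {
	mu  sync.Mutex
	res resource
}

// Get returns the current resource, including its ETag.
func (s *store) Get() resource {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.res
}

// Update applies the new body only if ifMatch equals the stored ETag, then bumps the ETag.
func (s *store) Update(ifMatch int, body string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if ifMatch != s.res.ETag {
		return errPreconditionFailed
	}
	s.res = resource{ETag: s.res.ETag + 1, Body: body}
	return nil
}

func main() {
	s := &store{}

	// Two concurrent service syncs both read the shared resource before either writes.
	readA, readB := s.Get(), s.Get()

	fmt.Println("sync A update:", s.Update(readA.ETag, "rules for svc-a")) // succeeds, ETag bumps
	fmt.Println("sync B update:", s.Update(readB.ETag, "rules for svc-b")) // fails: ETag is stale
	// Sync B now has to GET again and resend the update, i.e. extra requests to Azure.
}
```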

JoelSpeed commented 7 months ago

In our scale testing we have seen no adverse effects or 412 conflict errors, even when creating 20 services simultaneously. I wonder if you could expand on why or when the ETag 412 errors would appear?

With the ETag, is the expectation that it behaves like a resourceVersion in Kubernetes, where you must send the latest stored ETag when sending an update?

Would we not expect the Azure service controller to handle this 412 error anyway? Azure says to try again, so when this error occurs the CCM requeues the request and should be successful on the next attempt?
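
For what it's worth, a toy sketch of the requeue-on-error behaviour being described: when a sync returns an error (here, a simulated 412), the key goes back on the queue and succeeds on the next attempt. This mimics the pattern of a rate-limited controller workqueue with a plain channel and is not the actual cloud-controller-manager code.

```go
// A toy reconcile loop: when a sync fails (here with a simulated 412), the service key
// is put back on the queue after a delay and succeeds on the next attempt.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errConflict = errors.New("412 precondition failed")

// syncService fails on the first attempt (stale ETag) and succeeds once retried.
func syncService(attempts map[string]int, key string) error {
	attempts[key]++
	if attempts[key] == 1 {
		return errConflict
	}
	return nil
}

func main() {
	queue := make(chan string, 10)
	queue <- "default/my-service"

	attempts := map[string]int{}
	for {
		select {
		case key := <-queue:
			if err := syncService(attempts, key); err != nil {
				fmt.Printf("sync %s failed (%v), requeueing\n", key, err)
				go func() { time.Sleep(50 * time.Millisecond); queue <- key }() // crude backoff
				continue
			}
			fmt.Printf("sync %s succeeded on attempt %d\n", key, attempts[key])
		case <-time.After(200 * time.Millisecond):
			return // queue stayed empty; end the demo
		}
	}
}
```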

JoelSpeed commented 7 months ago

Reviewing some of the code again today, I can see that it already accounts for the ETag precondition failures. It seems the only issue here is that there may be some duplicated requests sent to the Azure APIs. Given the frequency of updates to these kinds of resources, do we really expect those duplicated requests to cause an issue?

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 2 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cloud-provider-azure/issues/249#issuecomment-2254547425):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

JoelSpeed commented 2 months ago

/reopen

@feiskyer Did you get a chance to review my last comment?

k8s-ci-robot commented 2 months ago

@JoelSpeed: Reopened this issue.

In response to [this](https://github.com/kubernetes-sigs/cloud-provider-azure/issues/249#issuecomment-2255467258):

> /reopen
>
> @feiskyer Did you get a chance to review my last comment?

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cloud-provider-azure/issues/249#issuecomment-2314853905):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.