Closed: feiskyer closed this issue 1 month ago
@feiskyer I realise this is an old issue, but do we think this would still be a problem today if we removed the limit of 1 concurrent reconcile?
It seems to me that, rather than limiting the concurrency, we could actually solve the issue by making reconciles of shared resources aware of one another. I know, for example, that the GCP LB implementation has a sense of this: it checks for other services and makes sure that shared resources are only operated on by one thread at a time, while other operations known not to interfere across service boundaries can continue in parallel.
If the Azure code were to take into account the multiple frontend IPs of services and the simultaneous-creation issue, could this limit be removed?
@feiskyer Our team has been testing the current codebase with the limit on concurrent service syncs removed. So far we have tried 5, 10, and 20 concurrent load balancer creations, starting from no public load balancer at all, and have not seen any services created incorrectly. The load balancer frontend IPs and rules are all set correctly.
Is it possible that improvements to the code over the last 5 years mean this restriction to 1 service sync is no longer required?
/reopen
@bridgetkromhout: Reopened this issue.
@feiskyer Please take a look at https://issues.redhat.com/browse/OCPBUGS-29012?focusedId=24121783&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-24121783 and re-evaluate this - thanks!
this is still required, as multiple Azure resources (NSG, LB, public IP, VMSS VM) may be updated for a service, and each of these has an ETag to guard its consistency. Hence, even with concurrent service syncs, we would have to make those Azure resource update operations sequential to reduce the number of Azure requests (otherwise, more requests would fail with HTTP 412 errors, and the retries may cause further issues around throttling).
In our scale testing we have seen no adverse effects or 412 conflict errors, even when creating 20 services simultaneously. Could you expand on why or when the ETag 412 errors would appear?
With the ETag, is the expectation that it behaves like a resourceVersion in Kubernetes, where you must send the latest stored ETag when issuing an update?
Would we not expect the Azure service controller to handle this 412 error anyway? Azure says to try again on this error, so the CCM requeues the request and should succeed on the next attempt?
Reviewing some of the code again today, I can see that it already accounts for ETag precondition failures. It seems the only issue here is that some duplicated requests may be sent to the Azure APIs. Given the frequency of updates to these kinds of resources, do we really expect those duplicated requests to cause a problem?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
@feiskyer Did you get a chance to review my last comment?
@JoelSpeed: Reopened this issue.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
What happened:
Refer https://github.com/kubernetes/kubernetes/issues/84052#issuecomment-545215481.
After looking at logs provided by @MSSedusch, I think the root cause of the bug is the concurrent-service-syncs flag in your cloud-controller-manager. According to your log, it was set to 10, which means if there were multiple service-related changes queued up for reconciliation, they were processed concurrently.
In your case, since there was no existing Azure load balancer when the cluster was created, the two services made two separate CreateOrUpdateLB requests to the Azure LB client concurrently. They were trying to create a load balancer with the same name, but each request contained only one FrontendIPConfiguration. As a result, the later request overwrote the load balancer's existing FrontendIPConfiguration, leaving it with one FrontendIPConfiguration instead of two.
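The lost update described above can be sketched in a few lines (illustrative types, not the real CreateOrUpdateLB signature): an ARM-style full PUT replaces the stored spec rather than merging with it, so the second request erases the first service's frontend.

```go
package main

import "fmt"

// LoadBalancer is a hypothetical, stripped-down stand-in for the Azure LB
// spec; only the fields needed to show the overwrite are modeled.
type LoadBalancer struct {
	Name      string
	Frontends []string
}

// createOrUpdateLB models an ARM-style full PUT: the submitted spec replaces
// whatever is stored under that name, it is not merged with it.
func createOrUpdateLB(store map[string]LoadBalancer, lb LoadBalancer) {
	store[lb.Name] = lb
}

func main() {
	store := map[string]LoadBalancer{}

	// Service A and service B reconcile concurrently, each starting from
	// "no load balancer exists" and therefore sending only its own frontend.
	createOrUpdateLB(store, LoadBalancer{Name: "kubernetes", Frontends: []string{"svc-a-fip"}})
	createOrUpdateLB(store, LoadBalancer{Name: "kubernetes", Frontends: []string{"svc-b-fip"}})

	// The second PUT overwrote the first: one frontend instead of two.
	fmt.Println(store["kubernetes"].Frontends) // [svc-b-fip]
}
```

With syncs serialized, the second reconcile would instead read the existing load balancer and append its frontend to the list.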
We should limit --concurrent-service-syncs to 1
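For reference, this maps to the controller-manager flag of the same name; the binary name and surrounding flags below are illustrative, only --concurrent-service-syncs=1 is the point:

```shell
# Serialize service reconciles so Azure LB/NSG/IP updates never race.
cloud-controller-manager \
  --cloud-provider=azure \
  --concurrent-service-syncs=1
```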
What you expected to happen:
How to reproduce it:
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
- Kernel (e.g. uname -a):