karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
https://karmada.io
Apache License 2.0

Binding-status controller update workload's status error rate too high #4093

Closed: zach593 closed this 10 months ago

zach593 commented 1 year ago

Please provide an in-depth description of the question you have:

In our environment, we observed that the binding-status controller has a 10% - 35% chance of getting a 409 (Conflict) HTTP error when updating a workload resource's status, which makes the binding-status controller's reconciliation error rate also 10% - 35%.

Just fixing the surface symptom is easy; there are many ways to fix this, such as using a JSON patch instead of an update (#4094 if you like), or getting the newest object before updating. But I'm still confused about what the root cause is.
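For illustration, a minimal sketch of the "get the newest object before update" idea could look like the following, using client-go's retry.RetryOnConflict with the dynamic client. The function name, its signature, and the newStatus parameter are made up for this sketch and are not code from the repo:

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/util/retry"
)

// updateStatusWithRetry re-reads the latest copy of the object from the API
// server on every attempt, so the resourceVersion sent with the update is
// always current, and retries automatically on 409 Conflict.
func updateStatusWithRetry(ctx context.Context, client dynamic.Interface,
	gvr schema.GroupVersionResource, namespace, name string,
	newStatus map[string]interface{}) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// The GET goes to the API server, not the (possibly stale) informer cache.
		obj, err := client.Resource(gvr).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if err := unstructured.SetNestedMap(obj.Object, newStatus, "status"); err != nil {
			return err
		}
		_, err = client.Resource(gvr).Namespace(namespace).UpdateStatus(ctx, obj, metav1.UpdateOptions{})
		return err
	})
}
```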

Usually, after we update a workload's spec once, the binding-status controller is triggered to update the binding and workload status 8 - 15 times, but the HTTP PUT 409 error rate when updating the ResourceBinding is only 3% - 5%. I highly suspect the informer for untyped objects is slow, but it's hard to measure the duration from a successful update to the corresponding watch update event. BTW, this happens in both the dev and prod environments, and the dev env only has very few objects.

And here are the questions: Has anyone ever run into this? Does anyone have any research on it?

What do you think about this question?:

Environment:

XiShanYongYe-Chang commented 1 year ago

Hi @zach593, thanks for your feedback! I'm also curious about the root cause, which I haven't researched before.

Updates to the status of resource templates are currently handled primarily by the binding-status-controller, and the only events that currently trigger the binding-status-controller are Work updates, so I'd like to know how many clusters you're distributing that workload to and how many Work objects are being generated.

zach593 commented 1 year ago

In the dev environment, there are about 30 resource templates, and the conflict probability is 10% - 20%; in the prod environment, there are about 7000 resource templates, and the conflict probability is 20% - 35%.

XiShanYongYe-Chang commented 1 year ago

Hi @zach593, thanks for your reply.

Sorry, I may not have made myself clear. My understanding is that it's the status collection for the same workload that affects the success rate of its updates, so I'm trying to understand how many clusters a single workload is distributed to.

zach593 commented 1 year ago

@XiShanYongYe-Chang Oops, I missed part of your question. Currently we only distribute these objects to 1 - 2 clusters, but there are also objects that are distributed to only one cluster and still trigger conflicts.

XiShanYongYe-Chang commented 1 year ago

@zach593 Thanks for your response, I'm sorry for replying late.

From the data you provided, the conflict rate doesn't feel like it should be so high. We'll continue to analyze the root cause later; for now, we'll keep moving forward with the #4094 solution.

zach593 commented 1 year ago

@XiShanYongYe-Chang The reason I used JSONPatchType instead of MergePatchType is that I am worried some fields might not be deleted when patching the status, because we neither understand nor can modify the definition of the resource templates and the logic of the related controllers. We even provide a webhook so that users can customize the status.

If you think jsonpatch is acceptable, I can move this part of the construction logic to helper/patch.go (I didn't notice this file existed before); or if you think merge is still a better choice, then I will update this part.
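To make the shape of that construction logic concrete, here's a rough sketch (not the actual #4094 code) of building a JSONPatchType payload that replaces the whole .status through the dynamic client; the helper name and the newStatus parameter are invented for illustration:

```go
package example

import (
	"context"
	"encoding/json"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// jsonPatchOp is a single RFC 6902 operation.
type jsonPatchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

// patchStatus replaces .status in a single operation, so fields that have
// disappeared from the aggregated status do not linger (the concern with
// MergePatchType described above). Note that "replace" requires .status to
// already exist; "add" would also cover the case where it does not.
func patchStatus(ctx context.Context, client dynamic.Interface,
	gvr schema.GroupVersionResource, namespace, name string,
	newStatus map[string]interface{}) error {
	patch, err := json.Marshal([]jsonPatchOp{
		{Op: "replace", Path: "/status", Value: newStatus},
	})
	if err != nil {
		return err
	}
	_, err = client.Resource(gvr).Namespace(namespace).Patch(
		ctx, name, types.JSONPatchType, patch, metav1.PatchOptions{}, "status")
	return err
}
```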


should reply in https://github.com/karmada-io/karmada/pull/4094#discussion_r1363535452

XiShanYongYe-Chang commented 1 year ago

If you think jsonpatch is acceptable, I can move this part of the construction logic to helper/patch.go

+1

I think this will be good :)

RainbowMango commented 11 months ago

Just fixing the surface symptom is easy; there are many ways to fix this, such as using a JSON patch instead of an update (https://github.com/karmada-io/karmada/pull/4094 if you like), or getting the newest object before updating. But I'm still confused about what the root cause is.

Confused +1. The conflict mentioned in this issue is due to the resource template being updated (.metadata.resourceVersion changed). We get the resource template from the informer cache and then update it after aggregating the status, and the Karmada system is unlikely to update the resource template during this period.

I highly suspect the informer for untyped objects is slow, but it's hard to measure the duration from a successful update to the corresponding watch update event.

+1 It is worth figuring out the root cause. If it is due to the informer, then we can consider optimizing the informer or not relying on it.

Do you know how many resource templates are needed to easily reproduce this problem?

RainbowMango commented 11 months ago

I made a sample and reproduced it with a single resource template.

I added some test code as follows (this commit):

diff --git a/pkg/controllers/status/rb_status_controller.go b/pkg/controllers/status/rb_status_controller.go
index 2f584da17..ecf34d634 100644
--- a/pkg/controllers/status/rb_status_controller.go
+++ b/pkg/controllers/status/rb_status_controller.go
@@ -3,8 +3,11 @@ package status
 import (
        "context"

+       "github.com/karmada-io/karmada/pkg/util/restmapper"
        apierrors "k8s.io/apimachinery/pkg/api/errors"
        "k8s.io/apimachinery/pkg/api/meta"
+       metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+       "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/tools/record"
@@ -108,6 +111,20 @@ func (c *RBStatusController) syncBindingStatus(binding *workv1alpha2.ResourceBin
                return err
        }

+       // Just for test --------begin
+       gvr, err := restmapper.GetGroupVersionResource(c.RESTMapper, schema.FromAPIVersionAndKind(binding.Spec.Resource.APIVersion, binding.Spec.Resource.Kind))
+       if err != nil {
+               klog.Errorf("[JUSTFORDEBUG]Failed to get GVR from GVK(%s/%s), Error: %v", binding.Spec.Resource.APIVersion, binding.Spec.Resource.Kind, err)
+               panic("should not happen: transform gvk to GVR")
+       }
+       resourceTemplateFromAPIServer, err := c.DynamicClient.Resource(gvr).Namespace(binding.Spec.Resource.Namespace).Get(context.TODO(), binding.Spec.Resource.Name, metav1.GetOptions{})
+       if err != nil {
+               panic("should not happen: retrieve resource template from Karmada API server")
+       }
+       klog.Infof("[JUSTFORDEBUG]: resourceVersion from informer: %s, from APIServer: %s", resourceTemplate.GetResourceVersion(), resourceTemplateFromAPIServer.GetResourceVersion())
+
+       // Just for test --------end
+
        err = helper.AggregateResourceBindingWorkStatus(c.Client, binding, resourceTemplate, c.EventRecorder)
        if err != nil {
                klog.Errorf("Failed to aggregate workStatues to resourceBinding(%s/%s), Error: %v",

Some logs:

I1108 01:29:40.252622       1 common.go:115] Update resource(apps/v1, Resource=deployments/default/nginx) status successfully.
I1108 01:29:40.256077       1 rb_status_controller.go:124] [JUSTFORDEBUG]: resourceVersion from informer: 155715, from APIServer: 155743
E1108 01:29:40.276787       1 common.go:112] Failed to update resource(apps/v1, Resource=deployments/default/nginx), Error: Operation cannot be fulfilled on deployments.apps "nginx": the object has been modified; please apply your changes to the latest version and try again
E1108 01:29:40.276879       1 controller.go:324] "Reconciler error" err="Operation cannot be fulfilled on deployments.apps \"nginx\": the object has been modified; please apply your changes to the latest version and try again" controller="resourceBinding_status_controller" controllerGroup="work.karmada.io" controllerKind="ResourceBinding" ResourceBinding="default/nginx-deployment" namespace="default" name="nginx-deployment" reconcileID=c30ad358-96f0-4e47-b553-a662bacf814f
I1108 01:29:40.282756       1 rb_status_controller.go:48] Reconciling ResourceBinding default/nginx-deployment.
I1108 01:29:40.285228       1 rb_status_controller.go:124] [JUSTFORDEBUG]: resourceVersion from informer: 155743, from APIServer: 155743
I1108 01:29:40.285468       1 workstatus.go:69] New aggregatedStatuses are equal with old resourceBinding(default/nginx-deployment) AggregatedStatus, no update required.
I1108 01:29:40.295281       1 common.go:115] Update resource(apps/v1, Resource=deployments/default/nginx) status successfully.
I1108 01:29:40.296115       1 detector.go:309] Ignore update event of object (kind=Deployment, default/nginx) as specification no change

I1108 01:29:40.256077       1 rb_status_controller.go:124] [JUSTFORDEBUG]: resourceVersion from informer: 155715, from APIServer: 155743
E1108 01:29:40.276787       1 common.go:112] Failed to update resource(apps/v1, Resource=deployments/default/nginx), Error: Operation cannot be fulfilled on deployments.apps "nginx": the object has been modified; please apply your changes to the latest version and try again

These logs clearly prove that the resource template obtained from the informer is outdated (it is 155715, but should be 155743), which directly leads to the update conflicts.

RainbowMango commented 11 months ago

I'm thinking maybe we have the following alternatives:

zach593 commented 10 months ago

Sorry for the late reply, I haven't checked GitHub recently...

I'm not sure if this approach could slow down performance in large-scale scenarios.

+1

My concern about this approach is that patch always attempts to update the whole .status part, which will drop any updates made by another person/system.

In theory, when a resource uses Karmada to distribute itself to member clusters, its .status field is managed by Karmada. It's a little hard to imagine another controller also reacting to changes in the resource template objects' specs such that both need to update the .status field.
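As an aside on that concern: if overwriting a concurrent writer ever became a real problem, a JSON patch can carry an RFC 6902 "test" op on the resourceVersion, so the API server rejects the patch when the object has changed since we read it. A hypothetical sketch, not part of #4094:

```go
package example

import (
	"context"
	"encoding/json"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// patchStatusGuarded only applies the status patch if the object's
// resourceVersion still matches the copy we observed, restoring optimistic
// concurrency on top of the patch-based approach.
func patchStatusGuarded(ctx context.Context, client dynamic.Interface,
	gvr schema.GroupVersionResource, observed *unstructured.Unstructured,
	newStatus map[string]interface{}) error {
	patch, err := json.Marshal([]map[string]interface{}{
		// The "test" op makes the API server reject the whole patch if someone
		// else has modified the object since we read it.
		{"op": "test", "path": "/metadata/resourceVersion", "value": observed.GetResourceVersion()},
		{"op": "replace", "path": "/status", "value": newStatus},
	})
	if err != nil {
		return err
	}
	_, err = client.Resource(gvr).Namespace(observed.GetNamespace()).Patch(
		ctx, observed.GetName(), types.JSONPatchType, patch, metav1.PatchOptions{}, "status")
	return err
}
```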

RainbowMango commented 10 months ago

So, you mean you'd prefer to go with #4094, right?

zach593 commented 10 months ago

Yes, it might be better.

zach593 commented 10 months ago

And the best approach is still to identify the root cause and solve it.