karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
https://karmada.io
Apache License 2.0

fix: `MultiClusterService` placement change triggering recreation of resources #4712

Closed · a7i closed this 2 weeks ago

a7i commented 1 month ago

What type of PR is this?

/kind bug

What this PR does / why we need it: Karmada `MultiClusterService` uses an unsorted list of clusters, which can change order over time (a set has no order guarantee). This should normally be fine, but karmada-scheduler does a `reflect.DeepEqual` here rather than comparing the content of the list.

My first reaction was to change karmada-scheduler not to do this, but perhaps the order of clusters is significant elsewhere. For a MultiClusterService provider/consumer list, though, the order has no significance as long as the cluster is available.
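To illustrate the mismatch, here is a minimal, self-contained Go sketch (the function and variable names are made up and this is not the actual karmada-scheduler or multiclusterservice-controller code): `reflect.DeepEqual` on two slices built from a set is order-sensitive, so the same set of clusters can look like a changed placement, while sorting before comparing (or before persisting the placement, which is what this PR does) keeps the result stable.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// sortedCopy returns a sorted copy of the input so the comparison below no
// longer depends on the (unspecified) iteration order of the set the slice
// was built from.
func sortedCopy(in []string) []string {
	out := append([]string(nil), in...)
	sort.Strings(out)
	return out
}

func main() {
	// The same set of member clusters, emitted in a different order on two reconciles.
	previous := []string{"member1", "member2"}
	current := []string{"member2", "member1"}

	// Order-sensitive: reports false, so the scheduler thinks the placement changed.
	fmt.Println(reflect.DeepEqual(previous, current))

	// Order-insensitive via sorting first: reports true, so no rescheduling is needed.
	fmt.Println(reflect.DeepEqual(sortedCopy(previous), sortedCopy(current)))
}
```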

Currently we get into a loop of Service updates and Endpoint deletion/recreation because karmada-scheduler keeps rescheduling, as it thinks the placement has changed:

```
1 scheduler.go:348] Start to schedule ResourceBinding(test/istio-mcs-test-service) as placement changed
22:12:23.207608       1 scheduler.go:348] Start to schedule ResourceBinding(test/istio-mcs-test-service) as placement changed
22:12:23.221188       1 event.go:307] "Event occurred" object="test/istio-mcs-test-service" fieldPath="" kind="ResourceBinding" apiVersion="work.karmada.io/v1alpha2" type="Normal" reason="ScheduleBindingSucceed" message="Binding has been scheduled successfully."
22:12:23.221337       1 event.go:307] "Event occurred" object="test/istio-mcs-test" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="ScheduleBindingSucceed" message="Binding has been scheduled successfully."
22:12:43.696352       1 scheduler.go:348] Start to schedule ResourceBinding(test/istio-mcs-test-service) as placement changed
22:12:43.707379       1 event.go:307] "Event occurred" object="test/istio-mcs-test-service" fieldPath="" kind="ResourceBinding" apiVersion="work.karmada.io/v1alpha2" type="Normal" reason="ScheduleBindingSucceed" message="Binding has been scheduled successfully."
22:12:43.707407       1 event.go:307] "Event occurred" object="test/istio-mcs-test" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="ScheduleBindingSucceed" message="Binding has been scheduled successfully."
22:12:53.628098       1 scheduler.go:348] Start to schedule ResourceBinding(test/istio-mcs-test-service) as placement changed
22:12:53.637153       1 event.go:307] "Event occurred" object="test/istio-mcs-test-service" fieldPath="" kind="ResourceBinding" apiVersion="work.karmada.io/v1alpha2" type="Normal" reason="ScheduleBindingSucceed" message="Binding has been scheduled successfully."
22:12:53.637178       1 event.go:307] "Event occurred" object="test/istio-mcs-test" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="ScheduleBindingSucceed" message="Binding has been scheduled successfully."
22:12:54.091134       1 scheduler.go:348] Start to schedule ResourceBinding(test/istio-mcs-test-service) as placement changed
22:12:54.103214       1 event.go:307] "Event occurred" object="test/istio-mcs-test-service" fieldPath="" kind="ResourceBinding" apiVersion="work.karmada.io/v1alpha2" type="Normal" reason="ScheduleBindingSucceed" message="Binding has been scheduled successfully."
22:12:54.103250       1 event.go:307] "Event occurred" object="test/istio-mcs-test" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="ScheduleBindingSucceed" message="Binding has been scheduled successfully."
22:13:35.800578       1 scheduler.go:348] Start to schedule ResourceBinding(test/istio-mcs-test-service) as placement changed
```

Which issue(s) this PR fixes: Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

`karmada-controller-manager`: Fixed the issue that `multiclusterservice-controller` generated an unordered placement, leading to unnecessary rescheduling.

codecov-commenter commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 51.68%. Comparing base (b5045c5) to head (bf4ec8a). Report is 17 commits behind head on master.


Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #4712      +/-   ##
==========================================
+ Coverage   51.40%   51.68%   +0.27%
==========================================
  Files         250      247       -3
  Lines       24979    24777     -202
==========================================
- Hits        12840    12805      -35
+ Misses      11433    11266     -167
  Partials      706      706
```

| [Flag](https://app.codecov.io/gh/karmada-io/karmada/pull/4712/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=karmada-io) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/karmada-io/karmada/pull/4712/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=karmada-io) | `51.68% <ø> (+0.27%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=karmada-io#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.

RainbowMango commented 1 month ago

/lgtm
/assign @XiShanYongYe-Chang

@a7i nice finding! Thank you!

> Currently we get into a loop of Service updates and Endpoint deletion/recreation because karmada-scheduler keeps rescheduling, as it thinks the placement has changed:

I understand this patch can avoid unnecessary rescheduling, but I don't get why this causes the endpoint deletion. In my opinion, the scheduler result should remain consistent between each loop of re-scheduling. @jwcesign Can you help to confirm that?

RainbowMango commented 1 month ago

> This should normally be fine, but karmada-scheduler does a `reflect.DeepEqual` here rather than comparing the content of the list.

By the way, the `reflect.DeepEqual` here is a shortcut that checks the cluster affinity at https://github.com/karmada-io/karmada/blob/2bebae0701c3b6c21a08230d19c02b4d0e84690d/pkg/scheduler/helper.go#L56, but it still cannot compare the content of the list independently of its order.
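As an aside, an order-insensitive check of "the content of the list" could look like the following. This is a hypothetical sketch only; `clusterNamesEqual` is not an existing helper in the scheduler code:

```go
// clusterNamesEqual reports whether two cluster-name lists contain the same
// names regardless of order, allowing for duplicates. Purely illustrative.
func clusterNamesEqual(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	counts := make(map[string]int, len(a))
	for _, name := range a {
		counts[name]++
	}
	for _, name := range b {
		counts[name]--
		if counts[name] < 0 {
			return false
		}
	}
	return true
}
```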

XiShanYongYe-Chang commented 1 month ago

Let me take a look.

a7i commented 1 month ago

Perhaps it's the way we're using MultiClusterService: we want bidirectional network connectivity, so both clusters appear as both consumers and providers.

For example:

```yaml
---
apiVersion: networking.karmada.io/v1alpha1
kind: MultiClusterService
metadata:
  name: istio-mcs-test
spec:
  types:
    - CrossCluster
  serviceProvisionClusters:
    - member1
    - member2
  serviceConsumptionClusters:
    - member1
    - member2
```

What we observed is that the Service is created in member2, then deleted, then created again, which causes the member cluster's kube-controller-manager (endpoints controller) to delete and recreate the Endpoints. (Please assume that the masked cluster name is member2.)

[Screenshot attached: 2024-03-14 at 10:26 PM]

We will be debugging again tomorrow to identify any other issues with our setup or code.

XiShanYongYe-Chang commented 1 month ago

I understand that this patch saves some rescheduling. ~It's a cleanup, not a bug fix. What do you think?~ @a7i

RainbowMango commented 1 month ago

I think this is a bug: it brings unnecessary rescheduling, which would confuse users.

RainbowMango commented 1 month ago

And we should backport this to previous releases as well.

XiShanYongYe-Chang commented 1 month ago

/kind bug

karmada-bot commented 1 month ago

New changes are detected. LGTM label has been removed.

karmada-bot commented 1 month ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: To complete the pull request process, please ask for approval from rainbowmango after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[pkg/controllers/OWNERS](https://github.com/karmada-io/karmada/blob/master/pkg/controllers/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
a7i commented 1 month ago

unfortunately, this did not solve the issue so still investigating

XiShanYongYe-Chang commented 1 month ago

> unfortunately, this did not solve the issue so still investigating

I'm getting involved too.

XiShanYongYe-Chang commented 1 month ago

> unfortunately, this did not solve the issue so still investigating

Hi @a7i, any progress?

Or is there a guide I can follow so that I can reproduce it exactly?

a7i commented 1 month ago

> > unfortunately, this did not solve the issue so still investigating
>
> Hi @a7i, any progress?
>
> Or is there a guide I can follow so that I can reproduce it exactly?

Hi @XiShanYongYe-Chang, my colleague @SerenaTiede-Zen and I will link up this week and provide the steps.

XiShanYongYe-Chang commented 3 weeks ago

Hi @SerenaTiede-Zen, if the related issue has not been reproduced, can we move forward with this PR based on the current code? The current revision is also a meaningful one.

XiShanYongYe-Chang commented 3 weeks ago

> Hi @SerenaTiede-Zen, if the related issue has not been reproduced, can we move forward with this PR based on the current code? The current revision is also a meaningful one.

/cc @a7i

karmada-bot commented 3 weeks ago

@XiShanYongYe-Chang: GitHub didn't allow me to request PR reviews from the following users: a7i.

Note that only karmada-io members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to [this](https://github.com/karmada-io/karmada/pull/4712#issuecomment-2051308740):

> > Hi @SerenaTiede-Zen, if the related issue has not been reproduced, can we move forward with this PR based on the current code? The current revision is also a meaningful one.
>
> /cc @a7i

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
a7i commented 2 weeks ago

Hi @XiShanYongYe-Chang, I explained our setup a bit in this issue.

@SerenaTiede-Zen and I paired on this today and the issue seems to be resolved on the latest code base. Our assumption is this change resolved it: https://github.com/karmada-io/karmada/pull/4818

XiShanYongYe-Chang commented 2 weeks ago

> Our assumption is this change resolved it: #4818

Glad to hear the problem has been resolved, although I still don't understand what the problem was.

a7i commented 2 weeks ago

I will do my best to debug further next week and do a write-up for your review.