karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
https://karmada.io
Apache License 2.0

Drift detection and Continuous Reconciliation #4895

Open kutsyk opened 2 weeks ago

kutsyk commented 2 weeks ago

I'm trying to figure out how Karmada solves the problems of drift detection and continuous reconciliation.

It is not clear to me how to do the following:

  1. How can I check whether all propagation policies and override policies have been rolled out and applied correctly?
  2. How can I detect that one of the managed clusters failed to apply a propagation/override policy?
  3. Which metrics can I use to monitor whether all propagated components work in all managed clusters?

I believe these three questions are nothing new, and most people who use an orchestration tool should already have answered them, but I can't get my head around this with Karmada.

Thanks in advance for the help.

XiShanYongYe-Chang commented 1 week ago

How can I check whether all propagation policies and override policies have been rolled out and applied correctly?

PP and OP need to be treated separately. For a PP, you can check the status of the FullyApplied condition on the ResourceBinding. For an OP, you have to check manually whether the overridden configuration has taken effect in the Work resource. In general, the current solution is still lacking for automated checks, and status checks cannot be performed directly through the API.
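A minimal sketch of the PP check described above, assuming the ResourceBinding has been fetched as JSON (for example with `kubectl get resourcebinding <name> -o json` against the Karmada API server) and parsed into a dict. The helper names and the sample object are hypothetical, and the field names (`status.conditions`, `status.aggregatedStatus`, `applied`, `appliedMessage`) should be verified against your Karmada version.

```python
def get_condition(obj, cond_type):
    """Return the condition dict of the given type from obj.status.conditions, or None."""
    for cond in (obj.get("status") or {}).get("conditions") or []:
        if cond.get("type") == cond_type:
            return cond
    return None


def is_fully_applied(binding):
    """True only when the FullyApplied condition exists and its status is "True"."""
    cond = get_condition(binding, "FullyApplied")
    return cond is not None and cond.get("status") == "True"


def unapplied_clusters(binding):
    """Map cluster name -> appliedMessage for every cluster not reporting applied=true."""
    failed = {}
    for item in (binding.get("status") or {}).get("aggregatedStatus") or []:
        if not item.get("applied", False):
            failed[item.get("clusterName", "<unknown>")] = item.get("appliedMessage", "")
    return failed


# Hypothetical, trimmed ResourceBinding used only for illustration.
sample_binding = {
    "kind": "ResourceBinding",
    "metadata": {"namespace": "default", "name": "nginx-deployment"},
    "status": {
        "conditions": [
            {"type": "Scheduled", "status": "True"},
            {"type": "FullyApplied", "status": "False"},
        ],
        "aggregatedStatus": [
            {"clusterName": "member1", "applied": True, "health": "Healthy"},
            {"clusterName": "member2", "applied": False,
             "appliedMessage": "namespace not found"},
        ],
    },
}

print(is_fully_applied(sample_binding))
print(unapplied_clusters(sample_binding))
```

The per-cluster map is probably the closest thing to "which clusters failed and why" that can be derived from the binding alone; it says nothing about whether an OP was the cause.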

How can I detect that one of the managed clusters failed to apply a propagation/override policy?

For PP and OP, the relationship with a resource is simply matched or not matched. When a match succeeds, the PP is reflected in the ResourceBinding and the OP in the Work resource.

The resource template in the Work is then synchronized to the member cluster. You can view the synchronization result in the Work status.
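As a rough illustration of reading the synchronization result from the Work status, the sketch below inspects the `Applied` condition of Work objects that were exported as JSON and parsed into dicts. It assumes that each Work lives in an execution namespace named `karmada-es-<cluster>`; the function names and sample data are hypothetical.

```python
EXECUTION_NS_PREFIX = "karmada-es-"  # assumed execution-namespace prefix


def member_cluster_of(work):
    """Derive the member cluster name from the Work's namespace, or None."""
    ns = (work.get("metadata") or {}).get("namespace", "")
    if ns.startswith(EXECUTION_NS_PREFIX):
        return ns[len(EXECUTION_NS_PREFIX):]
    return None


def sync_result(work):
    """Return (cluster, applied, message); applied is None if no Applied condition exists."""
    cluster = member_cluster_of(work)
    for cond in (work.get("status") or {}).get("conditions") or []:
        if cond.get("type") == "Applied":
            return cluster, cond.get("status") == "True", cond.get("message", "")
    return cluster, None, ""


def failed_syncs(works):
    """List (cluster, message) for every Work whose Applied condition is not "True"."""
    return [(c, m) for c, ok, m in map(sync_result, works) if ok is False]


# Hypothetical, trimmed Work objects used only for illustration.
sample_works = [
    {"metadata": {"namespace": "karmada-es-member1", "name": "nginx"},
     "status": {"conditions": [{"type": "Applied", "status": "True"}]}},
    {"metadata": {"namespace": "karmada-es-member2", "name": "nginx"},
     "status": {"conditions": [{"type": "Applied", "status": "False",
                                "message": "failed to apply manifest"}]}},
]

print(failed_syncs(sample_works))
```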

Which metrics can I use to monitor whether all propagated components work in all managed clusters?

You can check the value of the health field in the Work status:

https://github.com/karmada-io/karmada/blob/5e1191ffd774d969b293d1622f879062279720ed/pkg/apis/work/v1alpha1/work_types.go#L107

Users can define this value with the custom interpreter operation InterpretHealth.
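A small sketch of how the health field could be read in practice, assuming a Work object parsed from JSON whose `status.manifestStatuses` entries each carry an `identifier` and a `health` value such as `Healthy`, `Unhealthy`, or `Unknown`. The function name and sample object are hypothetical.

```python
def unhealthy_manifests(work):
    """Return (kind, namespace, name, health) for every manifest not reported Healthy."""
    found = []
    for entry in (work.get("status") or {}).get("manifestStatuses") or []:
        health = entry.get("health", "Unknown")  # a missing field is treated as Unknown
        if health != "Healthy":
            ident = entry.get("identifier") or {}
            found.append((ident.get("kind"), ident.get("namespace"),
                          ident.get("name"), health))
    return found


# Hypothetical, trimmed Work object used only for illustration.
sample_work = {
    "metadata": {"namespace": "karmada-es-member1", "name": "nginx"},
    "status": {
        "manifestStatuses": [
            {"identifier": {"kind": "Deployment", "namespace": "default", "name": "nginx"},
             "health": "Unhealthy"},
            {"identifier": {"kind": "Service", "namespace": "default", "name": "nginx"},
             "health": "Healthy"},
        ],
    },
}

print(unhealthy_manifests(sample_work))
```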

RainbowMango commented 1 week ago

Just out of curiosity, are you evaluating Karmada? Do you have a schedule or something? We can see how to support you better.

kutsyk commented 1 week ago

Hi @XiShanYongYe-Chang, thanks for the clarification.

@RainbowMango, yes, I'm evaluating the tool against a set of criteria to better understand how it works and whether we should use it. We are at the end of our schedule, and the last points I need to understand are those described in my initial question.

It seems there is a way to monitor statuses, but only through objects and the values in their fields. No data is exposed as metrics; do I understand this correctly?

RainbowMango commented 1 week ago

We have some metrics exposed by the /metrics endpoint, but they do not include the items you want. I would say that metrics can be added at any time when there is a need.

kutsyk commented 1 week ago

@RainbowMango, @XiShanYongYe-Chang, do you have any idea how to identify whether an OverridePolicy failed and how to find the reasons for it?

Also, what happens if an override policy that should be applied to 5 clusters fails on 2? How can I identify those clusters and the reason for the error?

Thanks

XiShanYongYe-Chang commented 1 week ago

Hi @kutsyk, do you mean that the OP apply failure is caused by a problem in how the OP was written, or by a failure of synchronization to the member cluster?

kutsyk commented 1 week ago

Hi,

The failure of synchronisation to the member cluster

Kutsyk Vasyl

XiShanYongYe-Chang commented 1 week ago

In my opinion, if resource synchronization fails, it is not easy to determine whether the failure was caused by the OP. In other words, the two are decoupled: the OP is applied to the Work resource, and resource synchronization is a further operation on that Work resource. Can you give an example of a failure caused by the OP?