karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
https://karmada.io
Apache License 2.0

How to detect deviation from the baseline and alert based on it #4861

Closed: kutsyk closed this issue 1 day ago

kutsyk commented 1 week ago

Hi,

I would like some help understanding the best way to solve our case.

We have a set of resources propagated into N clusters.

What would be the best way to implement an alert if an object in one of the clusters was modified?

So we are alerted that that specific object, let's say a ConfigMap, was modified and no longer corresponds to the main configuration.

Thanks, Vasyl

XiShanYongYe-Chang commented 1 week ago

Hi @kutsyk, our current policy is that when a resource in a member cluster is modified, it is overwritten by the Karmada controller.

What kind of behavior do you expect?

kutsyk commented 1 week ago

Hi @XiShanYongYe-Chang ,

Rollback of the change by Karmada makes sense, but we need to identify the change that happened (if possible), with details: which object, in which namespace, and by whom.

As we have various security requirements, keeping a trail of changes is important.

Additionally, is it possible to do a manual update of an object and "pause" the Karmada rollback on that object?

XiShanYongYe-Chang commented 1 week ago

but we need to identify the change that happened (if possible), with details: which object, in which namespace, and by whom.

Is it possible to use an event to indicate that a resource has been modified?

However, who caused the change may not be known to the Karmada control plane. I understand that the member cluster needs to handle it. For example, if a resource is controlled by Karmada (we can judge this by the labels owned by Karmada), changes to that resource need to be reported as events.

Additionally, is it possible to do a manual update of an object and "pause" the Karmada rollback on that object?

Maybe you can prevent the "karmada rollback" by the retain operation.

kutsyk commented 1 week ago

Hi @XiShanYongYe-Chang , thanks for the prompt response.

Yeah, events can work, but they are not as straightforward as metrics would be, for example.

Since Karmada already knows that a propagated resource's state has been modified, I would expect it to record that in metrics. I've enabled monitoring in my setup and would like to explore what metrics Karmada exposes; is there a comprehensive list of the metrics exposed by Karmada?

The retain operation seems like something we need, but in my understanding it is currently implemented only for replicas, right?

Meaning I can't use it for other object types, such as ConfigMaps, for example.

XiShanYongYe-Chang commented 1 week ago

Hi @kutsyk. What do you want the metrics to look like? Would you mind giving us an example? By the way, do you have a workable solution? If so, you can make a proposal so we can discuss it in a PR faster.

The retain operation seems like something we need, but in my understanding it is currently implemented only for replicas, right?

The retain action works with any resource, including ConfigMaps. Maybe you can try it out and report back if you have problems; I'll see how I can help you.
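
For example, a rough, untested sketch of a ResourceInterpreterCustomization that keeps a ConfigMap's `data` as it is in the member cluster could look roughly like this (the name is illustrative; and as far as I know the target matches by kind, so the Lua script would need its own checks if you only want this behavior for specific ConfigMaps):

```
apiVersion: config.karmada.io/v1alpha1
kind: ResourceInterpreterCustomization
metadata:
  name: retain-configmap-data   # illustrative name
spec:
  target:
    apiVersion: v1
    kind: ConfigMap
  customizations:
    retention:
      luaScript: |
        -- keep whatever .data currently holds in the member cluster instead of
        -- overwriting it with the template from the Karmada control plane
        function Retain(desiredObj, observedObj)
          desiredObj.data = observedObj.data
          return desiredObj
        end
```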

kutsyk commented 1 week ago

Hi @XiShanYongYe-Chang ,

I don't have an example or workable solution with Karmada, but I can describe the use case:

  1. We have a Karmada cluster that manages N clusters and M components in these N clusters
  2. We need to be alerted if someone manually changes an object that is propagated as one of the M components

Note: I do get that Karmada will revert the change (propagate the version that is configured on the Karmada cluster), but if Karmada fails to propagate the change, or the cluster "refuses" to apply it, we need to be notified about this.

Q: How is it possible to implement this with Karmada?

Currently we are using the deprecated kubefed, and our implementation, in brief, is quite inefficient; it goes as follows:

  1. We gather events/logs in ES.
  2. We scan these events/logs and trigger alerts based on conditions.

I've reached the retain action section in the documentation; I'm going through it and testing it, and will give my feedback on whether it does what we need.

Thanks for your help, Vasyl

kutsyk commented 1 week ago

I have a few additional questions; maybe you can point me to the proper doc or resource, please.

Same setup: a Karmada cluster and N managed clusters, but let me try to describe our use case better.

In the Karmada cluster I have:

Deployment: `configmap-logger`

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: configmap-logger
  namespace: karmada
  labels:
    app: karmada
spec:
  replicas: 1
  selector:
    matchLabels:
      app: configmap-logger
  template:
    metadata:
      labels:
        app: configmap-logger
    spec:
      containers:
        - name: logger
          image: busybox
          securityContext:
            runAsUser: 65534
          command: ["/bin/sh", "-c"]
          args:
            - |
              while true; do
                sleep 10;
                cat /etc/config/config.json;
                cat /etc/config/example1;
              done
          volumeMounts:
            - name: config-volume
              mountPath: "/etc/config"
      volumes:
        - name: config-volume
          configMap:
            name: example-configmap
```
ConfigMap: `example-configmap`

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-configmap
  namespace: karmada
data:
  config.json: |
    {
      "key1": "value1",
      "key2": "value2"
    }
  example1: |
    a
  example2: |
    b
```
PropagationPolicy: `aws-propagation-policy`

```
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: aws-propagation-policy
  namespace: karmada
spec:
  propagateDeps: true
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: configmap-logger
    - apiVersion: v1
      kind: ConfigMap
      name: example-configmap
  placement:
    clusterAffinity:
      labelSelector:
        matchLabels:
          provider: aws
```
OverridePolicy: `example-configmap-override` and OverridePolicy: `example-args-override`:

```
apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
  name: example-configmap-override
  namespace: karmada
spec:
  resourceSelectors:
    - apiVersion: v1
      kind: ConfigMap
      name: example-configmap
  overrideRules:
    - targetCluster:
        clusterNames: [ "cluster-name" ]
      overriders:
        plaintext:
          - path: "/data/example1"
            operator: replace
            value: "this_is_override_value"
---
apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
  name: example-args-override
  namespace: karmada
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: configmap-logger
  overrideRules:
    - targetCluster:
        clusterNames: [ "cluster-name" ]
      overriders:
        commandOverrider:
          - containerName: logger
            operator: add
            value:
              - " cat /etc/config/example2;"
```

All these resources are propagated to cluster n1.

After I deleted example-args-override in the Karmada cluster, I can see that configmap-logger in the managed n1 cluster has not been updated and still contains the override from example-args-override.

Here is what I see in Karmada:

```
kak -n karmada get overridepolicies
NAME                         AGE
example-configmap-override   6d
```

Here is what I get in the managed cluster:

```
(⎈ |minikube:karmada-system)  ~/ kt1 -n karmada get deployment.apps/configmap-logger -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"labels":{"service-directory.installation":"kubernetes-dev-vkutsyk-cfe1c3a6","service-directory.persona":"b-karmada-karmada","service-directory.project":"karmada","service-directory.rollout":"random-string","service-directory.service":"karmada"},"name":"configmap-logger","namespace":"karmada"},"spec":{"replicas":1,"selector":{"matchLabels":{"app":"configmap-logger"}},"template":{"metadata":{"labels":{"app":"configmap-logger"}},"spec":{"containers":[{"args":["while true; do\n sleep 10;\n cat /etc/config/config.json;\n cat /etc/config/example1;\ndone\n"],"command":["/bin/sh","-c"],"image":"busybox","name":"logger","securityContext":{"runAsUser":65534},"volumeMounts":[{"mountPath":"/etc/config","name":"config-volume"}]}],"volumes":[{"configMap":{"name":"example-configmap"},"name":"config-volume"}]}}}}
    propagationpolicy.karmada.io/name: aws-ec2-propagation-policy
    propagationpolicy.karmada.io/namespace: karmada
    resourcebinding.karmada.io/name: configmap-logger-deployment
    resourcebinding.karmada.io/namespace: karmada
    resourcetemplate.karmada.io/managed-annotations: kubectl.kubernetes.io/last-applied-configuration,propagationpolicy.karmada.io/name,propagationpolicy.karmada.io/namespace,resourcebinding.karmada.io/name,resourcebinding.karmada.io/namespace,resourcetemplate.karmada.io/managed-annotations,resourcetemplate.karmada.io/managed-labels,resourcetemplate.karmada.io/uid,work.karmada.io/conflict-resolution,work.karmada.io/name,work.karmada.io/namespace
    resourcetemplate.karmada.io/managed-labels: karmada.io/managed,propagationpolicy.karmada.io/name,propagationpolicy.karmada.io/namespace,propagationpolicy.karmada.io/permanent-id,resourcebinding.karmada.io/permanent-id
    resourcetemplate.karmada.io/uid: 7be5c835-2f15-447a-89df-6291a09d425c
    work.karmada.io/conflict-resolution: abort
    work.karmada.io/name: configmap-logger-5dbf5577d8
    work.karmada.io/namespace: karmada-es-cluster-name
  creationTimestamp: "2024-04-19T10:00:57Z"
  generation: 7
  labels:
    karmada.io/managed: "true"
    propagationpolicy.karmada.io/name: aws-ec2-propagation-policy
    propagationpolicy.karmada.io/namespace: karmada
    propagationpolicy.karmada.io/permanent-id: 8e01b556-b96f-4458-852d-7c8a2d2f5d33
    resourcebinding.karmada.io/permanent-id: f188a07c-f483-49be-87ca-206c3d858d0b
    app: karmada
    work.karmada.io/permanent-id: 65f74c9e-715a-4c1b-8274-a512347d747d
  name: configmap-logger
  namespace: karmada
  resourceVersion: "761966834"
  uid: c05d9201-c362-4ba5-a04b-844bc7cecc3f
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: configmap-logger
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: configmap-logger
    spec:
      containers:
      - args:
        - |
          while true; do
            sleep 10;
            cat /etc/config/config.json;
            cat /etc/config/example1;
          done
        command:
        - /bin/sh
        - -c
        - '  cat /etc/config/example2;'
        image: busybox
        imagePullPolicy: Always
        name: logger
        resources: {}
        securityContext:
          runAsUser: 65534
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/config
          name: config-volume
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: example-configmap
        name: config-volume
status:
  conditions:
  - lastTransitionTime: "2024-04-25T12:17:15Z"
    lastUpdateTime: "2024-04-25T12:17:15Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2024-04-25T12:14:57Z"
    lastUpdateTime: "2024-04-25T12:28:55Z"
    message: ReplicaSet "configmap-logger-78c87f6785" is progressing.
    reason: ReplicaSetUpdated
    status: "True"
    type: Progressing
  observedGeneration: 7
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1
```

IMPORTANT: This should not be there, as it comes from the OverridePolicy that has already been deleted:

        - '  cat /etc/config/example2;'

My questions:

  1. What am I missing for this override to be removed from the managed cluster?
  2. What should I do before/after deleting the OverridePolicy so that it gets removed from the targeted clusters/objects?
  3. Is there a way to get notified that the object deployment.apps/configmap-logger in namespace karmada does not correspond to what it should be?

Regarding the retain operation and manual changes: I experimented with a ConfigMap. Basically, I manually modified the label karmada.io/managed: "true" –> karmada.io/managed: "false", and I was then able to modify its content without it being overridden. This is the behaviour I would like (meaning being able to have full control over a managed object).

But we get back to question #3 from the list above:

  1. Is there a way to get notified that a ConfigMap object in namespace karmada is not propagated/managed as it should be?

XiShanYongYe-Chang commented 1 week ago

One of my ideas is to listen for changes to resources in member clusters from karmada-controller-manager. When a resource change is detected, an event or metric will be generated. But I don't know what this metric should look like. Do you have any ideas?
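
For illustration only (the metric name below is hypothetical, nothing Karmada exposes today), if such a metric existed, an alert rule with the Prometheus Operator could look roughly like this:

```
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karmada-deviation-alerts   # illustrative
  namespace: monitoring
spec:
  groups:
    - name: karmada.deviation
      rules:
        - alert: PropagatedResourceDeviated
          # karmada_resource_deviation is a hypothetical metric: non-zero when the
          # resource in the member cluster differs from the Karmada resource template
          expr: karmada_resource_deviation > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: '{{ $labels.kind }} {{ $labels.namespace }}/{{ $labels.name }} in cluster {{ $labels.cluster }} deviates from the control plane template'
```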

Let me ask those guys for some ideas. /cc @RainbowMango @chaunceyjiang @whitewindmills

I have a few additional questions; maybe you can point me to the proper doc or resource, please.

Let me take a look later.

whitewindmills commented 1 week ago

karmada.io/managed: "true" –> karmada.io/managed: "false"

@kutsyk I think you should not modify this label but achieve this with the retain operation, because this label is a built-in label of Karmada. As for your questions 1 & 2: for now, after you delete your OverridePolicy, you need to make some change to the resource deployment.apps/configmap-logger to trigger an update. For example, just add a label. It's not very elegant anyway, so I think this is a problem that needs to be solved. cc @RainbowMango @XiShanYongYe-Chang @chaunceyjiang

When a resource change is detected, an event or metric will be generated.

@XiShanYongYe-Chang Yes, it's a good idea; the work-status-controller can do that.

XiShanYongYe-Chang commented 1 week ago

Hi @kutsyk~ As @whitewindmills said, if you change the label karmada.io/managed: "true" –> karmada.io/managed: "false", the resource will be completely controlled by the member cluster itself, and modifications of the resource on Karmada will no longer be synchronized to the cluster.

This modification effectively abandons all subsequent synchronization from the Karmada control plane.

However, if you still want modifications of the resource on the Karmada control plane to be synchronized to the cluster, the preceding modification method is not applicable.

The retain operation lets users customize which fields of a resource in the member cluster, after being modified there, are not overwritten by the resource template from the Karmada control plane.

You can see some examples of this in the default code implementation of Karmada: https://github.com/karmada-io/karmada/blob/master/pkg/resourceinterpreter/default/native/retain.go

  1. Is there a way to get notified that a ConfigMap object in namespace karmada is not propagated/managed as it should be?

There is no feature that directly provides this capability yet. I think we can discuss designing and implementing it.

kutsyk commented 3 days ago

@whitewindmills , updating deployment.apps/configmap-logger didn't help bring the object back to the correct state.

I had to manually restart the karmada-apiserver pod for this to work, which is really strange. So if I delete an override policy and it doesn't take effect, how can I be sure that my target objects are in the desired state?

How can I use override policies if I'm not sure whether they have been updated/removed?

Let me try repeating these steps:

  1. Add an override policy to the Deployment
  2. Deploy it and check that it works
  3. Remove the override policy

The expected outcome is that my deployment after step 3 is the same as it was before step 1.

I'll try it and give you feedback.

kutsyk commented 3 days ago

Okay, this is just strange as it doesn't do what it should at all.

Here is my propagation policy:

```
(⎈ |minikube:karmada-system)  ~/projects/ kak -n karmada get propagationpolicy aws-ec2-propagation-policy -o yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"policy.karmada.io/v1alpha1","kind":"PropagationPolicy","metadata":{"annotations":{},"name":"aws-ec2-propagation-policy","namespace":"karmada"},"spec":{"placement":{"clusterAffinity":{"labelSelector":{"matchLabels":{"provider":"aws-ec2"}}}},"propagateDeps":true,"resourceSelectors":[{"apiVersion":"apps/v1","kind":"Deployment","name":"configmap-logger"},{"apiVersion":"v1","kind":"ConfigMap","name":"example-configmap"}]}}
  creationTimestamp: "2024-04-19T10:00:57Z"
  generation: 4
  labels:
    propagationpolicy.karmada.io/permanent-id: 8e01b556-b96f-4458-852d-7c8a2d2f5d33
  name: aws-ec2-propagation-policy
  namespace: karmada
  resourceVersion: "629947"
  uid: 2dfd4d45-f774-4be2-bb35-38cf39d0c9cd
spec:
  conflictResolution: Abort
  placement:
    clusterAffinity:
      labelSelector:
        matchLabels:
          provider: aws-eks
    clusterTolerations:
    - effect: NoExecute
      key: cluster.karmada.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: cluster.karmada.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  preemption: Never
  priority: 0
  propagateDeps: true
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: configmap-logger
    namespace: karmada
  - apiVersion: v1
    kind: ConfigMap
    name: example-configmap
    namespace: karmada
  schedulerName: default-scheduler
```

It clearly states that the Cluster should have the label provider: aws-eks.

Here is my Cluster object:

```
(⎈ |minikube:karmada-system)  ~/projects/ kak -n karmada get cluster my_cluster_name -o yaml
apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  creationTimestamp: "2024-04-17T11:43:07Z"
  finalizers:
  - karmada.io/cluster-controller
  generation: 41
  labels:
    env: prod
    network_environment: prod
    provider: aws-ec2
    region: bk-eu-west6
    type: testing
  name: my_cluster_name
  resourceVersion: "630235"
  uid: 89fbdb5b-e644-48ad-92c5-fc2775f8f4e7
spec:
...
```

As you can see, its label value is provider: aws-ec2, so the propagation policy should not create any objects.

Here is what I see in the cluster:

```
(⎈ |minikube:karmada-system)  ~/projects/ kt1 -n karmada get all
NAME                                    READY   STATUS             RESTARTS      AGE
pod/configmap-logger-78c87f6785-pvwh9   0/1     CrashLoopBackOff   5 (64s ago)   4m

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/configmap-logger   0/1     1            0           4m1s

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/configmap-logger-78c87f6785   1         1         0       4m1s
```

```
(⎈ |minikube:karmada-system)  ~/projects/ kt1 -n karmada get deployment.apps/configmap-logger -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    propagationpolicy.karmada.io/name: aws-ec2-propagation-policy
    propagationpolicy.karmada.io/namespace: karmada
    resourcebinding.karmada.io/name: configmap-logger-deployment
    resourcebinding.karmada.io/namespace: karmada
    resourcetemplate.karmada.io/managed-annotations: kubectl.kubernetes.io/last-applied-configuration,propagationpolicy.karmada.io/name,propagationpolicy.karmada.io/namespace,resourcebinding.karmada.io/name,resourcebinding.karmada.io/namespace,resourcetemplate.karmada.io/managed-annotations,resourcetemplate.karmada.io/managed-labels,resourcetemplate.karmada.io/uid,work.karmada.io/conflict-resolution,work.karmada.io/name,work.karmada.io/namespace
    resourcetemplate.karmada.io/managed-labels: karmada.io/managed,propagationpolicy.karmada.io/name,propagationpolicy.karmada.io/namespace,propagationpolicy.karmada.io/permanent-id,resourcebinding.karmada.io/permanent-id
    resourcetemplate.karmada.io/uid: 7be5c835-2f15-447a-89df-6291a09d425c
    work.karmada.io/conflict-resolution: abort
    work.karmada.io/name: configmap-logger-5dbf5577d8
    work.karmada.io/namespace: karmada-es-bplatform-t1-app-az1-bk-eu-west6-prod
  creationTimestamp: "2024-04-30T14:36:35Z"
  generation: 1
  labels:
    karmada.io/managed: "true"
    propagationpolicy.karmada.io/name: aws-ec2-propagation-policy
    propagationpolicy.karmada.io/namespace: karmada
    propagationpolicy.karmada.io/permanent-id: 8e01b556-b96f-4458-852d-7c8a2d2f5d33
    resourcebinding.karmada.io/permanent-id: f188a07c-f483-49be-87ca-206c3d858d0b
```

Here is what karmadactl shows:

```
(⎈ |minikube:karmada-system)  ~/projects/ kactl get all
NAME                                    CLUSTER                                 READY   STATUS             RESTARTS        AGE
pod/configmap-logger-78c87f6785-pvwh9   bplatform-t1-app-az1-bk-eu-west6-prod   0/1     CrashLoopBackOff   5 (2m12s ago)   5m8s

NAME                               CLUSTER                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/configmap-logger   bplatform-t1-app-az1-bk-eu-west6-prod   0/1     1            0           5m8s

NAME                                          CLUSTER                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/configmap-logger-78c87f6785   bplatform-t1-app-az1-bk-eu-west6-prod   1         1         0       5m8s
```

Also, somehow the deployment manifest in the target cluster is wrong and still contains the override that I already deleted.

```
(⎈ |minikube:karmada-system)  ~/projects/ kactl describe deployment.apps/configmap-logger --cluster bplatform-t1-app-az1-bk-eu-west6-prod
Name:                   configmap-logger
Namespace:              karmada
CreationTimestamp:      Tue, 30 Apr 2024 16:36:35 +0200
Labels:                 karmada.io/managed=true
                        propagationpolicy.karmada.io/name=aws-ec2-propagation-policy
                        propagationpolicy.karmada.io/namespace=karmada
                        propagationpolicy.karmada.io/permanent-id=8e01b556-b96f-4458-852d-7c8a2d2f5d33
                        resourcebinding.karmada.io/permanent-id=f188a07c-f483-49be-87ca-206c3d858d0b
                        service-directory.installation=kubernetes-dev-vkutsyk-cfe1c3a6
                        service-directory.persona=b-karmada-karmada
                        service-directory.project=karmada
                        service-directory.rollout=random-string
                        service-directory.service=karmada
Annotations:            deployment.kubernetes.io/revision: 1
                        propagationpolicy.karmada.io/name: aws-ec2-propagation-policy
                        propagationpolicy.karmada.io/namespace: karmada
                        resourcebinding.karmada.io/name: configmap-logger-deployment
                        resourcebinding.karmada.io/namespace: karmada
                        resourcetemplate.karmada.io/managed-annotations:
                          kubectl.kubernetes.io/last-applied-configuration,propagationpolicy.karmada.io/name,propagationpolicy.karmada.io/namespace,resourcebinding....
                        resourcetemplate.karmada.io/managed-labels:
                          karmada.io/managed,propagationpolicy.karmada.io/name,propagationpolicy.karmada.io/namespace,propagationpolicy.karmada.io/permanent-id,reso...
                        resourcetemplate.karmada.io/uid: 7be5c835-2f15-447a-89df-6291a09d425c
                        work.karmada.io/conflict-resolution: abort
                        work.karmada.io/name: configmap-logger-5dbf5577d8
                        work.karmada.io/namespace: karmada-es-bplatform-t1-app-az1-bk-eu-west6-prod
Selector:               app=configmap-logger
Replicas:               1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=configmap-logger
  Containers:
   logger:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
      -c
        cat /etc/config/example2;
    Args:
      while true; do
        sleep 10;
        cat /etc/config/config.json;
        cat /etc/config/example1;
      done

    Environment:  <none>
    Mounts:
      /etc/config from config-volume (rw)
  Volumes:
   config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      example-configmap
    Optional:  false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      False   MinimumReplicasUnavailable
  Progressing    True    ReplicaSetUpdated
OldReplicaSets:  <none>
NewReplicaSet:   configmap-logger-78c87f6785 (1/1 replicas created)
Events:
  Type    Reason             Age    From                   Message
  ----    ------             ----   ----                   -------
  Normal  ScalingReplicaSet  6m19s  deployment-controller  Scaled up replica set configmap-logger-78c87f6785 to 1
```

What is going on and how do I debug what is happening?

I don't want to delete everything and recreate it from scratch, as that's not the most efficient thing to do.

kutsyk commented 3 days ago

Deletion and recreation from scratch resolved all the issues and fixed the propagation.

kutsyk commented 1 day ago

Closing this issue as I have no further questions about this. I moved the monitoring topic into a different issue for a clear trail of discussion: https://github.com/karmada-io/karmada/issues/4895