karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
https://karmada.io
Apache License 2.0

ResourceBinding displays abnormally after being evicted multiple times #4480

Open huangyutongs opened 9 months ago

huangyutongs commented 9 months ago

Please provide an in-depth description of the question you have: When I release a new version, due to unexpected reasons, the new Pods in the member1 cluster cannot become ready within PropagationPolicy.spec.failover.application.decisionConditions.tolerationSeconds (120 seconds), which triggers a failover that moves the workload to the member2 cluster. The new Pods in the member2 cluster also fail to become ready within 120 seconds. The ResourceBinding then keeps showing the message: '0/2 clusters are available: 2 cluster(s) is in the process of eviction.' I am not sure whether I understand the PropagationPolicy.spec.failover.application.decisionConditions.tolerationSeconds field correctly. I do see Pods being deleted and recreated repeatedly in both clusters, after which they stabilize, but the ResourceBinding status is still not right. Here is my PropagationPolicy:

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  labels:
    argocd.argoproj.io/instance: rps-spring-boot-idc-hyper-sit-a
    propagationpolicy.karmada.io/permanent-id: d7d225f1-63ae-4885-88dd-53a91db03699
  name: rps-spring-boot
  namespace: scmp-a
spec:
  conflictResolution: Overwrite
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 120
      gracePeriodSeconds: 120
      purgeMode: Graciously
  placement:
    clusterTolerations:
      - effect: NoExecute
        key: cluster.karmada.io/not-ready
        operator: Exists
        tolerationSeconds: 120
      - effect: NoExecute
        key: cluster.karmada.io/unreachable
        operator: Exists
        tolerationSeconds: 120
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided
      weightPreference:
        staticWeightList:
          - targetCluster:
              clusterNames:
                - idc-hyper-sit-1
            weight: 1
          - targetCluster:
              clusterNames:
                - idc-hyper-sit-2
            weight: 1
  preemption: Never
  priority: 0
  propagateDeps: true
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      labelSelector:
        matchLabels:
          app.kubernetes.io/instance: rps-spring-boot
      namespace: scmp-a
    - apiVersion: v1
      kind: Service
      labelSelector:
        matchLabels:
          app.kubernetes.io/instance: rps-spring-boot
      namespace: scmp-a
    - apiVersion: apisix.apache.org/v2
      kind: ApisixRoute
      labelSelector:
        matchLabels:
          app.kubernetes.io/instance: rps-spring-boot
      namespace: scmp-a
  schedulerName: default-scheduler
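
As a side note, the documented semantics of these failover fields can be checked against the CRD schema itself (a sketch, assuming the Karmada CRDs are installed with their published OpenAPI schema and that kubectl points at the Karmada control plane; the field path mirrors the policy above):

# Print the schema documentation for the failover decision conditions
kubectl explain propagationpolicy.spec.failover.application.decisionConditions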

The ResourceBinding's SCHEDULED column shows False (screenshots attached).
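
For reference, the stuck scheduling condition and any pending eviction tasks can be read directly from the ResourceBinding (a sketch, assuming kubectl points at the Karmada control plane; the binding name below is a guess based on the policy name and the usual <resource-name>-<kind> pattern, so substitute the real name shown by kubectl get resourcebinding -n scmp-a):

# Message behind the SCHEDULED=False condition
kubectl get resourcebinding rps-spring-boot-deployment -n scmp-a \
  -o jsonpath='{.status.conditions[?(@.type=="Scheduled")].message}'
# Clusters still being evicted are listed under spec.gracefulEvictionTasks
kubectl get resourcebinding rps-spring-boot-deployment -n scmp-a \
  -o jsonpath='{.spec.gracefulEvictionTasks}'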

I deployed Karmada using the Helm chart v1.8.1; values.yaml has the following components enabled: components: [ "schedulerEstimator", "descheduler", "search" ]

The Work resources are normal (screenshot attached).

I have two clusters (screenshot attached).

The Karmada pods are running normally (screenshot attached).

karmada-scheduler is started with --enable-scheduler-estimator=true.

I noticed that karmada-descheduler keeps logging the following errors, while the other Karmada components show no obvious errors in their logs: kubectl logs -f --tail 30 -n karmada-system -l app=karmada-descheduler

I1227 07:02:47.962519       1 reflector.go:788] pkg/generated/informers/externalversions/factory.go:122: Watch close - *v1alpha2.ResourceBinding total 10 items received
E1227 07:02:50.218092       1 cache.go:113] Failed to dial cluster(idc-hyper-sit-2): dial karmada-scheduler-estimator-idc-hyper-sit-2:10352 error: context deadline exceeded.
I1227 07:02:50.221640       1 cache.go:110] Start dialing estimator server(karmada-scheduler-estimator-idc-hyper-sit-2:10352) of cluster(idc-hyper-sit-2).
E1227 07:02:55.222453       1 cache.go:113] Failed to dial cluster(idc-hyper-sit-2): dial karmada-scheduler-estimator-idc-hyper-sit-2:10352 error: context deadline exceeded.
I1227 07:03:05.291859       1 descheduler.go:245] Receiving update event for cluster idc-hyper-sit-2
I1227 07:03:05.294743       1 cache.go:110] Start dialing estimator server(karmada-scheduler-estimator-idc-hyper-sit-2:10352) of cluster(idc-hyper-sit-2).
I1227 07:03:05.340887       1 descheduler.go:245] Receiving update event for cluster idc-hyper-sit-2
E1227 07:03:10.297026       1 cache.go:113] Failed to dial cluster(idc-hyper-sit-2): dial karmada-scheduler-estimator-idc-hyper-sit-2:10352 error: context deadline exceeded.
I1227 07:03:10.299812       1 cache.go:110] Start dialing estimator server(karmada-scheduler-estimator-idc-hyper-sit-2:10352) of cluster(idc-hyper-sit-2).
E1227 07:03:15.300242       1 cache.go:113] Failed to dial cluster(idc-hyper-sit-2): dial karmada-scheduler-estimator-idc-hyper-sit-2:10352 error: context deadline exceeded.
I1227 07:03:18.247173       1 cache.go:110] Start dialing estimator server(karmada-scheduler-estimator-idc-hyper-sit-1:10352) of cluster(idc-hyper-sit-1).
E1227 07:03:23.248047       1 cache.go:113] Failed to dial cluster(idc-hyper-sit-1): dial karmada-scheduler-estimator-idc-hyper-sit-1:10352 error: context deadline exceeded.
I1227 07:03:56.878909       1 reflector.go:788] pkg/generated/informers/externalversions/factory.go:122: Watch close - *v1alpha1.Cluster total 32 items received
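
The repeated 'context deadline exceeded' errors above suggest the descheduler cannot reach the scheduler-estimator endpoints on port 10352. A quick sanity check (a sketch, run against the host cluster where the Karmada components are installed; the service names are taken from the log above):

# Are the estimator pods and services present?
kubectl -n karmada-system get pods,svc | grep scheduler-estimator
# Do the services actually have ready endpoints behind them?
kubectl -n karmada-system get endpoints karmada-scheduler-estimator-idc-hyper-sit-1 karmada-scheduler-estimator-idc-hyper-sit-2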

What do you think about this question?: It feels like the controller is not retrying to re-check the status.

Environment:

zhzhuang-zju commented 9 months ago

Hi, in order to understand this issue better, could you answer a few questions first?

huangyutongs commented 9 months ago

Hi, in order to understand this issue better, could you answer a few questions first?

  • Is the status of the ResourceBinding continuously displayed abnormally, or does it return to normal after a period of time?
  • You showed the log of the karmada-descheduler component. Since there is no date, I am not sure whether it is related to this phenomenon. Can you give the specific date of this log?
  • Can you show the logs of karmada-controller-manager from around the time the phenomenon occurs? The status is its responsibility. Note that it has two instances; the leader instance can be found with the command kubectl get Lease -A (a sketch of this follows right after this list).
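
For reference, a minimal sketch of reading the leader from the Lease (the lease name and namespace below are assumed defaults; take the real ones from the kubectl get lease -A listing):

# List all leases, then read the holder of the controller-manager lock
kubectl get lease -A
kubectl -n karmada-system get lease karmada-controller-manager \
  -o jsonpath='{.spec.holderIdentity}'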

Thanks for your reply; here are my answers:

  1. The status of the ResourceBinding continues to display abnormally.
  2. The karmada-descheduler log shown above is not from the time the problem occurred; I later found that it keeps outputting these errors continuously.
  3. Here is some potentially useful log output from the end of karmada-controller-manager:
    I1227 03:41:43.103307 1 recorder.go:104] "events: Update resourceBinding(scmp-a/rps-yhportal-management-deployment) with AggregatedStatus successfully." type="Normal" object={Kind:ResourceBinding Namespace:scmp-a Name:rps-yhportal-management-deployment UID:4eb13939-59df-4b4d-988e-8c752d814172 APIVersion:work.karmada.io/v1alpha2 ResourceVersion:959002 FieldPath:} reason="AggregateStatusSucceed"
    I1227 03:41:43.103335 1 recorder.go:104] "events: Update resourceBinding(scmp-a/rps-yhportal-management-deployment) with AggregatedStatus successfully." type="Normal" object={Kind:Deployment Namespace:scmp-a Name:rps-yhportal-management UID:aa48cdf6-098e-4a22-88a1-210ddc39dc55 APIVersion:apps/v1 ResourceVersion:958987 FieldPath:} reason="AggregateStatusSucceed"
    I1227 10:05:49.109651 1 request.go:696] Waited for 1.022718291s due to client-side throttling, not priority and fairness, request: GET:https://192.168.120.50:6443/apis/snapshot.storage.k8s.io/v1
    W1227 10:05:49.811792 1 cluster_status_controller.go:237] Maybe get partial(67) APIs installed in Cluster idc-hyper-sit-1. Error: unable to retrieve the complete list of server APIs: acme.yourcompany.com/v1alpha1: the server is currently unable to handle the request, metrics.k8s.io/v1beta1: the server is currently unable to handle the request.
    I1227 10:05:51.230784 1 request.go:696] Waited for 1.022560257s due to client-side throttling, not priority and fairness, request: GET:https://192.168.120.50:6443/apis/application.kubesphere.io/v1alpha1
    W1227 10:05:51.932665 1 cluster_status_controller.go:237] Maybe get partial(67) APIs installed in Cluster idc-hyper-sit-1. Error: unable to retrieve the complete list of server APIs: acme.yourcompany.com/v1alpha1: the server is currently unable to handle the request, metrics.k8s.io/v1beta1: the server is currently unable to handle the request.
    I1227 10:06:01.230738 1 request.go:696] Waited for 1.020999528s due to client-side throttling, not priority and fairness, request: GET:https://192.168.120.50:6443/apis/fluentd.fluent.io/v1alpha1
    W1227 10:06:01.935454 1 cluster_status_controller.go:237] Maybe get partial(67) APIs installed in Cluster idc-hyper-sit-1. Error: unable to retrieve the complete list of server APIs: acme.yourcompany.com/v1alpha1: the server is currently unable to handle the request, metrics.k8s.io/v1beta1: the server is currently unable to handle the request.
    I1227 10:06:07.390602 1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
    I1227 10:06:07.390636 1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
    I1227 10:06:07.390781 1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
    I1227 10:06:07.390953 1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
    I1227 10:06:07.390973 1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
    I1227 10:06:07.390954 1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
    I1227 10:06:07.391154 1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
    I1227 10:06:07.390580 1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
    E1227 10:06:11.938697 1 cluster_status_controller.go:406] Failed to do cluster health check for cluster idc-hyper-sit-1, err is: Get "https://192.168.120.50:6443/readyz": dial tcp 192.168.120.50:6443: connect: connection refused
    E1227 10:06:21.940779 1 cluster_status_controller.go:406] Failed to do cluster health check for cluster idc-hyper-sit-1, err is: Get "https://192.168.120.50:6443/readyz": dial tcp 192.168.120.50:6443: connect: connection refused

    More log information is in the attachment karmada-controller-manager.log
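
Given the cluster health-check failures in the log above, it may also help to look at what the Karmada control plane currently records for the member clusters (a sketch, assuming kubectl points at the Karmada apiserver):

# READY state of both member clusters
kubectl get clusters
# Conditions and taints on the cluster that failed the health check
kubectl describe cluster idc-hyper-sit-1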