karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
https://karmada.io
Apache License 2.0

deployment propagation fails after deployment replicas syncer feature is enabled #4758

Closed chaosi-zju closed 1 month ago

chaosi-zju commented 1 month ago

What happened:

ci e2e test failed: https://github.com/karmada-io/karmada/actions/runs/8336573479/job/22814077291

The test just propagates 2 replicas into two clusters with a static weight of 1:1, so getting the deployment is expected to show 2/2.

However, after #4707 was merged, it may show a 1/1 result with high probability.

What you expected to happen:

The deployment is expected to eventually show 2/2.

How to reproduce it (as minimally and precisely as possible):

Just propagate 2 replicas into two clusters with a static weight of 1:1.

Anything else we need to know?:

Environment:

chaosi-zju commented 1 month ago

Reason:

When propagating 2 replicas, suppose 1 replica has been synced to the member1 cluster while the other has not yet been synced to member2. The binding may then report a status like the following:

"aggregatedStatus": [{
  "clusterName": "member1",
  "status": {
    "replicas": 1,
    "unavailableReplicas": 1,
    "updatedReplicas": 1
  },
  "applied": true,
  "health": "Unhealthy"
}, {
  "clusterName": "member2",
  "status": {},
  "applied": true,
  "health": "Unhealthy"
}]

Note that member2 reports an empty "status": {}.

Meanwhile, the deployment status looks like:

"status": {
  "observedGeneration": 2,
  "replicas": 1,
  "updatedReplicas": 1,
  "unavailableReplicas": 1
}

Then the newly added deploymentReplicasSyncerController updates spec.replicas from 2 to 1, which interrupts the replicas sync to member2 and finally results in 1/1.
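The failure mode above can be sketched in Go. This is a simplified illustration, not Karmada's actual code: the type and function names are hypothetical, and the real controller works with the ResourceBinding API types rather than a plain struct. It shows why summing member-reported replicas while one member's status is still empty shrinks spec.replicas from 2 to 1:

```go
package main

import "fmt"

// memberStatus is a hypothetical, simplified stand-in for one entry of
// binding.status.aggregatedStatus; replicas is 0 when the member
// reported an empty status ("status": {}).
type memberStatus struct {
	clusterName string
	replicas    int32
}

// sumReportedReplicas adds up the replicas each member cluster reports.
// A syncer that writes this sum back to spec.replicas too early will
// undercount while some members have not applied the workload yet.
func sumReportedReplicas(statuses []memberStatus) int32 {
	var total int32
	for _, s := range statuses {
		total += s.replicas
	}
	return total
}

func main() {
	// member2 has not been synced yet, so its status is still empty.
	statuses := []memberStatus{
		{clusterName: "member1", replicas: 1},
		{clusterName: "member2", replicas: 0},
	}
	// Writing this sum to spec.replicas shrinks the deployment from 2 to 1.
	fmt.Println(sumReportedReplicas(statuses)) // 1
}
```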

Proposed resolution:

In addition to the current wait condition binding.status.aggregatedStatus.status != nil, we should add a stricter one: binding.status.aggregatedStatus.status.replicas > 0.
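The proposed condition could look roughly like the following Go sketch. It is only an illustration under simplified assumptions: the struct below loosely models an aggregatedStatus item (the real AggregatedStatusItem carries the status as raw JSON), and the helper name is made up. The point is that the syncer waits until every member reports replicas > 0, not merely a non-nil status:

```go
package main

import "fmt"

// AggregatedStatusItem loosely models one entry of
// binding.status.aggregatedStatus; a nil Replicas pointer models the
// empty "status": {} case from the issue.
type AggregatedStatusItem struct {
	ClusterName string
	Replicas    *int32
}

// allClustersReportReplicas is a hypothetical helper implementing the
// proposed wait condition: every member's status must be present AND
// report replicas > 0 before the syncer may touch spec.replicas.
func allClustersReportReplicas(items []AggregatedStatusItem) bool {
	for _, item := range items {
		if item.Replicas == nil || *item.Replicas <= 0 {
			return false
		}
	}
	return len(items) > 0
}

func main() {
	one := int32(1)
	// member2's status is empty, as in the issue: the syncer must wait.
	pending := []AggregatedStatusItem{
		{ClusterName: "member1", Replicas: &one},
		{ClusterName: "member2", Replicas: nil},
	}
	fmt.Println(allClustersReportReplicas(pending)) // false

	// Both members report replicas: it is now safe to sync.
	ready := []AggregatedStatusItem{
		{ClusterName: "member1", Replicas: &one},
		{ClusterName: "member2", Replicas: &one},
	}
	fmt.Println(allClustersReportReplicas(ready)) // true
}
```

With this guard, the controller would not shrink spec.replicas while member2's status is still empty, so both replicas eventually sync and the deployment reaches 2/2.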

chaosi-zju commented 1 month ago

/assign chaosi-zju
