kubernetes-retired / kubefed

Kubernetes Cluster Federation

Bug Report involving ReplicaSchedulingPreference #1369

Closed: marinoborges closed this issue 3 years ago

marinoborges commented 3 years ago

What happened: I've set up kubefed with 2 clusters and was testing an RSP with targetKind FederatedDeployment, totalReplicas=1, and a greater weight on the downstream cluster. The pod is scheduled and becomes ready on the downstream cluster accordingly. My goal was to check whether the host cluster would schedule the pod after cordoning all of the downstream cluster's workers. What I found after cordoning the nodes is that the pod was evicted and left in "Pending" state, but it never got moved to the host cluster, even after waiting for hours. Interestingly, if the kubefed-controller-manager pod is manually deleted, the new kubefed-controller-manager does perform the move, including moving the pod back to the downstream cluster after the nodes are uncordoned. But the same issue happens again if I cordon the downstream cluster's nodes one more time. I raised the controller-manager log verbosity to --v=5, as kindly suggested by 'irfanurrehman' (Slack channel #sig-multicluster), and didn't see any evidence that the controller noticed the pod unavailability. Logs below.
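
For reference, cordoning a worker amounts to marking its Node unschedulable, which is why the replacement pod stays Pending on the downstream cluster; a minimal sketch of the resulting Node spec (the node name is hypothetical):

apiVersion: v1
kind: Node
metadata:
  name: downstream-worker-1   # hypothetical node name
spec:
  unschedulable: true         # set by `kubectl cordon`; blocks new pods from scheduling here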

What you expected to happen: I expected the pod to move to the host cluster once the downstream cluster's only node was cordoned and the pod was left in Pending state.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

RSP

apiVersion: scheduling.kubefed.io/v1alpha1
kind: ReplicaSchedulingPreference
metadata:
  name: deployment-100-108
  namespace: video-ingest
spec:
  targetKind: FederatedDeployment
  totalReplicas: 1
  rebalance: true
  clusters:
    "host-cluster":
      weight: 1
    "downstream-cluster":
      weight: 10
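
For completeness, a minimal FederatedDeployment that this RSP would target might look like the sketch below; the name and namespace come from the RSP above (the RSP matches the federated resource of the same name and namespace), while the labels, container, and image are assumptions:

apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: deployment-100-108          # must match the RSP name for targetKind FederatedDeployment
  namespace: video-ingest
spec:
  template:
    spec:
      replicas: 1                   # per-cluster replicas are overridden by the RSP
      selector:
        matchLabels:
          app: video-ingest         # hypothetical label
      template:
        metadata:
          labels:
            app: video-ingest
        spec:
          containers:
          - name: ingest            # hypothetical container
            image: nginx:1.21       # placeholder image
  placement:
    clusters:
    - name: host-cluster
    - name: downstream-cluster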

Logs Timeline:

/kind bug

irfanurrehman commented 3 years ago

/assign

irfanurrehman commented 3 years ago

@marinoborges So I had a look at this and the code so far looks all right. Your scenario is that the federated resource (in your case a deployment in the joined clusters) is supposed to be observed by the RSP to balance replicas. In the situation you mentioned, what does not seem to happen is the triggering of reconciliation when the pods change state.

The RSP reconciliation is set to trigger on any change in the RSP resource, the federated resource (the FederatedDeployment), and the actual k8s resources (the Deployments). When the pods change status, the ready-replica count should change on the k8s resource, in turn recording an update on that resource, which should be observed by the RSP scheduler, which should then reconcile. Maybe I am missing something in the chain.
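
To illustrate the chain: after the downstream node is cordoned and the pod is evicted and goes Pending, the propagated Deployment's status in that cluster should change roughly as below, and that status update is the event the RSP scheduler is expected to react to (field values are illustrative):

status:
  replicas: 1
  updatedReplicas: 1
  unavailableReplicas: 1
  readyReplicas: 0            # dropped from 1 when the pod lost its node and went Pending
  availableReplicas: 0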

@marinoborges would you mind sharing the YAMLs and the steps to recreate this scenario (I understand the scenario, but this will save me some time), so that I can take a deeper look into it?