kubernetes-retired / kubefed

Kubernetes Cluster Federation

Bug Report involving ReplicaSchedulingPreference #1369

Closed: marinoborges closed this issue 3 years ago

marinoborges commented 3 years ago

What happened: I've set up kubefed with 2 clusters and was testing an RSP with targetKind FederatedDeployment, totalReplicas=1, and a greater weight on the downstream cluster. The pod is scheduled and becomes ready on the downstream cluster accordingly. My goal was to check whether the host cluster would schedule the pod after cordoning all of the downstream cluster's workers. What I found after cordoning the nodes is that the pod was evicted and left in "Pending" state, but it never got moved to the host cluster, even after waiting for hours. Interestingly, if the kubefed-controller-manager pod is manually deleted, the new kubefed-controller-manager does perform the move, including moving the pod back to the downstream cluster after the nodes are uncordoned. But the same issue happens again if I cordon the downstream cluster's nodes one more time. I raised the controller-manager log verbosity to --v=5, as kindly suggested by 'irfanurrehman' (Slack channel #sig-multicluster), and didn't see any evidence that the controller noticed the pod unavailability. Logs below.
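
For reference, cordoning a worker amounts to marking its Node unschedulable, which is why the replacement pod stays Pending on the downstream cluster; a minimal sketch of the resulting Node spec (the node name is hypothetical):

apiVersion: v1
kind: Node
metadata:
  name: downstream-worker-1   # hypothetical node name
spec:
  unschedulable: true         # set by `kubectl cordon`; blocks new pods from scheduling here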

What you expected to happen: I expected the pod to move to the host cluster once the downstream cluster's only node was cordoned and the pod was left in Pending state.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

RSP

apiVersion: scheduling.kubefed.io/v1alpha1
kind: ReplicaSchedulingPreference
metadata:
  name: deployment-100-108
  namespace: video-ingest
spec:
  targetKind: FederatedDeployment
  totalReplicas: 1
  rebalance: true
  clusters:
    "host-cluster":
      weight: 1
    "downstream-cluster":
      weight: 10
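
For completeness, a minimal FederatedDeployment that this RSP would target might look like the sketch below; the name and namespace come from the RSP above (the RSP matches the federated resource of the same name and namespace), while the labels, container, and image are assumptions:

apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: deployment-100-108          # must match the RSP name for targetKind FederatedDeployment
  namespace: video-ingest
spec:
  template:
    spec:
      replicas: 1                   # per-cluster replicas are overridden by the RSP
      selector:
        matchLabels:
          app: video-ingest         # hypothetical label
      template:
        metadata:
          labels:
            app: video-ingest
        spec:
          containers:
          - name: ingest            # hypothetical container
            image: nginx:1.21       # placeholder image
  placement:
    clusters:
    - name: host-cluster
    - name: downstream-cluster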

Logs Timeline:

/kind bug

irfanurrehman commented 3 years ago

/assign

irfanurrehman commented 3 years ago

@marinoborges So I had a look at this and the code so far looks all right. Your scenario is that the federated resource (in your case a deployment in the joined clusters) is supposed to be observed by the RSP to balance replicas. In the situation you mentioned, what does not seem to happen is the triggering of reconciliation when the pods change state.

The RSP reconciliation is set to trigger on any change in the RSP resource, the federated resource (the FederatedDeployment), and the actual k8s resources (the Deployments). When the pods change status, the ready-replica count should change on the k8s resource, in turn recording an update on that resource, which should be observed by the RSP scheduler, which should then reconcile. Maybe I am missing something in the chain.
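
To illustrate the chain: after the downstream node is cordoned and the pod is evicted and goes Pending, the propagated Deployment's status in that cluster should change roughly as below, and that status update is the event the RSP scheduler is expected to react to (field values are illustrative):

status:
  replicas: 1
  updatedReplicas: 1
  unavailableReplicas: 1
  readyReplicas: 0            # dropped from 1 when the pod lost its node and went Pending
  availableReplicas: 0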

@marinoborges would you mind sharing the YAMLs and the steps to recreate this scenario (I understand the scenario, but this will save me some time), so that I can take a deeper look into it?