tigera / operator

Kubernetes operator for installing Calico and Calico Enterprise
Apache License 2.0

Tigera-operator schedules on cordoned nodes #2537

Closed: uristernik closed this issue 8 months ago

uristernik commented 1 year ago

Expected Behavior

When draining a node that the tigera-operator is scheduled on, we expected the tigera-operator pod to be terminated and rescheduled on a node that is not cordoned.

Current Behavior

The tigera-operator pod got scheduled only on cordoned nodes, even when we forcefully killed the pod or restarted the deployment.

Steps to Reproduce (for bugs)

  1. Install tigera-operator
  2. Cordon multiple nodes
  3. Drain the node that tigera-operator runs on

Context

We saw this issue reproduced twice:

  1. While migrating from CAS to Karpenter, when we were draining nodes.
  2. When we upgraded AMI versions from 1.21 to 1.22.

Your Environment

EKS version: 1.22
Tigera-operator image version: v1.27.16
Calico image version: v3.23.5
Deploying using the helm chart

tmjd commented 1 year ago

This repo doesn't provide an end-user install manifest, so I'm assuming you are using it from the Calico docs. I think you should be able to change the deployment as you see fit, even if you're using the helm chart. This is probably happening because we assume the operator will be deployed on a cluster that does not yet have pod networking, and therefore it needs to tolerate NoExecute and NoSchedule so that it can run on nodes without pod networking.

Do you have any suggestions on how we could support installing on clusters that start with no networking and still be able to avoid cordoned nodes?

I'm wondering if we could add an (anti-)affinity to avoid cordoned nodes. As long as it was a preferred affinity (not a required one), the pod would still be scheduled even when no nodes matched it.
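
For reference, a preferred (soft) node affinity has roughly the shape sketched below. This is only illustrative: the example.com/schedulable label is hypothetical, since cordoning is recorded in the node's spec.unschedulable field rather than in a label, which is exactly why it's unclear whether the scheduler can be steered away from cordoned nodes this way.

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        # hypothetical label; cordoning does not add any node label by itself
        - key: example.com/schedulable
          operator: In
          values:
          - "true"

Because this is preferredDuringScheduling rather than requiredDuringScheduling, a pod carrying it is still scheduled even when no node matches.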

uristernik commented 1 year ago

Yes, I understand, but the behaviour we are seeing is that the scheduler prefers scheduling on cordoned nodes. When we have multiple cordoned nodes, the tigera-operator will move from one cordoned node to another (the cordoned nodes have neither the NoExecute nor the NoSchedule taint).

I wonder if that's the expected behaviour, and if others are experiencing it too.

tmjd commented 1 year ago

> the cordoned nodes have neither the NoExecute nor the NoSchedule taint

That sounds unexpected to me; from what I've found (by searching), cordoning a node should mark it unschedulable and taint it accordingly. I still do not think this would affect where the tigera-operator is scheduled, though, since it tolerates that taint.
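
For context, cordoning marks the node unschedulable and the node controller then adds a matching taint; an abbreviated node object after a cordon looks roughly like this:

spec:
  unschedulable: true
  taints:
  - key: node.kubernetes.io/unschedulable
    effect: NoSchedule

The operator's default toleration for effect NoSchedule with operator: Exists matches this taint, so a cordoned node remains a valid candidate for the operator pod.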

uristernik commented 1 year ago

Tomorrow we are migrating another cluster to Karpenter; I'll attach as much info as I can. Just to clarify:

  1. The taints that the node gets aren't the ones the tigera-operator tolerates.
  2. It looks like the scheduler prefers scheduling the tigera-operator on the cordoned nodes.

uristernik commented 1 year ago

I was wrong in my last message; I do in fact see the matching taint and toleration. But we are still seeing it hop from one cordoned node to another until there are no more cordoned nodes in the cluster. Have you ever seen this kind of behaviour?

[screenshots attached]

tomsucho commented 1 year ago

I've seen the same during EKS upgrades, but it doesn't happen on all clusters, so there is some randomness to it. Since we use the AWS VPC CNI and nodes come up with networking already set up, I'm planning to remove these tolerations.

ayeks commented 1 year ago

@tomsucho Any progress on removing these tolerations? Having a hardcoded NoSchedule toleration on the operator sadly breaks a lot of workflows. At the very least this should be configurable in the helm chart.

tomsucho commented 1 year ago

@ayeks in 3.25.1 I only see the following default tolerations on the tigera-operator deployment pod, and I think those should be fine:

  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

SammyA commented 1 year ago

Installing with the helm chart defaults results in:

tolerations:
  - effect: NoExecute
    operator: Exists
  - effect: NoSchedule
    operator: Exists

resulting in the pod hopping between cordoned nodes. Overriding the chart values to:

tolerations:
  - effect: NoSchedule
    operator: Exists
    key: node.kubernetes.io/not-ready

seems to fix it.
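
For anyone who wants to apply the same override declaratively, a minimal sketch of a values file for the tigera-operator chart could look like this (assuming the chart exposes a top-level tolerations value, as the defaults above suggest); it simply mirrors the override shown above and would be passed to helm with -f:

# values-override.yaml (sketch; the key name is assumed from the chart defaults shown above)
tolerations:
- effect: NoSchedule
  operator: Exists
  key: node.kubernetes.io/not-ready

Note that narrowing the tolerations like this gives up the ability to schedule the operator onto tainted nodes during cluster bootstrap, which is the behaviour the default is meant to preserve.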

tmjd commented 1 year ago

The helm chart has tolerations that can be changed, and if you're using the deployment manifest the tolerations can be edited by those who need that. But I think the default needs to remain tolerating NoExecute and NoSchedule.

To avoid cordoned nodes, I think it would be acceptable to add a preferred affinity for schedulable nodes; the operator could then still be deployed if there were none (for example at installation time, when there might not be any schedulable nodes). I'm not sure whether it is possible to express an affinity for schedulable nodes, has anyone looked into that? I'd be happy to review a PR to the operator chart in the calico repo with a change like that.

caseydavenport commented 8 months ago

Going to close this as tolerations can be configured in the helm chart, and this PR adds affinity: https://github.com/projectcalico/calico/pull/8095