kubernetes / kubeadm

Aggregator for issues filed against kubeadm

Move Control Planes taint to kubelet config instead of markcontrolplane phase #1621

Open yagonobre opened 5 years ago

yagonobre commented 5 years ago

Is this a BUG REPORT or FEATURE REQUEST?

Choose one: BUG REPORT

What happened?

Because we apply the control-plane taint only after the control plane comes up, in a multi-control-plane setup pods can be scheduled onto the new control-plane node before it is tainted.

What you expected to happen?

Use the kubelet --register-with-taints flag instead of handling it in a separate phase.

How to reproduce it (as minimally and precisely as possible)?

  1. Kubeadm init
  2. Create a daemonset
  3. Join another control plane

Anything else we need to know?

For now I just use this config, but it would be nice if kubeadm could handle it.

apiVersion: kubeadm.k8s.io/v1beta1
kind: InitConfiguration
nodeRegistration:
  # Remove the default control-plane taint so we taint it manually with KUBELET_EXTRA_ARGS
  taints: []
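
The KUBELET_EXTRA_ARGS part of this workaround lives in the environment file sourced by kubeadm's kubelet systemd drop-in; a minimal sketch, assuming the default file locations shipped by the kubeadm packages:

# /etc/default/kubelet (deb-based) or /etc/sysconfig/kubelet (rpm-based)
# Re-apply the control-plane taint at kubelet registration time,
# since the InitConfiguration above clears kubeadm's own taint list.
KUBELET_EXTRA_ARGS="--register-with-taints=node-role.kubernetes.io/master=:NoSchedule"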

neolit123 commented 5 years ago

i guess this can be a real problem. thanks for the report @yagonobre

does the kubelet allow self-taints with node roles such as node-role.kubernetes.io/master=:NoSchedule? (it certainly does not for labels https://github.com/kubernetes/kubernetes/issues/45361)

yagonobre commented 5 years ago

does the kubelet allow self-taints with node roles such as node-role.kubernetes.io/master=:NoSchedule?

Yes, probably we can keep the phase to add the label

madhukar32 commented 5 years ago

@neolit123 : Can I take a crack at this?

neolit123 commented 5 years ago

@madhukar32 hi, this needs discussion before sending the PR. if we remove the taint from markcontrolplane phase it will break existing users.

yagonobre commented 5 years ago

I was away, but now I'll have enough time to do this. @neolit123 what do you think about adding the taint to the kubelet config and keeping it in the markcontrolplane phase just for backward compatibility?

neolit123 commented 5 years ago

we might have to keep it in both places with a deprecation notice.

blurpy commented 4 years ago

We are seeing this when joining control plane nodes to existing clusters (1.15). We use nginx-ingress-controller as a daemonset, and it's on host port 443, same as the apiserver. So the apiserver always ends up in a CrashLoop until I manually delete the pod.

neolit123 commented 4 years ago

this is a known problem. changes in kubeadm phases are tricky - the existing workaround can be seen above, but we might have to have a period of time where we both taint using the kubelet configuration and the kubeadm mark-control-plane phase, potentially deprecating the tainting in the phase in the future.

blurpy commented 4 years ago

Not sure I understand how the workaround works. Isn't InitConfiguration used only during init of the first master? Or can it be updated in the configmap in kube-system and used during join --control-plane?

neolit123 commented 4 years ago

both init and join configurations have the node registration options: https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta2#NodeRegistrationOptions

tainting using the KubeletConfiguration is not possible.

blurpy commented 4 years ago

I've tried to use a JoinConfiguration to setup additional masters, but then I just get this message:

# kubeadm join --control-plane --config /etc/kubernetes/kubeadm-master-config.yaml
     can not mix '--config' with arguments [control-plane]

Related issue I found: #1485

yagonobre commented 4 years ago

Today I add the taint directly to the kubelet config file. I'll try to work on it soon.

neolit123 commented 4 years ago

Yago, i don't see a field for that in the KubeletConfiguration.

neolit123 commented 4 years ago

I've tried to use a JoinConfiguration to setup additional masters, but then I just get this message:

The join configuration can do that. Some flags and config cannot be mixed.
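
A minimal JoinConfiguration sketch of what that could look like (endpoint, token, CA hash, and addresses are placeholders): requesting the control-plane role via the controlPlane section of the config removes the need for the --control-plane flag, and nodeRegistration can carry taints at the same time.

apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
discovery:
  bootstrapToken:
    apiServerEndpoint: "LOAD_BALANCER_DNS:6443"   # placeholder
    token: "abcdef.0123456789abcdef"              # placeholder
    caCertHashes:
      - "sha256:<hash>"                           # placeholder
controlPlane:
  localAPIEndpoint:
    advertiseAddress: "10.0.0.11"                 # placeholder
nodeRegistration:
  taints:
    - key: "node-role.kubernetes.io/master"
      effect: "NoSchedule"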

yagonobre commented 4 years ago

Yago, i don't see a field for that in the KubeletConfiguration.

Sorry, it's a flag. I use KUBELET_EXTRA_ARGS="--register-with-taints=node-role.kubernetes.io/master=:NoSchedule"

neolit123 commented 4 years ago

@yagonobre i just tried passing --register-with-taints=node-role.kubernetes.io/master:NoSchedule to the kubelet instead of using the markcontrolplane phase for the taint and both the CNI (calico) and coredns are stuck in Pending saying that there is no Node to schedule on (even if they tolerate the master taint).

what k8s version have you tried this with? i'm testing with the latest 1.18.

yagonobre commented 4 years ago

I'm using v1.16.2, but I'll try with the last version.

neolit123 commented 4 years ago

@yagonobre never mind. i made the mistake of deploying a kubelet much newer than the kube-apiserver -> 1.18 vs 1.16.3.

will send a POC PR that fixes this in a bit. although it needs discussion.

neolit123 commented 4 years ago

here is the POC https://github.com/kubernetes/kubernetes/pull/85977

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

neolit123 commented 4 years ago

/remove-lifecycle stale

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

neolit123 commented 4 years ago

/remove-lifecycle stale

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

neolit123 commented 3 years ago

/remove-lifecycle stale

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

neolit123 commented 3 years ago

/remove-lifecycle stale

neolit123 commented 3 years ago

minor update here: we could move the taint registration to the kubelet, but one major problem is that KubeletConfiguration v1beta1 does not support taints via config: https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/

there is only the CLI flag --register-with-taints, and we might want to stop using kubelet flags altogether, since they may eventually be removed (https://github.com/kubernetes/kubernetes/issues/86843)?

TBBle commented 3 years ago

I think this ticket is the motivation for --register-with-taints to be moved to kubelet configuration, rather than dropped. The problem isn't well-solvable without being able to include some taint as part of the initial node registration, either the final intended taint (control-plane, in this case), or some intermediate taint that says "Don't schedule to me under any circumstances, I have taints and perhaps labels still-to-come". The latter approach would needlessly complicate bootstrapping, I think.

The "security discussions" apply somewhat to --node-labels, but restricting kubelet self-labelling is supposed to address that, and off the top of my head, a taint can't be added to attract workloads, only to repel them, so that isn't applicable to --register-with-taints.

neolit123 commented 3 years ago

I think this ticket is the motivation for --register-with-taints to be moved to kubelet configuration, rather than dropped.

yes, the linked kubelet issue talks about removing flags and adding them as fields in config. though it's mostly stale, best effort and there are some decisions to be made around breaking users. on the kubeadm side we are hesitant about adding more kubelet flags that kubeadm manages.

users can opt-in with NodeRegistrationOptions.kubeletExtraArgs, but they would have to manage the deprecations/removals of said flags.

The "security discussions" apply somewhat to --node-labels, but restricting kubelet self-labelling is supposed to address that, and off the top of my head, a taint can't be added to attract workloads, only to repel them, so that isn't applicable to --register-with-taints.

yes, taints = repel, labels = attract. both must be applied on the initial Node object, before workloads schedule on it. this also suggests the restricted node-role label that kubeadm applies for control planes should not be used for workload scheduling, as it is applied later in the process by a privileged client.

ideally users should apply custom, non-restricted labels on kubelet startup.
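
As a sketch of the kubeletExtraArgs opt-in mentioned above (field names per NodeRegistrationOptions; the flag itself remains subject to the kubelet's own deprecation policy):

apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
nodeRegistration:
  # Skip kubeadm's own tainting and let the kubelet self-taint at registration.
  taints: []
  kubeletExtraArgs:
    register-with-taints: "node-role.kubernetes.io/master=:NoSchedule"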

randomvariable commented 3 years ago

@adisky do you know of any plans to migrate the --register-with-taints flag to kubelet config?

adisky commented 3 years ago

@randomvariable --register-with-taints looks like an instance-specific flag, and the instance config KEP is stalled: https://github.com/kubernetes/kubernetes/issues/61655 https://github.com/kubernetes/enhancements/pull/1439

Interested in knowing the experience with this flag: is it usually the same for all nodes in a cluster, or set differently on each node? I think it could be both, depending on the use case. If there is any value in adding it to the generic kubelet config, a PR can be created now, but if the major use case is instance-specific then we have to wait for the enhancement.

neolit123 commented 3 years ago

Interested in knowing the experience with this flag, is it usually same for all nodes in cluster or set differently on each node?

I don't think anyone has exact metrics. The flag value can be different on one or more nodes. It is about node taints, after all.

adisky commented 3 years ago

I don't think anyone has exact metrics. The flag value can be different on one or more nodes. It is about node taints, after all.

@neolit123 I was not asking for exact metrics, just the general usage :). It makes sense then to wait for the instance config KEP; there seems to be no benefit in migrating now.

neolit123 commented 3 years ago

The KEP PR is stale and should be closed with a transfer of ownership. The author recently asked if anyone wishes to take it over, in an email thread discussing the disbanding of the wg component standard.

pacoxu commented 2 years ago

https://github.com/kubernetes/kubernetes/pull/105437 was merged in v1.23 and we are ready to move this forward in the 1.25 cycle for the n-1 skew kubelet support.

registerWithTaints:
- key: "node-role.kubernetes.io/master"
  effect: "NoSchedule"

{
  "kubeletconfig": {
    "enableServer": true,
    "staticPodPath": "/etc/kubernetes/manifests",
    "syncFrequency": "1m0s",
    "fileCheckFrequency": "20s",
    "httpCheckFrequency": "20s",
    "address": "0.0.0.0",
    "port": 10250,
    "tlsCertFile": "/var/lib/kubelet/pki/kubelet.crt",
    "tlsPrivateKeyFile": "/var/lib/kubelet/pki/kubelet.key",
    "rotateCertificates": true,
    "authentication": {
      "x509": {
        "clientCAFile": "/etc/kubernetes/pki/ca.crt"
      },
      "webhook": {
        "enabled": true,
        "cacheTTL": "2m0s"
      },
      "anonymous": {
        "enabled": false
      }
    },
    "authorization": {
      "mode": "Webhook",
      "webhook": {
        "cacheAuthorizedTTL": "5m0s",
        "cacheUnauthorizedTTL": "30s"
      }
    },
    "registryPullQPS": 5,
    "registryBurst": 10,
    "eventRecordQPS": 5,
    "eventBurst": 10,
    "enableDebuggingHandlers": true,
    "healthzPort": 10248,
    "healthzBindAddress": "127.0.0.1",
    "oomScoreAdj": -999,
    "clusterDomain": "cluster.local",
    "clusterDNS": [
      "10.96.0.10"
    ],
    "streamingConnectionIdleTimeout": "4h0m0s",
    "nodeStatusUpdateFrequency": "10s",
    "nodeStatusReportFrequency": "5m0s",
    "nodeLeaseDurationSeconds": 40,
    "imageMinimumGCAge": "2m0s",
    "imageGCHighThresholdPercent": 85,
    "imageGCLowThresholdPercent": 80,
    "volumeStatsAggPeriod": "1m0s",
    "cgroupsPerQOS": true,
    "cgroupDriver": "systemd",
    "cpuManagerPolicy": "none",
    "cpuManagerReconcilePeriod": "10s",
    "memoryManagerPolicy": "None",
    "topologyManagerPolicy": "none",
    "topologyManagerScope": "container",
    "runtimeRequestTimeout": "2m0s",
    "hairpinMode": "promiscuous-bridge",
    "maxPods": 130,
    "podPidsLimit": -1,
    "resolvConf": "/etc/resolv.conf",
    "cpuCFSQuota": true,
    "cpuCFSQuotaPeriod": "100ms",
    "nodeStatusMaxImages": 50,
    "maxOpenFiles": 1000000,
    "contentType": "application/vnd.kubernetes.protobuf",
    "kubeAPIQPS": 5,
    "kubeAPIBurst": 10,
    "serializeImagePulls": false,
    "evictionHard": {
      "imagefs.available": "15%",
      "memory.available": "100Mi",
      "nodefs.available": "10%",
      "nodefs.inodesFree": "5%"
    },
    "evictionPressureTransitionPeriod": "5m0s",
    "enableControllerAttachDetach": true,
    "makeIPTablesUtilChains": true,
    "iptablesMasqueradeBit": 14,
    "iptablesDropBit": 15,
    "failSwapOn": true,
    "memorySwap": {},
    "containerLogMaxSize": "10Mi",
    "containerLogMaxFiles": 5,
    "configMapAndSecretChangeDetectionStrategy": "Watch",
    "enforceNodeAllocatable": [
      "pods"
    ],
    "volumePluginDir": "/usr/libexec/kubernetes/kubelet-plugins/volume/exec/",
    "logging": {
      "format": "text",
      "flushFrequency": 5000000000,
      "verbosity": 0,
      "options": {
        "json": {
          "infoBufferSize": "0"
        }
      }
    },
    "enableSystemLogHandler": true,
    "shutdownGracePeriod": "0s",
    "shutdownGracePeriodCriticalPods": "0s",
    "enableProfilingHandler": true,
    "enableDebugFlagsHandler": true,
    "seccompDefault": false,
    "memoryThrottlingFactor": 0.8,
    "registerWithTaints": [
      {
        "key": "node-role.kubernetes.io/master",
        "effect": "NoSchedule"
      }
    ],
    "registerNode": true
  }
}

pacoxu commented 2 years ago

/assign
I will work on this later. Feel free to work on it.

pacoxu commented 2 years ago

During the implementation of https://github.com/kubernetes/kubernetes/pull/10946, I ran into a problem.

If I add the taint to the kubelet configuration on a control-plane node, it gets uploaded to the kubelet-config ConfigMap, which is shared by all nodes.

My current proposal would be

Another proposal would be

neolit123 commented 2 years ago

While we established that the tainting is late and can cause problems we have not seen reports from more users about it. Paco, are you seeing the problem on your side with customers?

Before we proceed with PRs i think we need to get back to the discussion if we want to do it. It also feels a bit odd to separate the taint and labels. Perhaps we want them to continue to be done by the kubeadm client instead of the kubelet.

dborman-hpe commented 2 years ago

FYI we are having issues with the taints getting added late when adding a new control-plane node. Our scenario is that we have 3 physical nodes, and each physical node has two VMs: one is a control plane and one is a worker. When we need to replace a failed physical node, we are replacing both of these VMs. The way the VMs are brought up, the control-plane VM usually joins the cluster before the worker VM. During the window between when the control-plane VM becomes available and when kubeadm applies the control-plane taint, Kubernetes can schedule pending and DaemonSet pods that do not have a toleration for the control-plane taint onto the control-plane VM. We then have to go through a clean-up procedure to remove the pods that slipped in. In some cases all we have to do is delete the pods, but in other cases there are dependencies, such as on PVCs, that complicate the cleanup. Ideally the best solution would be to eliminate the window between when the node becomes schedulable and when the taints are applied.

pacoxu commented 2 years ago

Since we support KubeletConfiguration patches (https://github.com/kubernetes/kubeadm/issues/1682), we can use --patches to achieve this in v1.25 and later.

kubeadm join --patches=/tmp/kubeadm-patches xxx

The /tmp/kubeadm-patches/kubeletconfiguration+strategic.yaml patch file looks like the following.

registerWithTaints:
- key: "node-role.kubernetes.io/control-plane"
  effect: "NoSchedule"

neolit123 commented 2 years ago

Patches can work yes. Currently both the labels and taints are applied in the same phase - mark-control-plane. Which is nice and contained. If we are changing the phase we need a plan because users execute it on demand too.

I personally have not seen many reports of the problem discussed here and there are already workarounds by passing custom blocking taints on node registration and removing them after n minutes. Or i guess just passing a kubelet config patch.

EDIT: I wonder if we should just document this.
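
A sketch of the blocking-taint workaround mentioned above, using an arbitrary placeholder taint key; kubeadm's mark-control-plane phase still applies the real control-plane taint and labels as usual:

# 1. Register the joining node with a temporary blocking taint, e.g. via:
#      nodeRegistration:
#        kubeletExtraArgs:
#          register-with-taints: "example.com/bootstrapping=:NoSchedule"
# 2. Once kubeadm has finished and the control-plane taint is in place,
#    remove the temporary taint:
kubectl taint nodes <node-name> example.com/bootstrapping:NoSchedule-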

pacoxu commented 2 years ago

/unassign

EDIT: I wonder if we should just document this.

Agree.

pacoxu commented 2 years ago

Paco, are you seeing the problem on your side with customers?

I have not met this recently. BTW, this is not a P0 or P1 bug; maybe a P2-P4 issue, as it happens during cluster maintenance.

It also feels a bit odd to separate the taint and labels. Perhaps we want them to continue to be done by the kubeadm client instead of kubelet.

If we move it to the kubelet, the taint will come back once the kubelet restarts. It would confuse some users if they removed the taint on purpose. It seems to be a behavior change.

neolit123 commented 2 years ago

If we move it to the kubelet, the taint will come back once the kubelet restarts. It would confuse some users if they removed the taint on purpose.

that is indeed a concern

pacoxu commented 2 years ago

If we move it to the kubelet, the taint will come back once the kubelet restarts. It would confuse some users if they removed the taint on purpose.

that is indeed a concern

IMO, it is a design problem of the kubelet. One solution I know of is for the kubelet to add node labels or taints only on first bootstrap, when it registers with the apiserver. After bootstrap (once the node is already part of the cluster), the kubelet should not change its own node labels or taints; it should respect the apiserver/etcd storage. But this is another topic. If that were the case, any cluster installer could add the default taints/labels during join, and later kubelet restarts would cause no disturbance.

neolit123 commented 2 years ago

true, so unless the kubelet changes its design we can work around it by adding the taints once and then removing them from the kubelet flags or local config. i must admit i am not a big fan of this solution as it creates more difficulties for us.

we can continue doing it in kubeadm with a client, perhaps applying it sooner, but as far as i understand there could still be a short period when the taints are not there yet.

IMO this is not a very common issue. ideally users should create the control plane nodes and only then start applying workloads such as daemonsets.

pacoxu commented 2 years ago

To summarize,

  1. Workaround: use --patches when joining a new control plane. See https://github.com/kubernetes/kubeadm/issues/1621#issuecomment-1098649397.
  2. Problems that should be solved before we implement it.