yagonobre opened this issue 5 years ago
I guess this can be a real problem. Thanks for the report, @yagonobre.
does the kubelet allow self-taints with node roles such as node-role.kubernetes.io/master=:NoSchedule?
(it certainly does not for labels https://github.com/kubernetes/kubernetes/issues/45361)
does the kubelet allow self-taints with node roles such as node-role.kubernetes.io/master=:NoSchedule?
Yes. We can probably keep the phase just for adding the label.
@neolit123 : Can I take a crack at this?
@madhukar32 hi, this needs discussion before sending a PR. If we remove the taint from the mark-control-plane phase it will break existing users.
I was away, but now I'll have enough time to do this. @neolit123, what do you think about adding the taint to the kubelet config and keeping it in the mark-control-plane phase just for backward compatibility?
we might have to keep it in both places with a deprecation notice.
We are seeing this when joining control plane nodes to existing clusters (1.15). We use nginx-ingress-controller as a daemonset, and it's on host port 443, same as the apiserver. So the apiserver always ends up in a CrashLoop until I manually delete the pod.
this is a known problem. changes in kubeadm phases are tricky - the existing workaround can be seen above, but we might have to have a period of time where we both taint using the kubelet configuration and the kubeadm mark-control-plane phase, potentially deprecating the tainting in the phase in the future.
Not sure I understand how the workaround works. Isn't InitConfiguration used only during init of the first master? Or can it be updated in the configmap in kube-system and used during join --control-plane?
both init and join configurations have the node registration options: https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta2#NodeRegistrationOptions
Tainting using the KubeletConfiguration is not possible.
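For illustration, a minimal sketch (not taken from this thread; the kubeadm v1beta2 API is assumed) of carrying the taint through NodeRegistrationOptions so the kubelet registers with it:

```yaml
# Minimal sketch, assuming the kubeadm v1beta2 API; the same nodeRegistration
# stanza exists in both InitConfiguration and JoinConfiguration.
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    # Passed through to the kubelet as --register-with-taints, so the taint is
    # present on the Node object from the moment it registers.
    register-with-taints: "node-role.kubernetes.io/master=:NoSchedule"
```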
I've tried to use a JoinConfiguration to setup additional masters, but then I just get this message:
# kubeadm join --control-plane --config /etc/kubernetes/kubeadm-master-config.yaml
can not mix '--config' with arguments [control-plane]
Related issue I found: #1485
Today I add the taint directly to the kubelet config file. I'll try to work on it soon.
Yago, I don't see a field for that in the KubeletConfiguration.
I've tried to use a JoinConfiguration to setup additional masters, but then I just get this message:
The join configuration can do that. Some flags and config cannot be mixed.
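For example, a hedged sketch (endpoint, token, and hash values are placeholders) where the control-plane and discovery settings live in the JoinConfiguration itself, so kubeadm join is invoked with only --config:

```yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
discovery:
  bootstrapToken:
    apiServerEndpoint: "10.0.0.1:6443"      # placeholder
    token: "abcdef.0123456789abcdef"        # placeholder
    caCertHashes:
      - "sha256:<ca-cert-hash>"             # placeholder
controlPlane:                               # replaces the --control-plane flag
  localAPIEndpoint:
    advertiseAddress: "10.0.0.2"            # placeholder
    bindPort: 6443
```

With that, kubeadm join --config /etc/kubernetes/kubeadm-master-config.yaml (without --control-plane) should avoid the "can not mix" error.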
Yago, I don't see a field for that in the KubeletConfiguration.
Sorry, it's a flag.
I use KUBELET_EXTRA_ARGS="--register-with-taints=node-role.kubernetes.io/master=:NoSchedule"
@yagonobre I just tried passing --register-with-taints=node-role.kubernetes.io/master:NoSchedule to the kubelet instead of using the mark-control-plane phase for the taint, and both the CNI (Calico) and CoreDNS are stuck in Pending, saying that there is no Node to schedule on (even though they tolerate the master taint).
what k8s version have you tried this with? i'm testing with the latest 1.18.
I'm using v1.16.2, but I'll try with the latest version.
@yagonobre never mind. i made the mistake of deploying a kubelet much newer than the kube-apiserver -> 1.18 vs 1.16.3.
will send a POC PR that fixes this in a bit. although it needs discussion.
here is the POC https://github.com/kubernetes/kubernetes/pull/85977
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Minor update here: we could move the taint registration to the kubelet, but one major problem is that KubeletConfiguration v1beta1 does not support taints via config: https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/
There is only the CLI flag --register-with-taints, and we might want to stop using kubelet flags altogether, because they are being removed? https://github.com/kubernetes/kubernetes/issues/86843
I think this ticket is the motivation for --register-with-taints to be moved to kubelet configuration, rather than dropped. The problem isn't well-solvable without being able to include some taint as part of the initial node registration: either the final intended taint (control-plane, in this case), or some intermediate taint that says "Don't schedule to me under any circumstances, I have taints and perhaps labels still-to-come". The latter approach would needlessly complicate bootstrapping, I think.
The "security discussions" apply somewhat to --node-labels, but restricting kubelet self-labelling is supposed to address that, and off the top of my head, a taint can't be added to attract workloads, only to repel them, so that isn't applicable to --register-with-taints.
I think this ticket is the motivation for --register-with-taints to be moved to kubelet configuration, rather than dropped.
Yes, the linked kubelet issue talks about removing flags and adding them as fields in config, though it's mostly stale / best effort, and there are some decisions to be made around breaking users. On the kubeadm side we are hesitant about adding more kubelet flags that kubeadm manages.
Users can opt in with NodeRegistrationOptions.kubeletExtraArgs, but they would have to manage the deprecations/removals of said flags.
The "security discussions" apply somewhat to --node-labels, but restricting kubelet self-labelling is supposed to address that, and off the top of my head, a taint can't be added to attract workloads, only to repel them, so that isn't applicable to --register-with-taints.
Yes, taints = repel, labels = attract. Both must be applied on the initial Node object, before workloads schedule on it. This also suggests the restricted node-role label that kubeadm applies for control planes should not be used for workload scheduling, as it is applied later in the process by a privileged client.
Ideally users should apply custom, non-restricted labels on kubelet startup.
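As an illustration of that last point, a sketch with hypothetical label and taint keys (not from this thread): custom labels outside the restricted prefixes can be applied by the kubelet itself at registration time via the node registration options:

```yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    # Custom, non-restricted label (hypothetical key), applied by the kubelet at
    # registration so it exists before any workload can be scheduled onto the node.
    node-labels: "example.com/node-pool=infra"
    # Custom taint (hypothetical key), applied the same way.
    register-with-taints: "example.com/dedicated=infra:NoSchedule"
```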
@adisky do you know of any plans to migrate the --register-with-taints flag to kubelet config?
@randomvariable --register-with-taints looks like an instance-specific flag, and the instance config KEP is stalled:
https://github.com/kubernetes/kubernetes/issues/61655
https://github.com/kubernetes/enhancements/pull/1439
Interested in knowing the experience with this flag: is it usually the same for all nodes in the cluster, or set differently on each node? I think it could be both, depending on use cases. If there is any value in adding it to the generic kubelet config a PR can be created now, but if the major use case is instance-specific then we have to wait for the enhancement.
Interested in knowing the experience with this flag: is it usually the same for all nodes in the cluster, or set differently on each node?
I don't think anyone has exact metrics. The flag value can be different on one or more nodes. It is about node taints, after all.
I don't think anyone has exact metrics. The flag value can be different on one or more nodes. It is about node taints, after all.
@neolit123 I was not asking for exact metrics, just the general usage :). It makes sense then to wait for the instance config KEP; there seems to be no benefit to migrating now.
The KEP PR is stale and should be closed with a transfer of ownership. The author recently asked if anyone wishes to take it over, in an email thread discussing the disbanding of WG Component Standard.
https://github.com/kubernetes/kubernetes/pull/105437 was merged in v1.23 and we are ready to move this forward in the 1.25 cycle for the n-1 skew kubelet support. For example, with the following in the KubeletConfiguration:
registerWithTaints:
- key: "node-role.kubernetes.io/master"
  effect: "NoSchedule"

The kubelet's effective configuration dump then shows the taint registered:
{
"kubeletconfig": {
"enableServer": true,
"staticPodPath": "/etc/kubernetes/manifests",
"syncFrequency": "1m0s",
"fileCheckFrequency": "20s",
"httpCheckFrequency": "20s",
"address": "0.0.0.0",
"port": 10250,
"tlsCertFile": "/var/lib/kubelet/pki/kubelet.crt",
"tlsPrivateKeyFile": "/var/lib/kubelet/pki/kubelet.key",
"rotateCertificates": true,
"authentication": {
"x509": {
"clientCAFile": "/etc/kubernetes/pki/ca.crt"
},
"webhook": {
"enabled": true,
"cacheTTL": "2m0s"
},
"anonymous": {
"enabled": false
}
},
"authorization": {
"mode": "Webhook",
"webhook": {
"cacheAuthorizedTTL": "5m0s",
"cacheUnauthorizedTTL": "30s"
}
},
"registryPullQPS": 5,
"registryBurst": 10,
"eventRecordQPS": 5,
"eventBurst": 10,
"enableDebuggingHandlers": true,
"healthzPort": 10248,
"healthzBindAddress": "127.0.0.1",
"oomScoreAdj": -999,
"clusterDomain": "cluster.local",
"clusterDNS": [
"10.96.0.10"
],
"streamingConnectionIdleTimeout": "4h0m0s",
"nodeStatusUpdateFrequency": "10s",
"nodeStatusReportFrequency": "5m0s",
"nodeLeaseDurationSeconds": 40,
"imageMinimumGCAge": "2m0s",
"imageGCHighThresholdPercent": 85,
"imageGCLowThresholdPercent": 80,
"volumeStatsAggPeriod": "1m0s",
"cgroupsPerQOS": true,
"cgroupDriver": "systemd",
"cpuManagerPolicy": "none",
"cpuManagerReconcilePeriod": "10s",
"memoryManagerPolicy": "None",
"topologyManagerPolicy": "none",
"topologyManagerScope": "container",
"runtimeRequestTimeout": "2m0s",
"hairpinMode": "promiscuous-bridge",
"maxPods": 130,
"podPidsLimit": -1,
"resolvConf": "/etc/resolv.conf",
"cpuCFSQuota": true,
"cpuCFSQuotaPeriod": "100ms",
"nodeStatusMaxImages": 50,
"maxOpenFiles": 1000000,
"contentType": "application/vnd.kubernetes.protobuf",
"kubeAPIQPS": 5,
"kubeAPIBurst": 10,
"serializeImagePulls": false,
"evictionHard": {
"imagefs.available": "15%",
"memory.available": "100Mi",
"nodefs.available": "10%",
"nodefs.inodesFree": "5%"
},
"evictionPressureTransitionPeriod": "5m0s",
"enableControllerAttachDetach": true,
"makeIPTablesUtilChains": true,
"iptablesMasqueradeBit": 14,
"iptablesDropBit": 15,
"failSwapOn": true,
"memorySwap": {},
"containerLogMaxSize": "10Mi",
"containerLogMaxFiles": 5,
"configMapAndSecretChangeDetectionStrategy": "Watch",
"enforceNodeAllocatable": [
"pods"
],
"volumePluginDir": "/usr/libexec/kubernetes/kubelet-plugins/volume/exec/",
"logging": {
"format": "text",
"flushFrequency": 5000000000,
"verbosity": 0,
"options": {
"json": {
"infoBufferSize": "0"
}
}
},
"enableSystemLogHandler": true,
"shutdownGracePeriod": "0s",
"shutdownGracePeriodCriticalPods": "0s",
"enableProfilingHandler": true,
"enableDebugFlagsHandler": true,
"seccompDefault": false,
"memoryThrottlingFactor": 0.8,
"registerWithTaints": [
{
"key": "node-role.kubernetes.io/master",
"effect": "NoSchedule"
}
],
"registerNode": true
}
}
/assign
I will work on this later. Feel free to work on it.
During the implementation of https://github.com/kubernetes/kubernetes/pull/10946, I ran into a problem.
If I add the taint to the kubelet configuration on a control-plane node, it will be uploaded to the kubelet-config ConfigMap, which is shared by all nodes.
My current proposal would be:
- kubeadm init
- kubeadm join as the control-plane node (this is special logic: the taint will be added to the YAML)
- kubeadm upgrade
- kubeadm join (as control-plane)

Another proposal would be
While we established that the tainting is late and can cause problems, we have not seen reports from more users about it. Paco, are you seeing the problem on your side with customers?
Before we proceed with PRs I think we need to get back to the discussion of whether we want to do it. It also feels a bit odd to separate the taint and labels. Perhaps we want them to continue to be done by the kubeadm client instead of kubelet.
FYI we are having issues with the taints getting added late when adding a new control-plane node. Our scenario is that we have 3 physical nodes, and each physical node has two VMs, one a control plane and one a worker. When we need to replace a failed physical node, it means we are replacing both of these VMs. The way the VMs are brought up, the control-plane VM usually joins the cluster before the worker VM. During the window between when the control-plane VM becomes available and when kubeadm applies the control-plane taint, Kubernetes can schedule pending and DaemonSet pods that do not tolerate the control-plane taint onto the control-plane VM. We then have to go through a clean-up procedure to remove the pods that slipped in. In some cases all we have to do is delete the pods, but in other cases there are dependencies, such as on PVCs, that complicate the cleanup. Ideally the best solution would be to eliminate this window between when the node becomes schedulable and when the taints are applied.
Since we support KubeletConfiguration patches, we can use --patches to achieve this in v1.25 and later, since https://github.com/kubernetes/kubeadm/issues/1682:
kubeadm join --patches=/tmp/kubeadm-patches xxx
The /tmp/kubeadm-patches/kubeletconfiguration+strategic.yaml looks like below:
registerWithTaints:
- key: "node-role.kubernetes.io/control-plane"
  effect: "NoSchedule"
Patches can work, yes. Currently both the labels and taints are applied in the same phase, mark-control-plane, which is nice and contained. If we are changing the phase we need a plan, because users execute it on demand too.
I personally have not seen many reports of the problem discussed here, and there are already workarounds by passing custom blocking taints on node registration and removing them after n minutes. Or, I guess, just passing a kubelet config patch.
EDIT: I wonder if we should just document this.
/unassign
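For reference, a sketch of the blocking-taint workaround mentioned above, written as the kubeletconfiguration patch shown earlier in the thread (the taint key is hypothetical):

```yaml
# /tmp/kubeadm-patches/kubeletconfiguration+strategic.yaml
# Hypothetical temporary taint that blocks all scheduling while the node is being set up.
registerWithTaints:
- key: "example.com/bootstrapping"
  effect: "NoSchedule"
```

Once the node is fully configured, the taint can be removed with kubectl taint nodes <node-name> example.com/bootstrapping:NoSchedule- and dropped from the patch, so a later kubelet restart does not re-apply it (the concern raised further down in the thread).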
EDIT: I wonder if we should just document this.
Agree.
While we established that the tainting is late and can cause problems, we have not seen reports from more users about it. Paco, are you seeing the problem on your side with customers?
Before we proceed with PRs I think we need to get back to the discussion of whether we want to do it. It also feels a bit odd to separate the taint and labels. Perhaps we want them to continue to be done by the kubeadm client instead of kubelet.
Paco, are you seeing the problem on your side with customers?
I have not hit this recently. BTW, this is not a P0 or P1 bug; it is maybe a P2-P4 issue, as it happens during cluster maintenance.
It also feels a bit odd to separate the taint and labels. Perhaps we want them to continue to be done by the kubeadm client instead of kubelet.
If we move it to the kubelet, the taint will be back once the kubelet restarts. It would make some users confused if they remove the taint on purpose. It seems to be a behavior change.
If we move it to the kubelet, the taint will be back once the kubelet restarts. It would make some users confused if they remove the taint on purpose.
that is indeed a concern
If we move it to the kubelet, the taint will be back once the kubelet restarts. It would make some users confused if they remove the taint on purpose.
that is indeed a concern
IMO, it is a design problem of the kubelet. One solution that I know of is that the kubelet should only add the node labels or taints the first time it bootstraps and registers with the apiserver. After the bootstrap (when the node is already added to the cluster), the kubelet should not change its node labels or taints itself; it should respect the apiserver/etcd storage. But this is another topic. If so, any cluster installer could add the default taints/labels during joining, and later kubelet restarts would cause no disturbance.
True, so unless the kubelet changes its design we can work around it by adding the taints once and then removing them from the kubelet flags or local config. I must admit I am not a big fan of this solution, as it creates more difficulties for us.
We can continue doing it in kubeadm with a client, perhaps sooner, but as far as I understand there could still be a short period when the taints are not there yet.
IMO this is not a very common issue. Ideally users should create the control-plane nodes and only then start applying workloads such as DaemonSets.
To summarize: use --patches when joining a new control plane. See https://github.com/kubernetes/kubeadm/issues/1621#issuecomment-1098649397.
Is this a BUG REPORT or FEATURE REQUEST?
Choose one: BUG REPORT
What happened?
Because we apply the control-plane taint after the control plane comes up, in a multi-control-plane case we can have pods scheduled to this control plane.
What you expected to happen?
Use the kubelet --register-with-taints config instead of handling it in a separate phase.
How to reproduce it (as minimally and precisely as possible)?
Anything else we need to know?
For now, I just use this config, but it would be nice if kubeadm could handle it.
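From the discussion above, the config in question is the kubelet's --register-with-taints flag; on Kubernetes v1.23 and newer the same intent can also be expressed as a KubeletConfiguration field (a sketch, not part of the original report):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# registerWithTaints was added to the config API in v1.23 (see the PR linked earlier);
# older kubelets only accept the --register-with-taints CLI flag.
registerWithTaints:
- key: "node-role.kubernetes.io/master"
  effect: "NoSchedule"
```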