Open x13n opened 1 year ago
/help
@x13n: This request has been marked as needing help from a contributor.
Please ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
@x13n I'd be glad to work on this issue. What is the timeline? How can we coordinate this?
I think implementation can start right away. Suggested timeline:
/cc @MaciekPytel @gjtempleton @mwielgus
Any concerns with this approach?
In case people are tolerating only the existing taints, do we need to communicate the change a bit more than usual? E.g. an article on https://blog.k8s.io/
I don't have a strong opinion on this. On one hand I'm not sure what would be the use case for scheduling pods on a node that is being deleted. On the other hand, I guess it is better to over-communicate. How do we make a release note appear on the blog?
It'd be a new blog article - see https://kubernetes.io/docs/contribute/new-content/blogs-case-studies/ (which is a bit wrong but is mostly correct).
It looks like none of the CA taints are the NoExecute
type. If that's right, I don't think a blog article is needed.
Some thoughts for implementing double tainting:
- We will need to support the case where a user needs to keep the previous behavior (i.e., use the old taint only). To do so, we have to pass a boolean option from the autoscaling options to the taint-related functions. Currently, no autoscaling options are passed to these functions, so some restructuring might be needed.
I think we actually need not a bool, but an enum (so, in practice, a string) flag to allow one of 3 values: old / new / both.
- We will modify our taint methods to update the node object at one request, i.e., no extra requests to the API Server.
In various places we use
HasToBeDeletedTaint()
function, to check if the node is marked to be deleted. Since we have two taints, how do we consider a taint to be marked for deletion? I can think of two options:
- Require the node to have either of the taints (logical OR)
- Require the node to have both taints (logical AND)
The autoscaler will put both the old and new taint at one request, but what if the user somehow removes one of them? In that context, it might be safer to check if at least one of them exists, and does not require both. Please share your opinion on this.
I agree, OR seems to be the right check.
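The OR semantics could be sketched roughly like this (a minimal stand-in for corev1.Taint is used to keep the example self-contained; the new key name is hypothetical and still under discussion):

```go
package main

// Taint is a minimal stand-in for corev1.Taint, for illustration only.
type Taint struct {
	Key string
}

// Illustrative key names: the old key is the one CA uses today, the new
// key is hypothetical and still being decided.
const (
	oldTaintKey = "ToBeDeletedByClusterAutoscaler"
	newTaintKey = "to-be-deleted.cluster-autoscaler.kubernetes.io"
)

// hasToBeDeletedTaint treats a node as marked for deletion if either the
// old or the new taint is present (logical OR), so that a user manually
// removing one of the two taints doesn't hide the deletion signal.
func hasToBeDeletedTaint(taints []Taint) bool {
	for _, t := range taints {
		if t.Key == oldTaintKey || t.Key == newTaintKey {
			return true
		}
	}
	return false
}
```

With OR, a node stays recognized as "to be deleted" even if one of the two taints is stripped out of band.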
Some more questions for the plan
I was reiterating our plan, and have some questions:
- What is the purpose/use case of double tainting?
- Can't we simply introduce the new taints, with a flag to switch back to the old taints for the next release, and then gradually remove the flag in upcoming releases? In the meantime, we will notify users about the upcoming changes in our Changelog (so that they have time to prepare).
The scenario I can think of is that users' Pods are tolerating the existing taint, so they will need to tolerate the new taint too. In that case, if we want to keep backward compatibility, they will need to pass a flag and ask the autoscaler to use the old taints. If we use double tainting, how does the user benefit from it? Please provide any context I'm missing.
Cc: @x13n
I think in the case of tolerations you're right, double-tainting doesn't seem to provide a lot of benefit - there could be two tolerations on workloads to allow a gradual rollout. The use case I had in mind was actually a different dependency: when some other controller (like this one) uses the taint as a signal that CA will be deleting the node. To allow such use cases to migrate, we give them two signals that will coexist for at least 1 release cycle.
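To illustrate the coexistence idea, here is a rough sketch using simplified local stand-ins for the Kubernetes types. The key names and the helper are hypothetical; the real autoscaler would issue a single Nodes().Update call when the node changed, rather than one request per taint:

```go
package main

// Simplified stand-ins for the Kubernetes API types, illustration only.
type Taint struct {
	Key    string
	Effect string
}

type Node struct {
	Taints []Taint
}

// Illustrative keys: the old one is what CA uses today, the new one is a
// hypothetical k8s-style replacement.
const (
	oldTaintKey = "ToBeDeletedByClusterAutoscaler"
	newTaintKey = "to-be-deleted.cluster-autoscaler.kubernetes.io"
)

// addDeletionTaints appends whichever of the two deletion taints is
// missing, so both signals coexist on the node. It reports whether the
// node changed; in the real autoscaler a true result would trigger
// exactly one update request to the API server.
func addDeletionTaints(node *Node) bool {
	changed := false
	for _, key := range []string{oldTaintKey, newTaintKey} {
		exists := false
		for _, t := range node.Taints {
			if t.Key == key {
				exists = true
				break
			}
		}
		if !exists {
			node.Taints = append(node.Taints, Taint{Key: key, Effect: "NoSchedule"})
			changed = true
		}
	}
	return changed
}
```

Making the helper idempotent also means a reconcile loop can call it repeatedly without issuing redundant API requests.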
Heads up @thockin @alexanderConstantinescu
Hi @x13n @gregth, I see from the discussion here that the implementation approach is not finalized yet. I'm new to this project, but I would be happy to give it a try if you think this issue is still relevant.
Thanks for volunteering! I think there has been no discussion for a while now, so I'd proceed with the first step from my original description: double-tainting (flag gated and implemented within a single API call, as @gregth suggested). The flag should default to double tainting, but also allow "only new" and "only old".
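A minimal sketch of what such a flag could look like (the TaintMode type, its values, and ParseTaintMode are hypothetical names for illustration, not the actual autoscaler API):

```go
package main

import "fmt"

// TaintMode selects which deletion taints the autoscaler applies.
// Hypothetical type, for illustration only.
type TaintMode string

const (
	TaintModeOld  TaintMode = "old"  // legacy taints only
	TaintModeNew  TaintMode = "new"  // new k8s-style taints only
	TaintModeBoth TaintMode = "both" // default: double-tainting
)

// ParseTaintMode validates a flag value, defaulting to double-tainting
// when the flag is unset.
func ParseTaintMode(s string) (TaintMode, error) {
	switch TaintMode(s) {
	case TaintModeOld, TaintModeNew, TaintModeBoth:
		return TaintMode(s), nil
	case "":
		return TaintModeBoth, nil
	default:
		return "", fmt.Errorf("invalid taint mode %q, want old, new or both", s)
	}
}
```

Defaulting to "both" keeps existing tolerations and external consumers working while the new taints roll out, and "old"/"new" give an explicit escape hatch in either direction.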
In the light of https://github.com/kubernetes/org/issues/4258, we should probably align the prefix between components, to avoid migrating twice. @MaciekPytel @mwielgus @gjtempleton - thoughts?
Sorry for such a late response on this, but I'm back now and starting today. As agreed above, I will implement double-tainting with a flag to opt in to "only new" or "only old", with "both" as the default. /assign
Thanks! I think recently there were some discussions around unifying labels between CA and Karpenter now that it is part of the SIG. We need to make sure we don't have to migrate twice. @towca, can you chime in on this?
@x13n I'm almost done with the change and want to check something regarding the taint format before submitting the changes. The new format looks like this:
ToBeDeletedTaint = "to-be-deleted.cluster-autoscaler.kubernetes.io/"
DeletionCandidateTaint = "deletion-candidate.cluster-autoscaler.kubernetes.io/"
Basically, instead of using camelCase I have used - and added the cluster-autoscaler.kubernetes.io/ suffix. Does the format look correct, or as you expected?
Hey @fmuyassarov! As @x13n mentioned above, we're currently in discussions with Karpenter about unifying the common parts of our APIs. These taints are one such common part, and we still need to align on the exact API group and naming for them. Could you hold off your change until that happens? I can reach out when we have the naming ready. Unfortunately right now it's hard for me to predict when exactly that'll be (probably a couple of months), but I'd really like to avoid changing the API twice for this.
we're currently in discussions with Karpenter about unifying the common parts of our APIs.
Hey @towca. Yes, it totally makes sense to keep it on hold for now and thanks for the update. Is the discussion with the Karpenter community being tracked somewhere on GitHub where I can follow it too?
/hold
cc: @jonathan-innis, @ellistarn, @njtran
I think I found it; the discussion is probably happening in this google doc: https://docs.google.com/document/d/1_KCCr5CzxmurFX_6TGLav6iMwCxPOj0P/edit
I think I found it, probably the discussion on happening on this google doc
Hey 👋🏼 chiming in here from the Karpenter maintainer team. I don't think we have really gone back to that doc for active discussion in a bit. I've been talking with @towca a bit on this, but I agree that it's going to take some time for both of the projects to come into alignment. I'd be open to starting a discussion over in the K8s slack with everyone who wants to be involved. We could just do it in the #sig-autoscaling channel, though I expect that it might get lost in conversation. It would be really nice if we could just call this a "working group discussion" and get a "wg" channel started in the K8s slack, so we can get all of the relevant parties involved and have an active discussion over there specifically on this topic.
@jonathan-innis I'll look into setting up the proper communication channels to discuss this.
Which component are you using?:
Cluster Autoscaler
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
Existing CA taints don't follow the k8s naming pattern like our annotations do (cluster-autoscaler.kubernetes.io/...)
Describe the solution you'd like.:
We should migrate to new taints and document them on https://kubernetes.io/docs/reference/labels-annotations-taints/
Steps:
Describe any alternative solutions you've considered.:
Additional context.:
See https://github.com/kubernetes/website/issues/45074