Using nodeSelector/tolerations with Serving?

jeremyrickard commented 6 years ago

👋 This is kind of part question/feature request!

Could Revision be extended to support using Tolerations and/or Node Selectors? I get the reason why it doesn't include the entire pod spec, but I think some additional attributes to influence scheduling could be useful. I looked through the current types and I don't see how I can do this today, but I might have missed something!

For example, I might care about where my thing gets scheduled if....

I have a hybrid cluster with both Linux and Windows nodes. I've got some things I want to run with KNative that need to run on the windows nodes. Those nodes are tainted, so I'd need to utilize label selectors / tolerations to make that a thing.

In most cases, I probably don't care where things run because serverless...but somettimes I think there are. This change seems like it would be useful and not have a ton of impact to user experience or complexity.

I'd be interested in working on this if it's something that seems palatable. cc: @krancour

jeremyrickard commented 6 years ago

Whoa....knative-prow-robot added lots of labels! 🌈

grantr commented 6 years ago

@jeremyrickard Somewhat surprisingly, the robot sees commands in html comments. If you edit your issue body to remove those, it might remove the labels too (not sure if it notices edits).

jeremyrickard commented 6 years ago

Ha, thanks @grantr. I totally skipped over:

Pro-tip: You can leave this block commented, and it still works!

My bad 😫

Doesn't look like prow updated the labels, sorrrry (i still like you prow!)

grantr commented 6 years ago

Taking a stab at cleaning up the labels @jeremyrickard, feel free to modify as you see fit.

/remove-area autoscale /remove-area build /remove-area monitoring /remove-area networking /remove-area test-and-release

/remove-kind question /remove-kind bug /remove-kind cleanup /remove-kind doc /remove-kind good-first-issue /remove-kind process /remove-kind spec

jeremyrickard commented 6 years ago

Thanks @grantr! 🙇

krancour commented 6 years ago

/assign krancour /assign jeremyrickard

mattmoor commented 6 years ago

Let's discuss at the API WG?

mattmoor commented 6 years ago

We should also probably hold on this until we've firmed up our API versioning story, and talk about it for v1alpha2 (or w/e is next).

In the meantime, I think you can achieve something similar via a webhook that injects this into the Deployments we create (should have several labels you can key off of).

I'd also love to have a conversation about what is needed to get things working across both of your node pools (if you are up for it, let's open another issue?).

jeremyrickard commented 6 years ago

In the meantime, I think you can achieve something similar via a webhook that injects this into the Deployments we create (should have several labels you can key off of).

Ahh, that's an idea I can look into!

krancour commented 6 years ago

@mattmoor happy to discuss anytime. Love @jeremyrickard's use case, but I also think, in general, that because Revision is composed (among other things) of ContainerSpec and not PodSpec (and rightfully so, afaict), all support for influencing where a workload runs is completely absent at present. I'm sure that where a workload runs is exactly the kind of thing we'd rather Knative users not concern themselves with, but with all nodes certainly not being created equal, I'm really not sure how we could afford the luxury of completely ignoring such a concern.

In the meantime, I think you can achieve something similar via a webhook that injects this into the Deployments we create (should have several labels you can key off of).

That sounds significantly less elegant to me, tbh.

mattmoor commented 6 years ago

That sounds significantly less elegant to me

Ack. At the moment I'm being pretty paranoid about API versioning, until we firm up our story (in part blocked on CRDs having a story). The suggestion itself is based on our (evolving) guidance for how Knative operators/vendors would augment the runtime contract with their own extensions. Sidecar injection is a similar pattern to this, and also arguably overkill.

TBH, with multi-platform clusters (where most images won't "just work") you face this problem across the board (not just Knative). It'd be a pretty nifty mutating admission controller that looked at PodSpec, cracked open the image(s) and automatically injected these annotations. That'd likely solve this problem for Knative, but also avoid what I imagine is a common DevX pitfall on these clusters. I'd point you at https://github.com/google/go-containerregistry, but I don't think we've added support for manifest lists yet (nudge @jonjohnsonjr ).

mattmoor commented 6 years ago

You could also validate images against user-provided nodeSelector constructs and return a friendlier error than whatever cryptic image pull back-off they'd get otherwise.

krancour commented 6 years ago

We probably shouldn't look at this only in terms of what host OS is required by a given container and present on a given node. There are certainly other reasons that workloads could be attracted to or repelled from certain nodes.

mattmoor commented 6 years ago

@krancour That's fair, I still think it'd be neat for platform/architecture :)

krancour commented 6 years ago

@mattmoor, ok... what are next steps then?

mattmoor commented 6 years ago

I put it on the agenda for the next API WG. Even with unanimous support, I think I'd like to understand what our versioning story is going to look like before we start making API changes (cc @jonjohnsonjr ).

maaraoffl commented 6 years ago

I am happy to see this feature implemented. NodeSelector & Tolerations will be quite helpful to target nodes for function deployment in multi-cloud cluster.

markusthoemmes commented 5 years ago

There's an ongoing thread on users list about the requirements for this feature. We're looking at user feedback to decide if we implement this. https://groups.google.com/forum/#!topic/knative-users/cRP14_hAybM

Why debating anyway?

Doing this has the potential to break abstractional layers in that suddenly the user gets to decide which nodes will be used to schedule her pods. In principle, that might be fine but it has the potential to limit the liberty that Knative can take to schedule pods around.

Use-cases so far:

mixed GPUs
mixed Windows/Linux
mixed pre-emptible nodes
virtual kubelet experimentation

DavidLangworthy commented 5 years ago

Is there any way to get this working today?

I need to assign some pods to particular pools. I've tried attempting to patch the generated deployment (yes, I know this is bad). The patch returns successfully, but does not alter the deployment.

This is the patch I am using

[
    {"op": "add", "path": "/spec/template/spec/nodeSelector", "value": {"agentpool":"fabrikam"}},
    {"op": "add", "path": "/spec/template/spec/tolerations", "value": 
        [
            {
                "key": "partition",
                "operator": "Equal",
                "value": "fabrikam",
                "effect": "NoExecute"
          }
        ]
    }
]

mattmoor commented 5 years ago

We'd reconcile the change away, which is probably why it seems like the patch wasn't applied. You could do this via a mutating admission webhook on Deployments and we should respect that.

mattmoor commented 5 years ago

It also seems you can do this at a namespace level today: https://stackoverflow.com/questions/52487333/how-to-assign-a-namespace-to-certain-nodes

DavidLangworthy commented 5 years ago

Thank you. I'll look into those options.

krancour commented 5 years ago

@mattmoor a mutating admission webhook would work, but isn't implementing one of those an awful lot to ask of users who have this relatively common need? It's a solution that is disproportionately complex in relation to the problem is solves.

I totally understand that granting users power over pod placement is antithetical to the notion of serverless. Having said that, Knative isn't merely serverless-- it's quite specifically serverless on Kubernetes. Indeed, it's inextricable from its dependency on Kubernetes. There is literally no one who uses Knative who isn't accutely aware that there is Kubernetes under the hood. It's not really hidden, after all. So isn't Knative already fundamentally in violation of the academic argument that serverless platforms should expose nothing about the underyling infrastructure? And if so, what practical purpose is served by not surfacing a useful Kubernetes feature? It's not as if Kubernetes would be completely hidden from view if it weren't for those pesky node selectors.

And please don't get me wrong... I have no emotional attachment to my PR or the API changes it implements. I'm not arguing for any specific implementation here so much as I am arguing that there should be some way to do this.

mattmoor commented 5 years ago

I agree that writing a webhook is complex, my latest comments are simply enumerating some options that exist today.

Frankly, options like:

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/node-selector: env=production

Are appealing because it separates our node-selection to a resource (Namespace) that is likely set up by an operator administering the cluster for a developer. Separating Developer and Operator concerns is a place Kubernetes has done a poor job historically.

IDK if anything similar exists for affinity, or whether the annotation has any real future, but it's probably a better direction for separating the concerns of these roles / personas in the long run.

There is literally no one who uses Knative who isn't accutely aware that there is Kubernetes under the hood

If you mean operators, then sure. If you mean developers, then I can think of quite a few.

this relatively common need

I agree that this is common enough that I'd like a solution, but I'm not sure it is common enough to warrant rushing into it. After all, it seems Kubernetes hasn't fully made up its mind with respect to "node selectors" vs. "node affinity". Do we need to support both? Should we stick with what K8s recommends?

We didn't want to lose the record of use cases, which is what we've been using this issue to accumulate.

krancour commented 5 years ago

@mattmoor re: node selection on the basis of namespace, that assumes that everything in that namespace should have the same node selectors. I'm sure there are cases where that is fine. I'm equally sure that that's not ok in many cases.

If you mean operators, then sure. If you mean developers, then I can think of quite a few.

Regardless of whether one counts it as a developer responsibility or an ops responsibility, I'm talking about whoever ends up contributing the Kubernetes manifests that contain Knative CRDs. That person cannot help but be aware of Kubernetes. That's what prompted me to say, "It's not as if Kubernetes would be completely hidden from view if it weren't for those pesky node selectors."

but I'm not sure it is common enough to warrant rushing into it.

This issue is a year old. 😉

After all, it seems Kubernetes hasn't fully made up its mind with respect to "node selectors" vs. "node affinity". Do we need to support both? Should we stick with what K8s recommends?

This is a solid point, but it's also worth asking how long Knative users should go without an easy solution for this common problem while we wait for k8s to figure this out. (I can't answer that. 🤷‍♂ )

DavidLangworthy commented 5 years ago

@mattmoor's comment fixed my immediate problem so I am unblocked. Thank you.

Regarding the overall issue--

Our use case deals with a single cluster with node pools of different compute types-- CPU, GPU & FPGA. Even within GPU there are different types e.g. K80 and V100. This difference is already exposed at the dev (or data scientist) layer. In PyTorch the code will call .to('cuda') to load model and parameters for evaluation.

To manage this in kubernetes we use both nodeSelector and tolerations. Taint/tolerations are required because a V100 is an expensive resource and we cannot allow workloads that do not require it to run on those nodes.

We would like to keep the namespace for logicial grouping of services and not couple it to physical properties. Using PodNodeSelector and PodTolerationRestriction will require we spawn qualified namepaces e.g. fabrikam-cpu, fabrikam-k80 . . . So, placing the "right" abstraction into KNative would help maintain a logical/physical separation.

I am not sure exactly what that abstraction is. Simply flowing through nodeSelectors and tolerations would work, but might not be the best. It seems that there is enough interest that some facility should be put into place. If it turns out that it is not the right facility or that kubernetes evolves in this area, the versioning scheme is strong and will keep running systems from breaking.

DavidLangworthy commented 5 years ago

FWIW, I tried the PodNodeSelector along with the PodTolerationRestriction which is also needed. There was some interaction with Istio that prevented KNative from deploying properly. So unless someone else has gotten this to work, then this out of the box solution will not work.

I'm going to take another pass with an Admission Controller webhook.

juliendangers commented 5 years ago

We have the same mixed CPU/GPU use case than @DavidLangworthy.

And we also have some occasional workloads that require much more resources than our usual workloads. We'd like to use dedicated nodes for that, and take advantage of scale to 0 feature of knative, so cluster-autoscaler will spin a new node if a pod is scheduled (we don't care about starting time), and remove it once there's no more pod on it. We want to be sure that no other workload is scheduled on those nodes (because they're more expensive...), so taints and tolerations are a good (the only?!) fit here.

But as a user, having to write a webhook to achieve that seems way too complex 😭 At least may be an example webhook would reduce the complexity of it until a real solution is found. But we cannot expect every user to implement a custom webhook to achieve things they easily use otherwise.

patrickshan commented 5 years ago

Our use case for this is that some services request to run on a special type of cloud instances, like compute intensive workload to run on AWS c5 instances. currently we have different ASGs with different taints and different instance types. By using node selector and toleration, we could schedule those services to their dedicated node group.

cyrildiagne commented 5 years ago

Does anyone knows how it works for GPU tolerations on GKE?

Because on GKE, if I specify an nvidia.com/gpu: 1 resource spec in the template of my Knative service, the pods created by knative are automatically annotated with the appropriate nvidia.com/gpu toleration.

Endofunctor commented 5 years ago

@DavidLangworthy Did you work out a solution to this? Serving seems to be useless for machine learning workloads without it.

DavidLangworthy commented 5 years ago

We have implemented an operator and webhook to manage resource allocation and some isolation. It is not specific to KNative, but works fine with it.

knative-housekeeping-robot commented 4 years ago

Issues go stale after 90 days of inactivity. Mark the issue as fresh by adding the comment /remove-lifecycle stale. Stale issues rot after an additional 30 days of inactivity and eventually close. If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle stale

mattmoor commented 4 years ago

/remove-lifecycle stale

We need to figure out something here.

DavidLangworthy commented 4 years ago

We (Azure ML) have implemented something that works orthogonally to KNative (and KF Serving) that does what we and several others have asked for. It allows KNative services to be completely unaware of compute details. An admin publishes a CR that creates an agent pool of the desired SKU. This agent pool is associated with a namespace and anything that is published to that namespace is bound to that agent pool via a web hook.

Would this be of interest?

mattmoor commented 4 years ago

I'd definitely like to understand more. We've been trying to find the right balance between allowing leaky operator stuff into Developer-facing API and this is definitely one of the canonical fields that hits that. There are other types of fields that face similar concerns (runtime class, dns policy), so I'm hoping we will be able to devise a general enough solution that we can knock off effectively all of them (if we can).

Would you be open to coming to an API WG call and demoing?

DavidLangworthy commented 4 years ago

I'd be happy to join a call. What's the schedule?

mattmoor commented 4 years ago

Typically every Weds at 10:30 PST.

krancour commented 4 years ago

@DavidLangworthy

This agent pool is associated with a namespace and anything that is published to that namespace is bound to that agent pool via a web hook.

Something like this was discussed previously in this thread. I'd like to point out now, as I did then, that it's probably not a valid assumption that everything in the same namespace should always execute on the same type of node.

duglin commented 4 years ago

Have a PoC that works via labels. Set the right label and it'll direct that Revision only to nodes that are tagged similarly/appropriately. How those nodes are created/tagged, access controlled to ensure only appropriate users can use them, are out none of the user's concerns.

krancour commented 4 years ago

@duglin I'm missing something, I'm sure. It's not obvious to me how basing it on labels is effectively different than node selectors and tolerations.

The implementation details may be different between the two, sure, but isn't it true in both cases that the user requires a tacit recognition that their workload is dependent on a certain type of node and the user must also know the "magic incantations" to effect scheduling on such node? What difference does it make if that "magic incantation" is a node selector or a label?

Since I assume I am misunderstanding your suggestion, maybe you can set me straight?

duglin commented 4 years ago

Nope, you're right it's basically the same thing. Only difference is that the node selectors/toleration are hidden behind the label. Only reason i mentioned it is because some seemed concerned about complicating the UX with the kube-specific details.

krancour commented 4 years ago

@duglin gotcha. Thanks.

Only reason i mentioned it is because some seemed concerned about complicating the UX with the kube-specific details.

Personally, I'd struggle to see that as an improvement only because node selectors and tolerations are configurations with fairly well-understood semantics behind them. By contrast, using labels for this would obfuscate behavior a bit by requiring special foreknowledge of how a given label affects placement.

duglin commented 4 years ago

I kind of view it similar to how we use annotation to control things like min/max scaling, and people seem ok with those. Only reason labels were used was because kn used -l for labels, where it doesn't support -a for annotation - so less typing :-) But, if this became "a thing" then a dedicated kn flag that maps to annotation would work just as well.

krancour commented 4 years ago

@duglin ack.

I get it. But with annotations, I'm willing to bet that you frequently find yourself referencing docs for product x to find what annotation does what. I know I do. It is what it is, but is it optimal?

duglin commented 4 years ago

I think it depends on how you're viewing the UX. On the one hand I totally agree that allowing people to extend the podSpec to include Tolerations and/or Node Selectors would be nice because then it's a nice migration path for existing Kube users.

The flip-side is that I'm not sure I would want to expose that via knor even via the yaml as the "simplified" way to do things, it feels a bit too technical for a UX that's trying to simplify the Kube experience. Matt's suggestion of an annotation like:

scheduler.alpha.kubernetes.io/node-selector: env=production

has some appeal to me because I think that could be easily mapped to a nice UX in kn, and TBH is actually pretty close to what have in our PoC.

I can see a case for allowing both, we'd just need to flag it as an error if they do both at the same time via yaml.

sumowrestler commented 4 years ago

I've just hit this issue now, made assumptions it would work with labels - going to give the namespace workaround a try. would really appreciate having an ability to do this without complexity. UPDATE: EKS doesn't support this: https://github.com/aws/containers-roadmap/issues/304

knative-housekeeping-robot commented 4 years ago

Issues go stale after 90 days of inactivity. Mark the issue as fresh by adding the comment /remove-lifecycle stale. Stale issues rot after an additional 30 days of inactivity and eventually close. If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle stale

duglin commented 4 years ago

Where do we stand on this? Ignoring the syntax of how to expose the option (annotation vs podSpec fields), is this something people are ok with exposing?

mattmoor commented 4 years ago

I was just talking to @dprotaso about this today. Now that we have config-features I think we should add this as something that can be allowed through like we have for kubernetes.field-ref (thanks to @JRBANCEL).

@duglin Do you have someone who might be interested in plumbing this through in 0.17?

knative / serving

Using nodeSelector/tolerations with Serving? #1816

Why debating anyway?

Use-cases so far: