openfaas / faas

OpenFaaS - Serverless Functions Made Simple
https://www.openfaas.com
MIT License

Proposal: Kubernetes tolerations #1125

Closed alexellis closed 4 years ago

alexellis commented 5 years ago

Feature: Kubernetes tolerations

https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/

Expected Behaviour

Tolerations allow a Pod (function) to be scheduled on a node where there is a "taint" preventing Pods from being scheduled there.
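As an illustration of the matching rule documented at the link above, here is a minimal Python sketch of how a toleration is compared against a taint. It is a simplified model for discussion, not the scheduler's code; the function names `tolerates` and `schedulable` and the `dedicated=functions` taint used in the usage note are assumptions made up for this example.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect"):
        return False

    operator = toleration.get("operator", "Equal")  # "Equal" is the Kubernetes default
    key = toleration.get("key", "")

    if operator == "Exists":
        # An empty key combined with Exists tolerates every taint key.
        return key == "" or key == taint.get("key")
    if operator == "Equal":
        return (key == taint.get("key")
                and toleration.get("value", "") == taint.get("value", ""))
    return False


def schedulable(pod_tolerations: list, node_taints: list) -> bool:
    """A Pod fits a node only if every hard taint on that node is tolerated."""
    # PreferNoSchedule is a soft preference, so it never blocks scheduling here.
    blocking = [t for t in node_taints
                if t.get("effect") in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in pod_tolerations)
               for taint in blocking)
```

With a node tainted `dedicated=functions:NoSchedule`, `schedulable([], [taint])` is false, which is the situation a constraint alone cannot resolve.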

Current Behaviour

We support constraints and labelled node-pools for scheduling, but a user raised a scenario where a node-pool has a taint, so constraints are not enough - they also need to add a "toleration" to the function Pod.

Possible Solution

We could extend the function schema to allow the Kubernetes tolerations spec to be specified.

This is not available in Swarm and possibly not available in the other back-ends, so it's the first hard requirement for orchestrator-specific knowledge in the OpenFaaS API.

Suggestions are welcome.

Steps to Reproduce (for bugs)

  1. Taint a node-pool
  2. Deploy a function with a constraint to run in that node-pool
  3. The function cannot be scheduled
  4. Apply a manual or programmatic taint

Context

This affects advanced scheduling scenarios on Kubernetes.

Your Environment

mercul3s commented 5 years ago

Using annotations for this purpose sounds pretty reasonable. Annotations are already supported within the OpenFaaS API, so this should only require modifying faas-netes to work with Kubernetes, i.e. looking for a tolerations key in the annotations map and adding its value to the deployment spec. This would ensure the OpenFaaS API doesn't have to be aware of Kubernetes-specific tolerations, while still allowing support for them via faas-netes.
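A rough Python sketch of that idea follows, using plain dictionaries in place of the Kubernetes API types. It is not how faas-netes works today: the annotation key `tolerations`, the JSON encoding of its value, and the function name `apply_tolerations` are all assumptions for illustration.

```python
import copy
import json

# Hypothetical annotation key; the thread does not fix a name or an encoding.
TOLERATIONS_ANNOTATION = "tolerations"


def apply_tolerations(deployment: dict, annotations: dict) -> dict:
    """Copy a JSON-encoded tolerations annotation into the Pod template spec."""
    raw = annotations.get(TOLERATIONS_ANNOTATION)
    if not raw:
        return deployment  # nothing requested, leave the deployment untouched

    tolerations = json.loads(raw)
    if not isinstance(tolerations, list):
        raise ValueError("tolerations annotation must be a JSON list")

    result = copy.deepcopy(deployment)  # do not mutate the caller's object
    pod_spec = (result.setdefault("spec", {})
                      .setdefault("template", {})
                      .setdefault("spec", {}))
    pod_spec["tolerations"] = tolerations
    return result
```

Keeping the value as an opaque string means the OpenFaaS API only ever sees an ordinary annotation; only the Kubernetes provider would need to interpret it.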

alexellis commented 5 years ago

I chatted with @stefanprodan and @embano1 - we had an idea for a way for users to extend OpenFaaS.

If you use a Mutating Webhook Admission Controller [1] then you can look for your annotation and act upon it as soon as the Pod's creation is requested. You get all the benefits of automation and of extending the API in an open-closed way, without creating any engineering burden on the community. The kubewebhook project by slok [2] can be used to put this together in a very short period of time.

Stefan suggests you'll need the CA from the cluster:

kubectl get configmap -n kube-system extension-apiserver-authentication -o=jsonpath='{.data.client-ca-file}' | base64 | tr -d '\n'

To sign the TLS key needed for the controller.

This may even be possible to deploy as an OpenFaaS function itself.

[1] https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#mutatingadmissionwebhook-beta-in-19

[2] https://github.com/slok/kubewebhook

This could be a great example of how to extend OpenFaaS on Kubernetes for others to follow, too.
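To make the webhook idea above more concrete, here is a minimal Python sketch of the mutation step only: it takes an AdmissionReview request for a Pod and answers with a base64-encoded JSONPatch that adds tolerations. The annotation key `tolerations` and the function name `mutate` are assumptions; TLS serving, webhook registration and error handling are deliberately left out, and this is not the kubewebhook API.

```python
import base64
import json

# Hypothetical annotation key that the webhook would look for on the Pod.
TOLERATIONS_ANNOTATION = "tolerations"


def mutate(review: dict) -> dict:
    """Build an AdmissionReview response that patches tolerations into a Pod."""
    request = review["request"]
    pod = request["object"]
    response = {"uid": request["uid"], "allowed": True}

    annotations = pod.get("metadata", {}).get("annotations") or {}
    raw = annotations.get(TOLERATIONS_ANNOTATION)
    if raw:
        existing = pod.get("spec", {}).get("tolerations") or []
        # A JSON Patch "add" on an object member sets or replaces that member,
        # so the existing tolerations are carried over explicitly.
        patch = [{"op": "add",
                  "path": "/spec/tolerations",
                  "value": existing + json.loads(raw)}]
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()

    return {
        # Echo the API version of the request where one was given.
        "apiVersion": review.get("apiVersion", "admission.k8s.io/v1"),
        "kind": "AdmissionReview",
        "response": response,
    }
```

Pods without the annotation are simply allowed unchanged, so the webhook would stay invisible to functions that do not opt in.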

mercul3s commented 5 years ago

@alexellis ok, let me give that a try - Admission Controllers are new to me, but kubewebhook looks pretty straightforward. I'll follow up with results here.

alexellis commented 5 years ago

Keep us in the loop on it and feel free to ask questions on Kubernetes Slack or in our own #kubernetes channel on OF Slack.

alexellis commented 5 years ago

@mercul3s any progress on this? cc @stefanprodan

mercul3s commented 5 years ago

@alexellis Sorry for the late reply, I've been afk and out of the country the past week and a half. I did not have any luck getting an admission controller working using kubewebhook, as I ran into a bug using their example code (panics on error handling, from what I can tell). I have some other examples to work from, so I'm going to give it another shot over the next couple of days and hopefully will get something working.

alexellis commented 5 years ago

Good to know. I'm exploring how to extend the CLI to accept new arbitrary values and this will pave the way for additional Kubernetes deployment constructs. https://github.com/openfaas/faas-cli/issues/623

Depending on your timeline you could either carry on or help us implement the extension end to end.

mercul3s commented 5 years ago

@alexellis I like that idea - going to read through and will add comments to that issue. Would be awesome to see some kind of support for tolerations within OpenFaaS, as I would consider mutating controllers to be a workaround rather than a full solution for supporting tolerations.

I'd definitely be willing to help implement Kubernetes extensions. I'm also going to continue to work on getting a mutating controller up and running for the time being, as having some support for tolerations means that we'll actually be able to use OpenFaaS in production now.

alexellis commented 5 years ago

> I would consider mutating controllers to be a workaround rather than full solution for supporting tolerations

I agree, it is on the backlog.

> I'd definitely be willing to help implement kubernetes extensions

💯

Are you on the Slack workspace yet? It might be worth joining.

-- Join Slack to connect with the community https://docs.openfaas.com

mercul3s commented 5 years ago

@alexellis yup, I'm on the slack and in the kubernetes channel.

anmtan commented 5 years ago

We use dedicated nodes to run different functions. It would be great to see OpenFaaS support Kubernetes tolerations.

alexellis commented 5 years ago

Hi everyone, this is an important feature on the roadmap. From what I understand so far, a generic or one-time toleration for all functions may be suitable. We could apply this to the OpenFaaS Kubernetes controllers at deploy time.

This is also simpler to add for those who feel this is a blocking issue.

Dedicated node pools can be used with Kubernetes constraints in the stack YAML where taints haven't been used.

See also: linked faas-cli issue above on how to map these very complex, leaky structures into the OpenFaaS YAML.

Cc @ibuildthecloud

andeplane commented 4 years ago

We at Cognite are also interested in this functionality to run all functions on a node pool with specific settings (preemptible nodes, different resources etc).

alexellis commented 4 years ago

A couple of people have mentioned this topic recently. For most people, a node label and a constraint may be enough. This issue is specifically about scheduling to nodes which have a taint on them.

ghost commented 4 years ago

To add, at Redscan Cyber Security we were also looking for this feature. I'm now using constraints to make use of the nodeSelector directive in k8s. It works well; you essentially:

  1. Label your node with whatever key value pair you like;
  2. Add a constraint with a matching key value pair in the yaml function definition as below:
    constraints:
    - <key>=<value>

This causes the function pod to be deployed to a node with a matching label.
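The mapping from constraints to a nodeSelector described above can be pictured with the short Python sketch below. It only illustrates the idea and is not the faas-netes implementation; the function name `constraints_to_node_selector` and the strict `<key>=<value>` parsing are assumptions.

```python
def constraints_to_node_selector(constraints: list) -> dict:
    """Turn "<key>=<value>" constraint strings into a nodeSelector mapping."""
    selector = {}
    for constraint in constraints:
        key, sep, value = constraint.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected <key>=<value>, got {constraint!r}")
        selector[key.strip()] = value.strip()
    return selector
```

A nodeSelector only steers the Pod towards labelled nodes. It does not override a taint on those nodes, which is why this workaround covers labelled node pools but not tainted ones.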

alexellis commented 4 years ago

Closing in favour of https://github.com/openfaas/faas-netes/issues/586

alexellis commented 4 years ago

/lock