kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Make the scheduler easily extensible #2313

Closed: abhgupta closed this issue 9 years ago

abhgupta commented 10 years ago

In its current form, if one needs a slightly different scheduler behavior, two things are required:

  1. A new predicate or priority function needs to be implemented (if not already available)
  2. The plugin/factory code needs to be re-written to consume the new predicate/priority function.

None of this is a big deal, but I would suggest that we make the existing scheduler implementation easily extensible by allowing users (admins) to specify which predicate/priority functions should be used. This can be done by specifying the predicate/priority functions via configuration that is provided to the scheduler factory. In the absence of the configuration, the existing functions can be used as default to initialize the scheduler.
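
To make this concrete, here is a rough sketch (in Go, with purely hypothetical names such as Config, PredicateFunc, PriorityFunc, and the registration maps, not existing kubernetes code) of how a factory could resolve admin-supplied function names against a registry and fall back to defaults when no configuration is given:

```go
package scheduler

import "fmt"

// A predicate filters minions for a pod; a priority scores the survivors.
type PredicateFunc func(pod, minion string) bool
type PriorityFunc func(pod string, minions []string) map[string]int

// Registries that predicate/priority implementations register into by name.
var (
	predicates = map[string]PredicateFunc{}
	priorities = map[string]PriorityFunc{}
)

// Config names the functions the cluster admin wants the scheduler to use.
type Config struct {
	Predicates []string
	Priorities []string
}

// Resolve turns configured names into concrete functions, falling back to
// the platform defaults when no configuration is supplied.
func Resolve(cfg Config) (preds []PredicateFunc, prios []PriorityFunc, err error) {
	if len(cfg.Predicates) == 0 {
		cfg.Predicates = []string{"PodFitsResources"} // assumed default
	}
	if len(cfg.Priorities) == 0 {
		cfg.Priorities = []string{"LeastRequested"} // assumed default
	}
	for _, name := range cfg.Predicates {
		p, ok := predicates[name]
		if !ok {
			return nil, nil, fmt.Errorf("unknown predicate %q", name)
		}
		preds = append(preds, p)
	}
	for _, name := range cfg.Priorities {
		p, ok := priorities[name]
		if !ok {
			return nil, nil, fmt.Errorf("unknown priority %q", name)
		}
		prios = append(prios, p)
	}
	return preds, prios, nil
}
```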

The existing generic scheduler could then become the default platform-provided scheduler "engine" that did the following:

  1. Filter the list of available minions based on constraints/requirements
  2. Prioritize the filtered list
  3. Select a minion from the prioritized list

We should probably allow an array of functions to prioritize the minion list. These functions would perhaps be applied sequentially, but as long as the priority function is swappable, this is not a big deal. Finally, we could make the mechanism for selecting a host (from the prioritized minion list) a configuration-specified function as well. None of this would impact the out-of-the-box experience for users, but it would greatly aid in handling a variety of use cases around scheduling.
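
Continuing the hypothetical sketch above (same assumed types; selectHost is reduced here to "take the top score", though the proposal would make that swappable too), the engine loop might look something like:

```go
// Schedule is an illustrative engine: filter, prioritize, then select.
func Schedule(pod string, minions []string, preds []PredicateFunc, prios []PriorityFunc) (string, error) {
	// 1. Filter the list of available minions based on constraints/requirements.
	var feasible []string
	for _, m := range minions {
		fits := true
		for _, pred := range preds {
			if !pred(pod, m) {
				fits = false
				break
			}
		}
		if fits {
			feasible = append(feasible, m)
		}
	}
	if len(feasible) == 0 {
		return "", fmt.Errorf("no minion fits pod %q", pod)
	}

	// 2. Prioritize the filtered list by summing the scores from each
	//    configured priority function.
	scores := map[string]int{}
	for _, prio := range prios {
		for m, s := range prio(pod, feasible) {
			scores[m] += s
		}
	}

	// 3. Select a minion from the prioritized list (here: the top score).
	best := feasible[0]
	for _, m := range feasible[1:] {
		if scores[m] > scores[best] {
			best = m
		}
	}
	return best, nil
}
```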

One concrete use case that is not handled today with the existing priority functions relates to regions/zones. We would like all pods within a service to be hosted on minions that are located within a certain region (identified by a node label). This can be handled today with NodeSelector labels on the pods. Within each region there are multiple non-affinity zones defined and there are multiple minions in each zone (think of a zone as a rack of servers/minions). The scheduler (in our use case) is expected to achieve non-affinity (good spread) across zones (a zone is specified using a label on minions as well). The existing spreading function does not have the ability to treat a group of minions within the same zone as same/similar and set host priority accordingly. For instance, if minion11 and minion12 are in zone1, and the service has a pod on minion11, then the priority function should be able to assign a low priority to both minion11 and minion12 and higher priority to the minions in other zones within the region.
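
For illustration only, a zone-aware spreading priority along these lines might look roughly like the following, where zoneOf (minion name to zone label) and serviceMinions (the minions currently hosting the service's pods) are assumed inputs rather than existing APIs:

```go
// ZoneSpreadPriority scores minions so that zones already hosting the
// service's pods are de-prioritized as a group.
func ZoneSpreadPriority(zoneOf map[string]string, serviceMinions []string) PriorityFunc {
	return func(pod string, minions []string) map[string]int {
		// Count how many of the service's existing pods land in each zone.
		podsPerZone := map[string]int{}
		for _, m := range serviceMinions {
			podsPerZone[zoneOf[m]]++
		}
		maxCount := 0
		for _, c := range podsPerZone {
			if c > maxCount {
				maxCount = c
			}
		}
		// Minions whose zone already hosts pods of the service score lower,
		// so minion11 and minion12 (both in zone1) are penalized together.
		scores := map[string]int{}
		for _, m := range minions {
			scores[m] = maxCount - podsPerZone[zoneOf[m]]
		}
		return scores
	}
}
```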

abhgupta commented 10 years ago

Some part of this discussion was initiated in https://github.com/GoogleCloudPlatform/kubernetes/issues/1965, but I thought it might be best to create a separate issue for this.

cc @smarterclayton @bgrant0607

lavalamp commented 10 years ago

I'm more or less on board with this, but won't have bandwidth to help for a few weeks, I think.

abhgupta commented 10 years ago

@lavalamp I might be able to take an initial stab at it, once we have basic consensus.

ddysher commented 10 years ago

I think we should first make the code structure more reasonable: move pkg/scheduler into plugin/pkg and separate the predicate/priority functions into their own packages.

brendandburns commented 10 years ago

@abhgupta this seems totally reasonable to me. I'm happy to help with reviews.

@ddysher I'd rather not force a re-factor as a blocker for this work. Making it configurable won't materially impact the difficulty of the re-factor, so I vote for making it configurable first and then doing whatever re-factoring we think is appropriate.

Personally, I don't think the current factoring is too terrible (it could be better, but it's not that bad).

abhgupta commented 10 years ago

Great! I'll pick this up and have something to share with folks sometime next week.

lavalamp commented 10 years ago

I agree that it's a bit silly to have both pkg/scheduler and plugin/pkg/scheduler, and that the former should be moved to plugin/pkg/scheduler/algo or some such; but also that this refactor shouldn't block anything.

ddysher commented 10 years ago

Ah, I didn't realize the refactor would be treated as blocking. Let me rephrase: we should also make the code structure more reasonable. :)

bgrant0607 commented 9 years ago

I'm on board with the configurability.

How to schedule across regions should be forked into another issue. I don't (and won't ever) recommend using a single Kubernetes cluster across regions in a production setting.

bgrant0607 commented 9 years ago

@abhgupta An important use case: combining spreading and resource-based scheduling. Any progress on this? Any questions?

abhgupta commented 9 years ago

This comment makes sense here...

I had been distracted by other work over the last couple of days but have been able to make progress on refactoring the scheduler to make it configurable. I should be able to share my changes for early feedback tomorrow.

In addition, I am now working to allow multiple priority functions to be specified, with their scores combined per minion (feedback requested on a simple normalization scheme) before the prioritized list of minions is handed to the selectHost function to pick one. I hope to have that ready to share tomorrow as well.

erictune commented 9 years ago

In #367 there was discussion about how to weight least-loaded vs. spreading. Some thoughts:

Taken to the extreme, the two mostly agree: if you use one machine per pod, you get great spreading and great isolation (the main reason you do least-loaded). You also get horrible packing efficiency.

So we shouldn't put a lot of effort into deciding how to weight spreading versus least-loaded.

abhgupta commented 9 years ago

@erictune My initial thoughts were around having something simple as a start.

Require each priority function to provide a minion score between 0 and 100. Then allow the scheduler configuration to take a simple "weight" (a positive numeric value) for each priority function. The score from each priority function is multiplied by its weight (default 1), and the weighted scores for each minion are then summed across all the priority functions. The person deploying the cluster can give different weights to different priority functions and so exercise basic control over how multiple priority functions act together.

Complicated cases definitely exist, but a simple way of combining priority functions might be helpful when one function prioritizes based on preference and another based on aversion to particular minions (something like a LeastPreferredPriority).
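
A rough sketch of that combination step, reusing the hypothetical PriorityFunc type from the earlier sketch (none of these names are real scheduler code):

```go
// weightedPriority pairs a priority function with its configured weight.
type weightedPriority struct {
	fn     PriorityFunc // e.g. spreading, least-requested, a LeastPreferredPriority
	weight int          // positive; treated as 1 when unset
}

// combineScores multiplies each function's per-minion score by its weight
// and sums the results before the list is handed to selectHost.
func combineScores(pod string, minions []string, prios []weightedPriority) map[string]int {
	combined := map[string]int{}
	for _, wp := range prios {
		w := wp.weight
		if w <= 0 {
			w = 1 // default weight
		}
		for m, s := range wp.fn(pod, minions) {
			combined[m] += w * s
		}
	}
	return combined
}
```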

abhgupta commented 9 years ago

@bgrant0607 would like your thoughts on combining priority functions as well.

bgrant0607 commented 9 years ago

@abhgupta In practice, we've found that a strict ordering of the priority functions mostly works fine. However, combining using weights is easy to implement, so I'm happy to start with that. How about restricting scores to 0-10? That way, it's much easier to reason about how to combine functions when a strict prioritization is desired. Besides, practically speaking, super-fine resolution of priorities isn't going to be useful.
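
As a quick worked example of the strict-ordering point (assuming the 0-10 scale and the additive weighting sketched above): with two functions where the dominant one has weight 11 and the secondary one weight 1, the secondary function can shift a minion's combined score by at most 10, while any one-point difference in the dominant function's score is worth 11. The dominant function therefore always decides the ordering and the secondary one only breaks ties; in general, giving a function a weight greater than 10 times the sum of the other weights reproduces strict prioritization.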

abhgupta commented 9 years ago

@bgrant0607 my rationale for 0-100 was not granularity, but rather just playing well with percentages. I am fine with 0-10 as well. It actually makes the result more useful, since minions with "similar" scores end up tied and all become candidates for selection.

bgrant0607 commented 9 years ago

The initial version of this is done, and there's a separate issue for configurability.