twosigma / Cook

Fair job scheduler on Kubernetes and Mesos for batch workloads and Spark
Apache License 2.0

Initial implementation for submitting jobs directly to Kubernetes Scheduler #2150

Closed ahaysx closed 2 years ago

ahaysx commented 2 years ago

Kubernetes Bin Packing

Configure pools to submit real job pods directly to Kubernetes, which handles the scheduling and cluster autoscaling. Pools that use the Kubernetes Scheduler will not use the "match-offer" loop, the Fenzo bin-packing algorithm, or synthetic pods.
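To make the "direct submission" idea concrete, here is a minimal Python sketch of the kind of pod manifest such a pool would hand to Kubernetes. The function name, labels, and field values are illustrative, not Cook's actual ones; the point is that the pod carries resource requests and leaves placement to the default kube-scheduler (no synthetic placeholder pod, no node pinned by Fenzo):

```python
# Sketch of a job pod submitted directly to the Kubernetes scheduler.
# Names and values are hypothetical, not taken from Cook's codebase.
def make_job_pod(job_id, image, cpus, mem_mb):
    """Build a pod manifest that delegates placement to kube-scheduler."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"cook-job-{job_id}",
            "labels": {"app": "cook-job"},  # hypothetical label
        },
        "spec": {
            # No "nodeName" is set: kube-scheduler chooses the node,
            # so Cook runs no match-offer loop or Fenzo bin packing.
            "restartPolicy": "Never",
            "containers": [{
                "name": "job",
                "image": image,
                "resources": {
                    # Requests drive both scheduling and cluster autoscaling.
                    "requests": {"cpu": str(cpus), "memory": f"{mem_mb}Mi"},
                    "limits": {"memory": f"{mem_mb}Mi"},
                },
            }],
        },
    }

pod = make_job_pod("abc123", "busybox:1.36", 0.5, 256)
```

A real submission would then POST this manifest to the API server (for example with a Kubernetes client library's create-pod call).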

This is a proof of concept that will require a few iterations to reach a production state. However, my internal tests show this feature behaves well. The following is my current understanding of the gaps between this changeset and a version we can test with real jobs. I'm sure there are more; please let me know and I can document them here.

Configuration

#2139 introduced per-pool scheduler configuration. Here is an example of running the Kubernetes Scheduler feature in one pool and Fenzo in all others.

:pools {
  ...
  :schedulers [{:pool-regex "^k8s-exp$"
                :scheduler-config {:scheduler "kubernetes"
                                   :max-jobs-considered 1000}}
               {:pool-regex ".*"
                :scheduler-config {:scheduler "fenzo"
                                   :good-enough-fitness 1.0
                                   :fenzo-fitness-calculator "com.netflix.fenzo.plugins.BinPackingFitnessCalculators/cpuMemBinPacker"
                                   :fenzo-max-jobs-considered 400
                                   :fenzo-scaleback 1
                                   :fenzo-floor-iterations-before-warn 10
                                   :fenzo-floor-iterations-before-reset 1000}}]
}
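For illustration, a small Python sketch of how per-pool scheduler selection could work under this config, assuming entries are tried in order and the first matching :pool-regex wins (the catch-all ".*" placed last suggests that semantics). The function name and data shape are hypothetical:

```python
import re

# Mirrors the :schedulers config above; entries are tried in order.
SCHEDULERS = [
    {"pool-regex": "^k8s-exp$",
     "scheduler-config": {"scheduler": "kubernetes", "max-jobs-considered": 1000}},
    {"pool-regex": ".*",
     "scheduler-config": {"scheduler": "fenzo", "fenzo-max-jobs-considered": 400}},
]

def scheduler_for_pool(pool_name, schedulers=SCHEDULERS):
    """Return the scheduler config of the first entry whose regex matches."""
    for entry in schedulers:
        if re.search(entry["pool-regex"], pool_name):
            return entry["scheduler-config"]
    raise ValueError(f"no scheduler configured for pool {pool_name!r}")

print(scheduler_for_pool("k8s-exp")["scheduler"])  # kubernetes
print(scheduler_for_pool("prod-us")["scheduler"])  # fenzo
```

The anchored "^k8s-exp$" matches only that exact pool name, so every other pool falls through to the Fenzo entry.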

Remaining Work