Changes proposed in this PR
The Kenzo loop will now account for available scheduling capacity before "assigning" a job to an acceptable compute cluster. This is done via a pos? filter in the distribution function.
Kubernetes compute clusters now report their "max launchable" stat, which is a measure of pods without assigned nodes. This is our best approximation of pressure inside the Kubernetes Scheduler.
The Kenzo handler will not try to schedule more jobs than the total available scheduling capacity across all compute clusters. This reduces expensive work for jobs that would be filtered out at the distribution stage.
The Kenzo handler will short-circuit and delay the next scheduling loop if there isn't sufficient scheduling capacity (configurable). This avoids incurring expensive cycle overhead for a small number of considerable jobs.
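For reviewers, here is a minimal Python sketch of how the three behaviors above could fit together. It is an illustration only, not the code in this PR: the names `Cluster`, `max_launchable`, `plan_cycle`, and `min_capacity` are all hypothetical, and the choice of "pick the cluster with the most remaining capacity" is an assumption made to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    max_launchable: int  # hypothetical field: remaining scheduling capacity


def plan_cycle(pending_jobs, clusters, min_capacity=1):
    """Return (assignments, short_circuited) for one scheduling cycle."""
    # Total capacity across all clusters; negative reports count as zero.
    total = sum(max(c.max_launchable, 0) for c in clusters)

    # Short-circuit: not enough capacity to justify a full cycle.
    # The caller is expected to delay the next loop when this is True.
    if total < min_capacity:
        return {}, True

    # Never consider more jobs than there is capacity for.
    considerable = pending_jobs[:total]

    remaining = {c.name: c.max_launchable for c in clusters}
    assignments = {}
    for job in considerable:
        # Equivalent of a pos? filter: only clusters with capacity left.
        candidates = [name for name, cap in remaining.items() if cap > 0]
        if not candidates:
            break
        # Assumed placement rule: most remaining capacity wins.
        target = max(candidates, key=lambda name: remaining[name])
        assignments[job] = target
        remaining[target] -= 1
    return assignments, False
```

With clusters reporting capacities of 2, 0, and 1, only the first three pending jobs are considered and the cluster with no capacity is never selected. If the total falls below `min_capacity`, the function returns early and the caller would sleep before the next loop.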
Why are we making these changes?
This is a backpressure mechanism needed to efficiently use the Kubernetes Scheduler for real job pods.
Newly Identified Future Work
Move the synthetic pod config on cc-template to a new "autoscaling" block, in addition to max-outstanding, etc.