hammerlab / biokepi

Bioinformatics Ketrew Pipelines
Apache License 2.0

Better estimate each job's resource requirements and encode them into Biokepi as defaults #476

Open armish opened 7 years ago

armish commented 7 years ago

All right — continuing my scattered idea storms: one thing I realized while running all those pipelines over and over again (to work around random middle-node failures) is that we are not doing a good job of utilizing the clusters to their full potential. Two empirical observations:

I am not sure what the best solution is here, but I have been toying with the idea of adopting a merge-sort-based randomization approach that evenly spreads potentially similar tasks over time (or across the queue), so that we reduce their chances of interfering with each other.

Relevant (old) read on such an algorithm that we might benefit from: https://labs.spotify.com/2014/02/28/how-to-shuffle-songs/
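To make the idea concrete, here is a minimal sketch (in Python, purely illustrative — Biokepi itself is OCaml) of the Spotify-style spread: tasks in the same category get evenly spaced positions with a little random jitter, and the per-category streams are then merged by position. The `key` function and the jitter amount are assumptions, not anything Biokepi currently has.

```python
import random
from collections import defaultdict

def spread_shuffle(tasks, key):
    """Spread tasks of the same category evenly across the queue,
    in the spirit of Spotify's artist shuffle, so that similar jobs
    are less likely to land back-to-back and contend for the same
    resources."""
    groups = defaultdict(list)
    for t in tasks:
        groups[key(t)].append(t)

    positioned = []
    for group in groups.values():
        n = len(group)
        random.shuffle(group)           # randomize order within a category
        offset = random.random() / n    # random phase so categories interleave
        for i, t in enumerate(group):
            # even spacing of 1/n, plus a small jitter (<10% of the gap)
            pos = offset + i / n + random.uniform(-0.1, 0.1) / n
            positioned.append((pos, t))

    # merge all category streams by their assigned positions
    positioned.sort(key=lambda p: p[0])
    return [t for _, t in positioned]
```

For example, `spread_shuffle(jobs, key=lambda j: j.tool_name)` would keep a burst of ten alignment jobs from all hitting the scheduler in one contiguous block.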


PS: For the resource estimation part, after trying and failing to deploy a Grafana+InfluxDB monitoring stack into GKE's container engine (they do make it hard to do such a thing), I have been collecting statistics from GKE's own StackDriver and will try to embed some estimates to see whether they make a difference.
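A rough sketch of what "encoding estimates as defaults" could look like: a per-tool table of observed CPU/memory usage, padded with headroom before being turned into a resource request. The table contents, tool names, and `headroom` factor below are all made-up placeholders, not measured StackDriver numbers.

```python
import math

# Hypothetical per-tool defaults: (cpus, memory_gb) as would be
# derived from monitoring data. Numbers here are illustrative only.
RESOURCE_DEFAULTS = {
    "bwa-mem": (8, 16),
    "mutect": (2, 8),
    "star": (8, 40),
}

def requested_resources(tool, headroom=1.25, fallback=(1, 4)):
    """Return (cpus, memory_gb) to request for a job: take the
    observed estimate for the tool (or a conservative fallback)
    and pad memory by `headroom` to absorb run-to-run variance."""
    cpus, mem_gb = RESOURCE_DEFAULTS.get(tool, fallback)
    return (cpus, math.ceil(mem_gb * headroom))
```

The 25% headroom is a guess at a reasonable variance buffer; the real value would come from the spread seen in the collected statistics.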

Re: the rest... just some daydreaming for now ;)

armish commented 7 years ago

Relevant discussions on this: https://github.com/hammerlab/biokepi/issues/193 and https://github.com/hammerlab/biokepi/issues/166