Open · mongi3 opened this issue 4 years ago
Hi @mongi3! Typically the way you'll want to handle this scenario is to have the worker pool scale down based on workload or resource availability (either queue depth or available CPU/memory), which will be well supported by the new autoscale capabilities.
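To illustrate the kind of policy I mean, a group-level scaling stanza driven by queue depth could look roughly like the sketch below; the Prometheus source, query, queue name, and numbers are just placeholders, not a recommendation:

```hcl
# Rough sketch of a worker group scaled on queue depth via the autoscaler.
# Queue name, query, and thresholds are placeholders.
group "workers" {
  count = 3

  scaling {
    enabled = true
    min     = 1
    max     = 20

    policy {
      cooldown            = "2m"
      evaluation_interval = "30s"

      check "queue_depth" {
        source = "prometheus"
        query  = "sum(rabbitmq_queue_messages_ready{queue=\"work\"})"

        strategy "target-value" {
          target = 10
        }
      }
    }
  }
}
```

The point is that the scale-in trigger is workload-driven; which allocation actually gets removed is still left to the scheduler.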
But as you've noted, this leaves the problem of which specific allocations get terminated. I'd recommend not trying to solve that problem: even if you could target specific allocations for scaling, it would create an explosion of complexity in decision-making around things like affinities and constraints.
Instead, if you ensure the workloads are re-entrant and checkpoint their work-in-progress back to the queue, then when Nomad stops an allocation the work will be picked up by another allocation. You'll want to have done this anyway to handle non-Nomad-related failures like network glitches or failures of the underlying hosts.
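On the job-spec side, the main knob for that pattern is giving the task enough time to flush a checkpoint when Nomad stops it, e.g. something like this sketch (signal and timeout values are placeholders, and the checkpointing itself is up to the application):

```hcl
task "worker" {
  driver = "docker"

  # Give the worker a chance to checkpoint work-in-progress back to the
  # queue before it is killed. The application must handle the signal.
  kill_signal  = "SIGTERM"
  kill_timeout = "90s"

  config {
    image = "example/worker:latest"
  }
}
```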
@tgross thanks for the input. I especially like the idea of checkpoints to reduce reprocessing on failure. In general I do as you suggest, so processing is resilient to node failure, but in some of my cases taking out in-progress work will result in data loss that can't be covered this way. For example, the worker pool can be the thing doing live data acquisition, and if a busy worker is shut down, its data is lost to the ether until the system realizes a worker has gone down and shifts to another worker. I can't fix the node-failure case without impractical redundancy, but hopefully we can do something about the scale-down case, which I hope will be much more common. :)
I'll have to take your word on the complexity, as I can't see how those other factors matter with respect to scaling down specifically. That said, I'd hope the above case demonstrates enough value to balance any added complexity.
Issue
Consider a Nomad job containing all the tasks for a capability, including a group corresponding to a worker pool. This worker pool processes long-duration tasks from a queue. An external process can monitor application state (like queue depth) and resubmit the job with a different count to scale the pool up or down. This works well for scaling up, but causes trouble when scaling down with long-duration tasks.
Specifically, there is no control over which allocations get eliminated in a scale-down event. Ideally you'd have some influence, so that you could take down idle workers rather than interrupt ones already well into a long-duration task.
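For reference, the job is shaped roughly like the following (a simplified sketch; group names, driver, and image are placeholders). Scaling is done by resubmitting the job with a different count on the worker group:

```hcl
job "capability" {
  datacenters = ["dc1"]

  # ... other groups for the rest of the capability ...

  group "workers" {
    count = 5 # adjusted up or down on resubmit by an external monitor

    task "worker" {
      driver = "docker"

      config {
        image = "example/worker:latest"
      }
    }
  }
}
```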
Some things I've tried thus far
Using Nomad 0.11.1 running in dev mode...
Stopping an allocation and then immediately scaling down, hoping the scheduler would remove that allocation. It doesn't seem to; instead, it seems to prefer removing the most recent allocation regardless of its state (pending, failed, etc.).
Banking on that "remove the newest" behavior, I tried restarting the allocation I wanted to remove, hoping to make it the most recent, but this also doesn't work. The tracking doesn't seem to consider the restarted allocation as newer.
I tried changing my paradigm a bit. I changed the workers to self-destruct when idle for long enough, and in the Nomad job I disabled the restart/reschedule behaviors (see the sketch after this list). I thought I could then have an external controller monitor whether too many had gone down and resubmit the Nomad job with the desired counts. Unfortunately, I found Nomad is happy to count the dead allocations against the desired worker count rather than deploy new allocations, with seemingly no way to re-kick those failed allocations. Specifically, they don't honor any updates to the job's restart/reschedule configuration, and allocation restart behaviors explicitly only apply to currently running allocations.
I took a good look at the new autoscale capabilities, but they appear to have all the same limitations.
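For the self-destruct experiment above, disabling restart/reschedule on the worker group looks roughly like this (a sketch, not my exact job file):

```hcl
group "workers" {
  # Don't restart a worker in place after it exits...
  restart {
    attempts = 0
    mode     = "fail"
  }

  # ...and don't reschedule it onto another node either.
  reschedule {
    attempts  = 0
    unlimited = false
  }

  # (tasks omitted)
}
```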
Solution Ideas
I'd love to hear input on alternative ways to structure things to solve this issue, but regardless I think the ability to provide some input on removal decisions is a good thing. I don't really know the best way forward, but here are some starter options to consider:
Provide an explicit mechanism for marking or weighting allocations as better/worse candidates for removal, to directly feed scheduler decisions.
Have the scheduler consider allocation activity (like CPU usage) when choosing what to take down. Generally I'd suspect allocations using less CPU are more likely to be idle and are better candidates for elimination.
If elimination of the newest allocations is a reliable behavior, make it so that restarted/replacement allocations are indeed considered "newer".
Thoughts?