hashicorp / nomad-autoscaler

Nomad Autoscaler brings autoscaling to your Nomad workloads.

Resume a client node scheduled for draining #521

Open pablopla opened 3 years ago

pablopla commented 3 years ago

Draining a game server node or a video conference node might take hours. The autoscaler starts draining a node when load is low and creates a new node when load is high, so it might start new nodes while existing nodes are still draining. In that case it would be better to let the autoscaler cancel the drain and resume the existing nodes.

I suggest adding an option `draining_cancelable=true/false`:

- A direct call to the `node drain` command would set it to `false` by default (this could be changed).
- Autoscaling would always set it to `true`.
- Autoscaling would give nodes with `draining_cancelable=true` priority and stop draining them instead of adding new nodes.

This way you can force draining when you want to recycle a node, and autoscaling can cancel draining and resume nodes when needed. A sketch of what cancelling a drain could look like follows below.
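To make the idea concrete, here is a minimal sketch of cancelling a drain through the Nomad Go API. This is not part of the autoscaler today; the `draining_cancelable` key is the hypothetical option proposed above, assumed here to be stored as node metadata for illustration.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

// resumeCancelableNode looks for a node that is draining and was marked
// with the proposed draining_cancelable flag (assumed to live in node
// meta for this sketch). If one is found, its drain is cancelled and the
// node is marked eligible again, so the scheduler can place new work on
// it instead of the autoscaler launching a fresh node.
func resumeCancelableNode(client *api.Client) (bool, error) {
	stubs, _, err := client.Nodes().List(nil)
	if err != nil {
		return false, err
	}

	for _, stub := range stubs {
		if !stub.Drain {
			continue
		}

		// Fetch the full node record to read its meta map.
		node, _, err := client.Nodes().Info(stub.ID, nil)
		if err != nil {
			return false, err
		}
		if node.Meta["draining_cancelable"] != "true" {
			continue
		}

		// A nil DrainSpec cancels the drain; markEligible=true lets the
		// node accept allocations again.
		if _, err := client.Nodes().UpdateDrain(node.ID, nil, true, nil); err != nil {
			return false, err
		}
		fmt.Printf("resumed draining node %s\n", node.Name)
		return true, nil
	}
	return false, nil
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	if resumed, err := resumeCancelableNode(client); err != nil {
		log.Fatal(err)
	} else if !resumed {
		fmt.Println("no cancelable draining nodes; scale out instead")
	}
}
```

If `resumeCancelableNode` returns `false`, no cancelable drain was found and the autoscaler would fall back to launching a new node.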

lgfa29 commented 2 years ago

Hi @pablopla 👋

I think I understood the idea, though the implementation may be a bit trickier 🤔

Maybe this is something that `RunPreScaleInTasks` could do, and then the required number of instances could be adjusted based on that?
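As a rough illustration of that direction (the names here are hypothetical, not the autoscaler's actual internals), a pre-scale adjustment step could resume cancelable draining nodes first and shrink the remaining delta, reusing `resumeCancelableNode` from the sketch above:

```go
// Hypothetical sketch only: adjustScaleOut resumes cancelable draining
// nodes until either the requested capacity is fully absorbed or no such
// nodes remain, then returns how many new nodes are still required.
func adjustScaleOut(client *api.Client, wanted int) (int, error) {
	for wanted > 0 {
		resumed, err := resumeCancelableNode(client)
		if err != nil {
			return wanted, err
		}
		if !resumed {
			break // no more cancelable drains; launch new nodes
		}
		// Each resumed node satisfies one unit of the requested
		// capacity, so one fewer new node is needed.
		wanted--
	}
	return wanted, nil
}
```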

pablopla commented 2 years ago

I'm just evaluating Nomad and the autoscaler for my use case. I don't know anything about the architecture or `RunPreScaleInTasks`. Being able to resume a client node or cancel a drain feels like something reasonable to expect from an autoscaler.

lgfa29 commented 2 years ago

Yeah, it sounds like a good idea. We don't have a timeline for implementation, but I will place it into our backlog.

Thanks for the idea!

adamsmithkld commented 1 year ago

For what it's worth, having the autoscaler cancel a scale-down in order to scale back up is a feature my team would greatly appreciate. We are seeing inefficiencies in our Azure clusters with bursty workloads, where we are stuck scaling down for an extended period waiting for Docker containers to finish while work piles up. Once the scale-down event finally finishes, the cluster immediately scales way back up.