hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Ability to rebalance allocation placements #1635

Open ghost opened 8 years ago

ghost commented 8 years ago

I'm not sure how simple this would be to implement, but it would be great if we could make distinct_hosts best-effort. This would be useful when backend instances (clients in Nomad) need to be taken offline for maintenance or upgrades.

For example, if we have a task group with a count of 3, distinct_hosts = best-effort and 3 Nomad clients, the task group would be distributed across the three instances as one container per instance.

If we then take one of the three backends offline for maintenance (or if it failed due to a kernel panic or networking issue), the scheduler would re-provision that backend's container on one of the remaining backends. The scheduler would then detect that either the backend recovered or a new backend joined the Nomad cluster and rebalance the containers to restore the distinct-hosts invariant.
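For concreteness, here is a minimal sketch of a job using today's hard distinct_hosts constraint (the job name, driver, and image are just placeholders); the best-effort mode proposed above would relax this when fewer than 3 clients are available:

```hcl
# Hypothetical job spec sketch: the hard distinct_hosts constraint as it exists today.
# With count = 3 and only 2 eligible clients, one allocation remains unplaceable,
# which is the rigidity a best-effort mode would relax.
job "web" {
  datacenters = ["dc1"]

  group "app" {
    count = 3

    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    task "server" {
      driver = "docker"

      config {
        image = "nginx:latest"
      }
    }
  }
}
```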

dadgar commented 8 years ago

Nomad already does a best-effort spread between clients running the same job, so we wouldn't need to add that. I think the point you are getting at is that you would like Nomad to rebalance occasionally.

If I am drawing the right conclusion, maybe we should retitle the issue?

ghost commented 8 years ago

I think the point you are getting at is that you would like Nomad to rebalance

That, but also: I would like a guarantee that if I have 3 task groups and 2 or more instances, then Nomad doesn't run all 3 containers on the same instance, because if I need to take an instance offline I can't do it without either taking the whole service down with it or increasing the count and hoping that more containers are started on other instances. At the same time, I don't want an unschedulable job because I have 2 instances and a count of 3 with distinct_hosts, if that makes sense. Is that a feature in the scheduler at present?

dadgar commented 8 years ago

It is not a feature currently, and I think rebalance plus the current behavior would solve that. If you could initiate a rebalance, the scheduler would naturally want to spread the tasks across different hosts (even without the distinct_hosts constraint set).

I am going to rename the issue

jemc commented 7 years ago

Being able to initiate a rebalance is a good idea.

But I'd also like to add that it would seem appropriate for Nomad to automatically attempt to do a rebalance upon failure to make an allocation (due to lack of resources). That is, I would expect Nomad to do a rebalance if doing so would make the allocation succeed. I was actually surprised to learn that this wasn't already implemented for a scheduler like Nomad.

discobean commented 7 years ago

+1. A best-effort distinct_hosts feature with an auto or manual rebalance option would certainly be helpful for us.

MDL-Cloud-Ops commented 7 years ago

That, but also: I would like a guarantee that if I have 3 task groups and 2 or more instances, then Nomad doesn't run all 3 containers on the same instance, because if I need to take an instance offline I can't do it without either taking the whole service down with it or increasing the count and hoping that more containers are started on other instances. At the same time, I don't want an unschedulable job because I have 2 instances and a count of 3 with distinct_hosts, if that makes sense. Is that a feature in the scheduler at present?

We face the same issue with Nomad 0.5.5. I assume this is still open.

To put it simply, we don't want to have to manually construct artificial job specifications (such as one task group per availability zone) in order to have a job with at least two instances spread across multiple AZs in AWS. Not doing so creates a serious reliability impediment, and manually manipulating the scheduler undermines the value of having a scheduler in the first place.
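For illustration, this is roughly what that artificial per-AZ layout looks like (assuming the standard AWS fingerprint attribute; group names, counts, and AZ values are made up), i.e. the duplication we would like to avoid:

```hcl
# Sketch of the per-AZ workaround described above; each availability zone needs
# its own near-identical group just to force a spread.
group "app-az-a" {
  count = 1

  constraint {
    attribute = "${attr.platform.aws.placement.availability-zone}"
    value     = "us-east-1a"
  }

  # ... same task definition repeated here ...
}

group "app-az-b" {
  count = 1

  constraint {
    attribute = "${attr.platform.aws.placement.availability-zone}"
    value     = "us-east-1b"
  }

  # ... same task definition repeated here ...
}
```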

Going further, I would argue that tight bin-packing needs to be balanced against reliability requirements. I would rather see relatively frequent, carefully orchestrated service movements than a single highly loaded server. I think this also plays into how nicely Nomad works with automatic scaling of clusters, such as AWS ASGs.

To me this is a fairly big feature, but it is also the next critical thing that determines whether Nomad will be the scheduler of choice or whether we will be forced to move to an alternative technology. It feels wholly wrong for the user of a scheduler to have to solve these inherent challenges of scheduling.

Thoughts, reactions?

djenriquez commented 6 years ago

Any updates/thoughts on this feature? HA is definitely a higher priority for us than efficiency; we would love to be able to redistribute onto new Nomad clients as they become available!

daledude commented 6 years ago

Even merely the ability to re-evaluate a job away from its current node would be something.

CumpsD commented 6 years ago

Looking forward to this as well. I was amazed to see one job have 3 instances on 1 node while 2 other nodes were added to the cluster and sat doing nothing. I expected a rebalance to occur towards the new nodes.

alitvak69 commented 6 years ago

We are at version 0.8.3 at this point. Is there a "vote with your money" option? I would suggest we could collectively pay HashiCorp to develop this very important feature.

hvindin commented 6 years ago

@alitvak69 we might not need to go all the way to crowdfunding quite yet. I'm building an internal shared hosting platform at the moment, and auto-rebalancing across hosts, as well as being able to loosen the grip on the bin-packing a bit so we can spray across a not-so-elastic internal cloud, are the things we need from HashiCorp in the next few months; otherwise we're going to need to follow the rest of the market and do the same Kubernetes thing as everyone else.

We're sure as hell not a fly-by-night tiny basement operation, and looking at how much money we're pouring into infrastructure automation for bad solutions that don't work, I suspect there are a few people willing to put a lot of investment into the HashiCorp ecosystem if we could just get these seemingly small kinks out of the way.

But seriously, I know we got the reschedule stanza recently, so there's obviously some recognition that shuffling jobs between existing nodes is a desired behaviour in some scenarios. It would still be nice to be able to encourage Nomad to be a bit more lax about keeping a job on the hottest node if that means a more likely startup success and runtime stability, even if we end up with some extra capacity wasted on spare servers.
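For context, a sketch of the reschedule stanza mentioned above (the values are illustrative): it controls how failed allocations get moved, but it does not pull healthy allocations off a hot node.

```hcl
# Illustrative reschedule stanza: governs replacement of failed allocations only;
# it does not rebalance healthy allocations onto emptier nodes.
group "app" {
  reschedule {
    attempts       = 10
    interval       = "1h"
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "10m"
    unlimited      = false
  }
}
```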

dadgar commented 6 years ago

Hey all,

Just an update on this issue. We understand it is important, and there are plans for both a short-term and a long-term solution. The short-term one is that we will have an allocation lifecycle API where individual allocations can be killed and the scheduler will replace them. The longer-term solution is a rebalancer system that detects these issues and rebalances the cluster over time or on demand via an API.

jippi commented 6 years ago

@dadgar sounds good! Is the short term Nomad 0.11?

dadgar commented 6 years ago

@jippi Aiming to have allocation life cycle APIs in the 0.9.X series.

jippi commented 6 years ago

@dadgar nice!

suslovsergey commented 5 years ago

@dadgar any news?

KamilKeski commented 5 years ago

@dadgar now that we are in the 0.9.x releases, is there a more solid target for the lifecycle APIs? Much appreciated!

langmartin commented 5 years ago

The allocation lifecycle APIs made it into 0.9.2, documented here: https://www.nomadproject.io/api/allocations.html#stop-allocation

pashinin commented 4 years ago

I can stop an allocation with the Nomad UI in 0.10.4 and it will start on another node.

Is there a plan to have an automatic rebalance now?

idrennanvmware commented 4 years ago

+1 for rebalance.

Stopping an allocation really doesn't help our scenarios. In our case we have changing node metadata that can cause allocations to move around, and given that jobs control their constraints, it's not desirable to go and figure all of that out ourselves. What we would expect is that the scheduler periodically looks for new nodes to place/rebalance allocations on, and ALSO looks for allocations that should be removed because they no longer meet their constraints.

In our experiments, if we change a constraint attribute, the allocation will never leave the node until a new job update comes (even running --force-reschedule does not cause the allocation(s) to be re-evaluated).
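To illustrate the kind of constraint involved (the meta key and value here are made up): once the client's metadata no longer matches, the existing allocation is not re-evaluated until the job itself is updated.

```hcl
# Hypothetical node-metadata constraint; if the node's meta.storage_tier changes
# after placement, the running allocation currently stays put until a job update.
constraint {
  attribute = "${meta.storage_tier}"
  value     = "ssd"
}
```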

robbah commented 2 years ago

Any news on this already?

tgross commented 2 years ago

@robbah we'll update issues with news when we have it. Please feel free to add :+1: reactions to the top-level comment (which we do look at), but please don't spam issues with bumps.