hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

[question] Unbalanced job scheduling #1990

Closed cleiter closed 7 years ago

cleiter commented 7 years ago

I'm running Nomad 0.5.0-rc1 (the same behavior occurred with 0.4.1) with Consul 0.7.0. I have deployed Nomad/Consul in 3 datacenters, with 1 server and 2 clients in each datacenter. All client and server nodes are of the same instance type and are configured identically.

My jobs are docker jobs with 2 instances and have exactly 1 group and 1 task with distinct_hosts = true and datacenters = [ "eu-west-1a", "eu-west-1b", "eu-west-1c" ].
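For reference, the setup described above corresponds roughly to a job spec like the following sketch; the job/group/task names, Docker image, and resource values are placeholders, not the actual configuration:

```hcl
job "worker" {
  # Eligible for all three datacenters.
  datacenters = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
  type        = "service"

  group "app" {
    # Two instances of the single task group.
    count = 2

    # Never co-locate two instances of this group on the same client node.
    constraint {
      distinct_hosts = true
    }

    task "app" {
      driver = "docker"

      config {
        image = "example/app:latest" # placeholder image
      }

      resources {
        cpu    = 500 # MHz (placeholder)
        memory = 256 # MB (placeholder)
      }
    }
  }
}
```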

The current allocations look like this:

ID        DC          Name        Class   Drain  Status  Running Allocs
beb65f97  eu-west-1a  worker-a-0  <none>  false  ready   6
c78bd8e2  eu-west-1a  worker-a-1  <none>  false  ready   5
52cabf10  eu-west-1b  worker-b-0  <none>  false  ready   0 <-
e4c1467f  eu-west-1b  worker-b-1  <none>  false  ready   0 <-
8d1cff29  eu-west-1c  worker-c-0  <none>  false  ready   4
9289081c  eu-west-1c  worker-c-1  <none>  false  ready   5

No matter how often I redeploy jobs, Nomad seems to dislike datacenter B for some reason. When I set the instance count to, say, 5, it does deploy jobs to nodes in datacenter B, which means those nodes are working correctly. When I switch back to 2 instances, it removes them from B and those nodes are idle again. So these machines sit idle all the time while other machines run 5-6 services.

What's also a bit concerning is that jobs are now actually running twice in the same datacenter (on worker 0 and 1), while I'd like Nomad to distribute them across datacenters, which, as far as I understand the documentation, it should.

I wasn't able to find anything interesting in the logs and I don't know a way to see Nomad's reasoning for job allocation. Is this expected behavior or is there something I could do to have the jobs more evenly distributed? Any help is appreciated.

dadgar commented 7 years ago

Hey @cleiter,

Nomad uses a bin-packing algorithm. This causes it to fill nodes before moving on to others, which is the behavior you are observing. It allows more optimal placement of mixed workloads.

Currently we do not have controls to allow spread across DCs, but that is something we would like to add in the future. If you must have your tasks spread across the datacenters, you have two options (see the sketch after this list):

1. A job per DC.
2. Two task groups in the same job, each one adding a constraint on the desired DC.
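A minimal sketch of option 2, assuming the DC constraint is expressed via the node.datacenter attribute; the group names, counts, and image are placeholders, and the group for datacenter A is omitted for brevity:

```hcl
job "worker" {
  datacenters = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]

  group "worker-b" {
    count = 1

    # Pin this group's allocations to datacenter B.
    constraint {
      attribute = "${node.datacenter}"
      value     = "eu-west-1b"
    }

    task "app" {
      driver = "docker"
      config {
        image = "example/app:latest" # placeholder image
      }
    }
  }

  group "worker-c" {
    count = 1

    # Pin this group's allocations to datacenter C.
    constraint {
      attribute = "${node.datacenter}"
      value     = "eu-west-1c"
    }

    task "app" {
      driver = "docker"
      config {
        image = "example/app:latest" # placeholder image
      }
    }
  }
}
```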

Let me know if you have any other questions, and if this answered your question, would you mind closing the issue? Thanks!

cleiter commented 7 years ago

Hi @dadgar, thanks for the explanation! But I have to say this behavior is still quite surprising to me. I just don't see how not using the resources available in the cluster - leaving machines totally idle while others run multiple jobs - could be a desirable scheduling choice. It seems bad for performance and robustness. The only thing I can think of is that if I ever have a job that needs 100% of a node's available memory, then it might be good to have a spare node for it.

Comparing that to Kubernetes (which I've actually never used): https://github.com/kubernetes/kubernetes/blob/master/docs/devel/scheduler_algorithm.md#ranking-the-nodes The priority functions LeastRequestedPriority, BalancedResourceAllocation, and SelectorSpreadPriority seem to be what I'd have expected.

What's the rationale for Nomad's behavior? Are there any plans to make the Nomad scheduling algorithm configurable, e.g. by supplying a number of priority functions?

cleiter commented 7 years ago

I've just increased the CPU resource requirements, which leads to a more even distribution of jobs. The disadvantage is that if one datacenter goes down, Nomad might not be able to move jobs to other nodes because of these tighter requirements. 😕
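For context, this workaround amounts to requesting a larger slice of each node in the task's resources stanza; a sketch with illustrative numbers, not the actual values used:

```hcl
resources {
  cpu    = 2000 # MHz; large enough that only a few allocations fit per node
  memory = 1024 # MB
}
```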

dadgar commented 7 years ago

@cleiter I think there should be a spread between DCs; I am not disagreeing there! In terms of the link you posted, we do SelectorSpreadPriority and bin-packing. This is a well-studied part of schedulers, and spread is among the worst choices for utilization. The Borg paper and its references are great reads if you are interested: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf

cleiter commented 7 years ago

@dadgar Thanks for the link, I read the parts regarding scheduling.

I think it depends on the point of view. If you take a more local view, it makes sense to run as many services as possible on a single node. It's efficient in terms of the total number of nodes required; that's obviously better utilization of a single machine.

The other perspective is when you have a fixed number of nodes and want to spread the jobs across all of them. This seems more efficient performance-wise: services can use more CPU than they are guaranteed, and there's less concurrent I/O. Another advantage is that if one node goes down, there are fewer services to move to other nodes. I would say that's better utilization of the whole cluster.

The second scenario is what I have and what I imagine is not that uncommon. To exaggerate a bit, let's say you have 10 nodes which are each capable of running 10 services. When you deploy 10 services, would you rather have them all on a single node or a single service running on each node? :)

Don't get me wrong, I really like Nomad and I think the current approach makes sense, but the second scenario seems to be a plausible use case as well. I'm not a scheduling theory expert; this is just what I intuitively expected, so feel free to convince me otherwise. :)

I think I can use datacenter constraints so that each service is deployed exactly once to each datacenter to get the behavior I imagine (performance, resilience, utilization), but I would be really happy if Nomad offered an option to prioritize scheduling onto the nodes with the lowest current workload.

dadgar commented 7 years ago

Again, I want to be clear that I agree the DC behavior is unexpected and should be addressed.

The reason you don't want lowest-current-workload scheduling is what happens when you have 1 service on each of the ten nodes and then a new job is submitted that requires all the resources of one node: you can't schedule it, whereas with any amount of bin-packing you would have had free nodes available.

You should reserve resources for your tasks so they meet their SLO independently of how many other tasks are on the same machine.

cleiter commented 7 years ago

This depends on the usage. In a microservice architecture I know I won't have jobs that need 100% of the resources of a single node, and I would be happier if 1/3 of the machines I'm paying for weren't idle all the time.

What should happen when you schedule a job that needs more resources than are currently available on a single machine is described in the paper you linked above:

If the machine selected by the scoring phase doesn’t have enough available resources to fit the new task, Borg preempts (kills) lower-priority tasks, from lowest to highest priority, until it does. We add the preempted tasks to the scheduler’s pending queue, rather than migrate or hibernate them.

I would still argue for a configuration option so that every user can tweak the node prioritization to their needs.

Feel free to close this issue; I opted for the DC constraints and now the usage is more balanced. :)

dadgar commented 7 years ago

Sure! Hope this didn't come off as an argument! Was simply discussing it with you :)

Yeah, once Nomad has preemption we can re-assess some of the scheduling decisions! I am looking forward to that :)

Thanks, Alex

cleiter commented 7 years ago

Yeah, sure. :) Overall I'm really happy with Nomad and I appreciate the work you put into it! It's still a young project so I'm sure there will be many improvements in the future.

linuxgood1230 commented 6 years ago

For workloads like databases, migration costs a lot. I don't know how many resources will be used at first, but later one node ends up with ten+ allocations; when one or two allocations use up their resources, the whole node becomes overloaded. Then the node stops responding and goes down, the allocations migrate to another node, and that node goes down as well. When Nomad is used with the raw_exec or java driver, the scheduling ends up being human scheduling rather than smart scheduling.
