hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

[Feature] Ability to have different dispatching policies for different sets of servers such as datacenter, nodeclass or meta #11868

Open linluxiang opened 2 years ago

linluxiang commented 2 years ago

Hi there,

Proposal

I'd like the ability to use different dispatching policies for different sets of nodes. For example, for nodeclass 1 the allocations would be spread evenly across all machines, while for nodeclass 2 they would be placed using the bin-packing algorithm.

This is a description of my use case.

Use-case

I have two datacenters, dc1 and dc2. I’d like the system to work like this:

1. The job should be executed on dc2 only after dc1 is full.
2. On dc1, the job should be dispatched evenly for all the machines but on dc2, the jobs should be dispatched using bin-packing.

I created a parameterized job and used this setting of affinity:

affinity {
  attribute = "${node.datacenter}"
  value     = "dc1"
  weight    = 100
}

The 1st goal seems to be achieved, but the 2nd one isn't. I searched on Google and found someone saying that "Nomad has an automatic anti-affinity to prevent a node from running too many jobs." But when I used nomad alloc status to check the ranking, the anti-affinity score was always 0, never negative.

I guess the anti-affinity score is calculated automatically, so I cannot control it. To achieve both goals, I'd like to request a new feature.
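The closest approximation I found for the 2nd goal is a spread block that counts placements per individual node. This is only a sketch and I haven't verified it affects the ranking scores the way I want:

```hcl
# Sketch: spread allocations evenly across individual client nodes
# by counting placements per unique node ID (an assumption that
# ${node.unique.id} is usable as a spread attribute).
spread {
  attribute = "${node.unique.id}"
  weight    = 100
}
```

Even if this works, it applies to the whole job rather than only to dc1, which is why a per-set policy would still be needed.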

Would you mind taking a look at it, please?

Thank you

lgfa29 commented 2 years ago

Thanks for the idea @linluxiang.

I think it's a very interesting concept, but, given how Nomad internals work today, it may be tricky to implement.

blissend commented 2 years ago

Yes please. An example of where bin-packing becomes a problem is IP rate-limiting on API requests, or limited network throughput on clients. Workarounds that use the spread stanza to target multiple datacenters, or to favor a specific client, can still be problematic with numerous jobs: either they require manual effort per job (unless you build a job generator), or they still partially suffer from bin-packing.
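For concreteness, the spread-stanza workaround looks roughly like this per job (the percentages are made up for illustration):

```hcl
# Sketch of the per-job workaround: bias placements toward dc1 while
# still allowing dc2. This has to be repeated in every job, which is
# the manual effort mentioned above.
spread {
  attribute = "${node.datacenter}"
  weight    = 100

  target "dc1" {
    percent = 70
  }
  target "dc2" {
    percent = 30
  }
}
```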

The Nomad scheduler uses a bin-packing algorithm when making job placements on nodes to optimize resource utilization and density of applications. Although bin packing ensures optimal resource utilization, it can lead to some nodes carrying a majority of allocations for a given job. This can cause cascading failures where the failure of a single node or a single data center can lead to application unavailability.

Source: https://learn.hashicorp.com/tutorials/nomad/spread

Is that the official reason Nomad uses bin-packing? It makes sense, but given fixed costs and clients without autoscaling, is maximizing density still optimal regardless of resource utilization? I'm probably missing some nuance here.
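For what it's worth, Nomad does expose a toggle between bin-packing and spread scheduling, but it is cluster-wide rather than per datacenter or node class, which is exactly the gap this issue is about. A sketch of the server agent configuration, assuming a Nomad version that supports `default_scheduler_config`:

```hcl
# Server agent configuration sketch. The scheduler algorithm applies
# to the whole cluster; it cannot vary by datacenter or node class.
server {
  enabled = true

  default_scheduler_config {
    scheduler_algorithm = "spread" # or "binpack" (the default)
  }
}
```

The same setting can reportedly be changed at runtime via the operator scheduler configuration API, but either way it is a single global policy.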