hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Add outgoing network bandwidth resources option #19221

Open RSWilli opened 9 months ago

RSWilli commented 9 months ago

Proposal

Under the resources section of the job specification, it may be useful to add an upstream (and maybe even downstream) bandwidth requirement option.

Use-cases

I am running jobs that continually stream an audio signal to different upstream servers. Each job consumes some CPU and RAM resources, which are already balanced out and pretty minimal. The bandwidth of the machine is known, and the required bandwidth of the job is pretty much constant or has an upper bound that is known in advance.

It would be nice if Nomad would keep track of the provisioned bandwidth of a node and only schedule jobs on nodes that have enough bandwidth left. The job is allowed to (temporarily) take up more than the given bandwidth, and Nomad shouldn't kill it then (similar to CPU shares).
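The bookkeeping described above can be sketched roughly as follows. This is a minimal illustration of booking-style placement, not Nomad's scheduler: the `Node` class, the `place` function, and the `total_mbits`/`reserved_mbits` fields are hypothetical names, and the total outgoing bandwidth of each node is assumed to be known up front.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    total_mbits: int          # assumed: outgoing capacity configured by the operator
    reserved_mbits: int = 0   # sum of the bandwidth booked by jobs placed so far

    def free_mbits(self) -> int:
        return self.total_mbits - self.reserved_mbits


def place(nodes, job_mbits):
    """Book job_mbits on the first node with enough unreserved bandwidth."""
    for node in nodes:
        if node.free_mbits() >= job_mbits:
            node.reserved_mbits += job_mbits
            return node.name
    return None  # no node has enough bandwidth left, so the job is not placed


nodes = [Node("node-a", total_mbits=100), Node("node-b", total_mbits=1000)]
placements = [place(nodes, mbits) for mbits in (60, 60, 900, 100)]
print(placements)  # ['node-a', 'node-b', 'node-b', None]
```

Only the reservation is tracked here; nothing measures or limits what a job actually sends, so a job that temporarily exceeds its booked value is unaffected.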

If Nomad accounted for network bandwidth, it could prevent these streams from failing because other jobs are already constantly using the bandwidth.

Attempted Solutions

This is similar to, but not exactly the same as, the feature deprecated in Nomad 0.12.0: https://developer.hashicorp.com/nomad/docs/upgrade/upgrade-specific#nomad-0-12-0. I don't know whether incoming bandwidth would be useful to add, but I can see the use case for outgoing traffic.

jrasell commented 9 months ago

Hi @RSWilli, and thanks for raising this issue, which seems like a very interesting idea. I am a little unfamiliar with how to understand the total available bandwidth on a machine; do you have any pointers that would allow me to assess this feature a little more quickly? Specifically, Nomad would need a way to fingerprint this value and make it available to the servers as part of the node object, to use when performing scheduling calculations.

nomad shouldn't kill it then (similar to CPU shares)

Nomad does not act as the killer in this scenario; it is the kernel which performs the terminations. It is not feasible to have Nomad clients continually monitor application resource utilisation, so any application termination would be dependent on kernel features. From Nomad's perspective, the bandwidth resource would be used in a booking fashion for scheduling purposes.

RSWilli commented 5 months ago

Sorry for the (very) late reply.

I am a little unfamiliar with how to understand the total available bandwidth on a machine

In my head it was as easy as "just configure the total bandwidth when starting the Nomad server", but I understand that this becomes more difficult when a server has multiple interfaces with different bandwidths.
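One way to picture the multi-interface case is a per-interface table in which an operator-configured value takes precedence and the link speed reported by the kernel is only a fallback. The sketch below assumes a Linux host, where `/sys/class/net/<iface>/speed` reports the negotiated link speed in Mbit/s; the function names and the `configured` mapping are hypothetical and not part of Nomad. Note that link speed is only an upper bound and says nothing about the real upstream bandwidth behind the interface.

```python
from pathlib import Path


def link_speed_mbits(iface, sysfs_root="/sys/class/net"):
    """Return the link speed Linux reports for iface in Mbit/s, or None."""
    try:
        speed = int((Path(sysfs_root) / iface / "speed").read_text().strip())
    except (OSError, ValueError):
        return None  # interface missing, link down, or value not readable
    return speed if speed > 0 else None  # non-positive values mean "unknown"


def node_bandwidth(configured, sysfs_root="/sys/class/net"):
    """Map each interface to its capacity: configured value first, else link speed."""
    result = {}
    for iface, mbits in configured.items():
        if mbits is None:
            mbits = link_speed_mbits(iface, sysfs_root)
        result[iface] = mbits
    return result


# Hypothetical usage: eth0 is capped by the operator, eth1 falls back to the link speed.
print(node_bandwidth({"eth0": 200, "eth1": None}))
```

Keeping the values per interface, rather than summing them into one number, leaves room for a job to be matched against the interface it would actually use.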

kernel which performs the terminations

The kernel features required to enforce these limits are even further out of my comfort zone...