hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

allow setting application thread affinity #20531

Open caiodelgadonew opened 2 months ago

caiodelgadonew commented 2 months ago

Nomad version

Nomad v1.7.7
BuildDate 2024-04-16T19:26:43Z
Revision 0f34c85ee63f6472bd2db1e2487611f4b176c70c

Operating system and Environment details

AlmaLinux release 9.3 (Shamrock Pampas Cat)
NAME="AlmaLinux"
VERSION="9.3 (Shamrock Pampas Cat)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.3"
PLATFORM_ID="platform:el9"
PRETTY_NAME="AlmaLinux 9.3 (Shamrock Pampas Cat)"

Issue

Nomad shouldn't override the CPU affinity (taskset) that a binary sets for itself under raw_exec, even when reserved cores are configured in the client stanza.

Reproduction steps

Client configuration

client {
  reserved {
    cores = "1"
  }
}

With the raw_exec driver, we run a C++ binary that spawns multiple threads. One of those threads has its affinity set by the binary itself using sched_setaffinity.

e.g.:

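#include <cerrno>
#include <cstring>
#include <iostream>
#include <sched.h>
#include <thread>

// Pin the calling thread to a single CPU core.
void pin_cpu(int core) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    // A pid of 0 applies the mask to the calling thread.
    if (sched_setaffinity(0, sizeof(cpuset), &cpuset) < 0) {
        std::cerr << "sched_setaffinity failed: " << std::strerror(errno) << std::endl;
    } else {
        std::cout << "Success!" << std::endl;
    }
}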

But with this configuration on the Nomad client, running the binary results in the affinity being overwritten, leaving the thread able to run only on non-reserved cores.
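One way to observe the behaviour described above is to read back the effective affinity mask from inside the task, for example right after calling pin_cpu. The following is a minimal sketch, assuming Linux with glibc; allowed_cpus is a hypothetical helper name introduced here for illustration and is not part of Nomad or of the original binary.

```cpp
#include <sched.h>   // sched_getaffinity, cpu_set_t (Linux/glibc)
#include <vector>

// Hypothetical helper: returns the CPU ids the calling thread is currently
// allowed to run on, or an empty vector if the query fails.
std::vector<int> allowed_cpus() {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    // A pid of 0 queries the calling thread.
    if (sched_getaffinity(0, sizeof(cpuset), &cpuset) != 0) {
        return {};
    }
    std::vector<int> cpus;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &cpuset)) {
            cpus.push_back(cpu);
        }
    }
    return cpus;
}
```

Logging this list from the pinned thread, once with and once without the reserved block in the client config, should make it visible whether the requested core is still part of the mask.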

Expected Result

Nomad does not apply a taskset to threads that have a specific affinity set in the binary's code.

Actual Result

Nomad overwrites the affinity of the thread.

EXTRA DETAILS

I've experimented with the core isolation feature using the following settings:

  1. Client config cores = "0,2-7" and NOMAD_CPU_CORES="0-7"

  2. Client config cores = "1", job NOMAD_CPU_CORES="0-7", and resources.cores = 1

  3. No specific client config and job NOMAD_CPU_CORES="0,2-7" (this makes me think NOMAD_CPU_CORES is not working as intended)

tgross commented 1 week ago

Hi @caiodelgadonew! I'm pretty sure this is working as intended. If the client has been configured to reserve cores so that it doesn't assign workloads to those cores, it'll need to set the cpuset for all tasks such that they fall outside the reserved cores. Otherwise the cores are not meaningfully "reserved", right?

What's the goal here with having the application set its own affinity for one thread? Is this something you can solve by giving the application a resources.cores and then having the thread pick one of those cores?

caiodelgadonew commented 1 week ago

> Hi @caiodelgadonew! I'm pretty sure this is working as intended. If the client has been configured to reserve cores so that it doesn't assign workloads to those cores, it'll need to set the cpuset for all tasks such that they fall outside the reserved cores. Otherwise the cores are not meaningfully "reserved", right?

You're correct; the naming was somewhat misleading to me. In the Nomad agent configuration, reserved.cores is documented as "Specifies the cpuset of CPU cores to reserve", and reserved itself as "Specifies that Nomad should reserve a portion of the node's resources from receiving tasks."

What I had understood was that I could tell Nomad "please don't allocate anything to core X", and the C++ app itself could then pin its specific thread to that core, which Nomad was not using. What actually happened is that Nomad prevented the app from pinning its thread to that core.

> What's the goal here with having the application set its own affinity for one thread? Is this something you can solve by giving the application a resources.cores and then having the thread pick one of those cores?

Just to clarify a bit: I work at a trading company, and latency is really important for us, so we sometimes pin a specific thread to a specific core so that the core is kept busy only by that low-latency thread.

As a workaround, we scripted a service that checks resources.cores, then sets the affinity of the tasks to the first core and the rest to the remaining ones.
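The workaround is only described in outline here. An in-process variant of the same idea (first assigned core for the latency-critical thread, remaining cores for everything else) might look like the sketch below. It assumes Linux with glibc; CoreSplit, split_cores, and pin_current_thread are hypothetical names introduced for illustration, not the actual service.

```cpp
#include <sched.h>   // sched_setaffinity, cpu_set_t (Linux/glibc)
#include <vector>

// Hypothetical split of a task's cores: the first core is kept for the
// latency-critical thread, the remainder is shared by all other threads.
struct CoreSplit {
    std::vector<int> critical;
    std::vector<int> others;
};

CoreSplit split_cores(const std::vector<int>& cores) {
    CoreSplit split;
    if (cores.empty()) {
        return split;
    }
    split.critical.push_back(cores.front());
    split.others.assign(cores.begin() + 1, cores.end());
    return split;
}

// Restricts the calling thread to the given CPU ids.
// Returns false if the list is empty or the kernel rejects the mask.
bool pin_current_thread(const std::vector<int>& cpus) {
    if (cpus.empty()) {
        return false;
    }
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    for (int cpu : cpus) {
        CPU_SET(cpu, &cpuset);
    }
    // A pid of 0 applies the mask to the calling thread only.
    return sched_setaffinity(0, sizeof(cpuset), &cpuset) == 0;
}
```

Each thread would call pin_current_thread itself at startup: the low-latency thread with split.critical, every other thread with split.others. When only a single core is assigned, others is empty and those threads are left unpinned.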

What we would like is for Nomad not to schedule anything on a specific core, while still letting us set the affinity of a thread running in a Nomad task to that core.

I'm not sure whether Nomad can help with this case, since it's so specific, but I hope my message was clear. Let me know if anything is still confusing.

As for the issue, I'm also not sure whether it should stay open or be closed.

tgross commented 1 week ago

> As a workaround, we scripted a service that checks resources.cores, then sets the affinity of the tasks to the first core and the rest to the remaining ones.

Yeah, that sounds like the right move here. We expose NOMAD_CPU_CORES in the task's environment for just this kind of thing.
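For reference, reading NOMAD_CPU_CORES from inside the task could be sketched as follows. This assumes the variable holds a cpuset-style list such as "0-7" or "0,2-7", which is the form shown earlier in this thread; parse_cpu_list and nomad_task_cores are hypothetical helper names, not a Nomad API.

```cpp
#include <cstdlib>    // std::getenv
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: expands a cpuset-style list such as "0,2-7" into
// individual core ids. Malformed entries are skipped rather than reported.
std::vector<int> parse_cpu_list(const std::string& list) {
    std::vector<int> cores;
    std::stringstream ss(list);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (item.empty()) {
            continue;
        }
        try {
            const std::size_t dash = item.find('-');
            if (dash == std::string::npos) {
                cores.push_back(std::stoi(item));
            } else {
                const int lo = std::stoi(item.substr(0, dash));
                const int hi = std::stoi(item.substr(dash + 1));
                for (int c = lo; c <= hi; ++c) {
                    cores.push_back(c);
                }
            }
        } catch (const std::exception&) {
            // Entry is not a number or a range; ignore it.
        }
    }
    return cores;
}

// Returns the cores listed in NOMAD_CPU_CORES, or an empty vector if the
// variable is not set in the task's environment.
std::vector<int> nomad_task_cores() {
    const char* env = std::getenv("NOMAD_CPU_CORES");
    return env ? parse_cpu_list(env) : std::vector<int>{};
}
```

The application can then pick a core from this list for its latency-critical thread instead of hard-coding a core id.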

> What we would like is for Nomad not to schedule anything on a specific core, while still letting us set the affinity of a thread running in a Nomad task to that core.
>
> I'm not sure whether Nomad can help with this case, since it's so specific, but I hope my message was clear. Let me know if anything is still confusing.

Typically Nomad has avoided getting into managing what's happening inside the task boundary. That is, Nomad provides the "container" (whether a literal Linux container or otherwise) and then it's up to the application what to do inside. Managing individual thread affinities is likely out of scope for us. That being said, we've recently shipped NUMA aware scheduling in Nomad Enterprise, so there's some precedent for giving a little more control here.

I'm going to re-title this issue as a feature request and mark it for further discussion and roadmapping.