ocaml-multicore / domainslib

Parallel Programming over Domains
ISC License

Interface to query CPU setup, numa information and set affinity #14

Open ctk21 opened 4 years ago

ctk21 commented 4 years ago

It might be useful (at least experimentally) to add an interface that allows the user to easily:

Open questions for this are:

haesbaert commented 2 years ago

I'd like to take this up. I wrote the CPU topology code used in OpenBSD ages ago, so hopefully I can contribute something. @dra27 and @kayceesrk expressed interest in this, so I'm CCing them.

What follows is just a collection of thoughts, so take it all with a grain of salt, please :).

Topology

At the most basic level, we could provide an SMT (thread) count, a core count and a package (socket) count, along with the relationships between them. A more elaborate approach would have to consider core asymmetry (as in Intel 12th gen and modern arm64), as well as cache groups like AMD's CCX. Maybe num_performance_cores and num_efficient_cores can become a thing. Online vs offline CPUs need some investigation; I'm not sure the distinction is actually meaningful anywhere.

Domainslib could provide this, but I don't think it belongs in the standard library. The standard library could provide num_threads at most, as anything else would impose too many requirements on the underlying system. num_threads can usually be retrieved with a sysconf and/or sysctl call and is widely available.

To build the topology on x86 we need to be able to at least retrieve the apicid of each SMT thread. If that's possible, we can build the smt<>core<>package relationship without resorting to more operating system support like sysctl/sysconf and whatnot. That would be a more democratic approach: we loop over all threads, set affinity to each one in turn, call cpuid to retrieve the current apicid, and then build the relationship tree.
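A minimal sketch of the last step (turning collected apicids into a relationship tree) might look like the following. It assumes the (cpu, apicid) pairs have already been gathered by the pin-and-cpuid loop, which is not shown, and that the number of APIC ID bits used for the SMT and core fields is known. The names `location`, `decode`, `build_topology`, `smt_bits` and `core_bits` are hypothetical and not part of any existing library.

```ocaml
(* Hypothetical record: where a logical CPU sits in the package/core/SMT tree. *)
type location = { package : int; core : int; smt : int }

(* Split an APIC ID into its fields. [smt_bits] and [core_bits] are assumed
   inputs; on real hardware they would be queried, not hard-coded. *)
let decode ~smt_bits ~core_bits apicid =
  let mask bits = (1 lsl bits) - 1 in
  { smt = apicid land mask smt_bits;
    core = (apicid lsr smt_bits) land mask core_bits;
    package = apicid lsr (smt_bits + core_bits) }

(* Group logical CPU numbers by (package, core). Each entry of the result
   lists the SMT siblings that share one physical core. *)
let build_topology ~smt_bits ~core_bits (apicids : (int * int) list) =
  let table = Hashtbl.create 16 in
  List.iter
    (fun (cpu, apicid) ->
      let loc = decode ~smt_bits ~core_bits apicid in
      let key = (loc.package, loc.core) in
      let siblings = Option.value (Hashtbl.find_opt table key) ~default:[] in
      Hashtbl.replace table key (cpu :: siblings))
    apicids;
  Hashtbl.fold
    (fun key cpus acc -> (key, List.sort compare cpus) :: acc)
    table []
  |> List.sort compare
```

With one SMT bit and two core bits, the pairs `[(0, 0); (1, 1); (2, 2); (3, 3)]` would be grouped into two cores of package 0, each holding two sibling cpus.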

We can't assume x86 or POSIX, and we can't cover every operating system and architecture, so we might consider making Domainslib able to fail on any of the queries. I think this is better than silently returning a single core.

I haven't dived into Windows, but I'm sure it provides this information somewhere. I'll write what I know off the top of my head:

Topology retrieval, from top of my head

Except for OpenBSD we can also just build our own by retrieving the pinning and fetching the apicid for each core (for x86).

Nomenclature

Naming this is kinda tricky. People use threads, cores and cpus with different meanings, and a lot of the jargon is inherited from x86, like smt (maybe sparc!?) and packages; at any rate this must be considered. I personally like the idea of calling cpu the logical thing, as in the actual thread, but then thread becomes a synonym for cpu, which is bad. Also, cpu is more often than not used to mean processor, which in turn is way less ambiguous.

NUMA

I don't think any operating system provides a userland process with any active toggles other than CPU affinity, so there is no way to pin memory zones (as in the zones the ACPI table gives you). Usually what they do (at least Linux/FreeBSD/NetBSD) is try to map pages belonging to a memory zone of the affinity cpu, so if you say set_affinity(cpu1) it will try to map the pages "closer" to the die containing cpu1.

edit: I'm completely wrong here, Linux has set_mempolicy(2), mbind(2) and more.

Affinity

Domainslib could provide something like Cpuset.t -> ('a, 'e) result. I think we need to be able to fail, since not every operating system provides affinity calls; some POSIX systems just fail silently on pthread_setaffinity_np(3), though. At any rate we should tell the caller that "we can't set the affinity" if possible.
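To make the shape of such a fallible call concrete, here is a rough sketch. Everything in it is an assumption for illustration: the `Cpuset` module, the `error` type, the `backend` parameter (a stand-in for whatever OS-specific call would do the real work) and `num_cpus` are hypothetical names, not an existing Domainslib API.

```ocaml
(* Hypothetical sketch; none of these names come from Domainslib. *)
module Cpuset = struct
  type t = int list (* logical CPU numbers *)

  (* Normalise a list of CPU numbers: sorted, without duplicates. *)
  let of_list cpus = List.sort_uniq compare cpus
end

type error =
  | Unsupported        (* the platform has no affinity call at all *)
  | Invalid_cpu of int (* a requested CPU does not exist *)

(* [backend] stands in for the OS-specific call; leaving it out models a
   platform without affinity support, which is reported, not ignored. *)
let set_affinity ?backend ~num_cpus (set : Cpuset.t) : (unit, error) result =
  match backend with
  | None -> Error Unsupported
  | Some apply ->
    (match List.find_opt (fun cpu -> cpu < 0 || cpu >= num_cpus) set with
     | Some bad -> Error (Invalid_cpu bad)
     | None -> apply set; Ok ())
```

A caller can then match on the result and decide whether to carry on unpinned or to abort, rather than assume the pinning happened.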

Set affinity support, from top of my head

haesbaert commented 2 years ago

So I'm building this: https://github.com/haesbaert/ocaml-cpu

So far it only supports getting the number of threads and setting and getting cpu affinity. It works without multicore/Domains and only on Linux, but I'll work on the rest.

num_threads: unit -> int
set_affinity: int list -> unit
get_affinity: unit -> int list

kayceesrk commented 2 years ago

Thanks @haesbaert. In general, we're avoiding the term "threads" in OCaml 5 since we have multiple notions of threading -- domains, fibers and systhreads. I'm leaning towards num_cores, where I'm using "cores" as a proxy for available units of parallelism. If hyperthreading is enabled, does num_cores return the number of hardware threads or physical cores?

Before we expose set_affinity and get_affinity, do we know that it is useful in practice for programs that use domainslib today? We have sandmark nightly benchmarking runs on the 64 core, 128 thread navajo machine: https://sandmark.tarides.com. We can experiment with affinity there. Also, the API of *et_affinity may need to operate on the pool abstraction.

haesbaert commented 2 years ago

If we avoid the term "threads" I think "cpus" should be used to refer to "logical cpus/threads". I believe most people associate "num_cores" with an actual core (as in the parent of threads).

As discussed on Slack, most OSes usually just give us a "get_cpu_count" which returns the total number of logical cpus (aka threads). So if hyperthreading is enabled it will return twice the number of cores; if disabled, threads == cores.

From now on I'll refer to "cpus" as in: total logical cpus available (total number of threads). Retrieving anything more than "total number of cpus" is very OS/MD dependent. On Linux it would involve parsing /proc or making a trip to each cpu and implementing the CPUID dance ourselves. Parsing /proc is ugly and won't work in chroot environments. Doing the CPUID dance ourselves is a bit tricky, first because it's completely MD, second because the CPUID leaves tend to change, sometimes even between Intel and AMD.
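For a sense of what the /proc route involves, here is a rough sketch of a parser over /proc/cpuinfo-style text. It assumes the "processor", "physical id" and "core id" fields as they appear on x86 Linux; other architectures lay the file out differently, which is part of why this is ugly. The function names are hypothetical, and the text is passed in as a string so that reading the file (and coping with its absence in a chroot) stays with the caller.

```ocaml
(* Split a "key : value" line into a trimmed pair, if it has a colon. *)
let field line =
  match String.index_opt line ':' with
  | None -> None
  | Some i ->
    let key = String.trim (String.sub line 0 i) in
    let value =
      String.trim (String.sub line (i + 1) (String.length line - i - 1))
    in
    Some (key, value)

(* Return one (processor, physical id, core id) triple per logical cpu.
   Entries missing any of the three fields are dropped. *)
let parse_cpuinfo (text : string) : (int * int * int) list =
  let flush (cpu, pkg, core) acc =
    match cpu, pkg, core with
    | Some c, Some p, Some k -> (c, p, k) :: acc
    | _ -> acc
  in
  let step (cur, acc) line =
    match field line with
    | Some ("processor", v) ->
      (* A "processor" line starts a new entry, so emit the previous one. *)
      ((int_of_string_opt v, None, None), flush cur acc)
    | Some ("physical id", v) ->
      let (c, _, k) = cur in ((c, int_of_string_opt v, k), acc)
    | Some ("core id", v) ->
      let (c, p, _) = cur in ((c, p, int_of_string_opt v), acc)
    | _ -> (cur, acc)
  in
  let cur, acc =
    List.fold_left step ((None, None, None), [])
      (String.split_on_char '\n' text)
  in
  List.rev (flush cur acc)

(* Count distinct (physical id, core id) pairs, i.e. physical cores. *)
let count_cores entries =
  List.map (fun (_, pkg, core) -> (pkg, core)) entries
  |> List.sort_uniq compare
  |> List.length
```

Comparing `List.length (parse_cpuinfo text)` with `count_cores (parse_cpuinfo text)` gives the logical cpu count against the physical core count for the parsed text.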

> Before we expose set_affinity and get_affinity, do we know that it is useful in practice for programs that use domainslib today? We have sandmark nightly benchmarking runs on the 64 core, 128 thread navajo machine: https://sandmark.tarides.com. We can experiment with affinity there. Also, the API of *et_affinity may need to operate on the pool abstraction.

I'm not sure, it probably depends a lot on the OS and socket configuration, I assume Linux does a decent job at keeping the pthreads on cpus of the same socket.

There is a social aspect to this discussion. I think that no project out there is exposing much more than "number of cpus"; I guess the affinity/pinning, when relevant, is done outside via taskset and similar. If you're reaching the point where you're using affinity, you probably know your architecture well enough to reason about it (as in: it's your job to read /proc). That would be the case for https://sandmark.tarides.com where we know beforehand where each cpu/core/socket is.

haesbaert commented 2 years ago

I have released https://github.com/haesbaert/ocaml-processor which hopefully addresses this.