open-mpi / hwloc

Hardware locality (hwloc)
https://www.open-mpi.org/projects/hwloc

Cannot bind to several memories #601

Closed antoine-morvan closed 1 year ago

antoine-morvan commented 1 year ago

What version of hwloc are you using?

hwloc & lstopo (version 2.9.2)

Which operating system and hardware are you running on?

Linux 4.18.0-372.26.1.el8_6.x86_64 #1 SMP Sat Aug 27 02:44:20 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

Intel Xeon 8358 (Ice Lake). Topology: icelake

Details of the problem

I am trying to bind processes so that they use a dedicated set of memory banks (in this case, NUMA nodes 1 and 3). Ideally, I would use something like this:

hwloc-bind \
    --cpubind all \
    --membind numa:1 numa:3 \
    $CMD

# or
hwloc-bind \
    --cpubind all \
    --membind numa:odd \
    $CMD

However, such commands result in only the first memory location being used (e.g., numa:1 only). For instance:

hwloc-bind \
    --cpubind all \
    --membind numa:odd \
    lstopo-no-graphics $LSTOPOARGS --pid 0 > file.svg

Produces this output:

[Image: icelake_numa_odd]

Whereas I was expecting all the odd numa memories to be used.

Just to make sure this is not a system limitation, I ran the same lstopo with numactl binding:

# note: numactl physical and logical numbering is the same in this example
numactl --membind 1,3 \
    lstopo-no-graphics $LSTOPOARGS

which gives me:

[Image: icelake_numactl]

Here we can see that numactl was able to bind memory to both nodes, as I expected.

Did I miss an argument in the hwloc-bind call?

Best.

bgoglin commented 1 year ago

Hello. Fasten your seat belt, this is a bit complicated.

There are two main ways to bind memory on Linux, MPOL_BIND and MPOL_PREFERRED (there's also INTERLEAVE but it doesn't matter here). numactl uses BIND by default (if the nodes you give are full, allocation fails). hwloc uses PREFERRED by default (if the nodes you give are full, allocation falls back to other nodes). You may pass the STRICT flag (or --strict on the command-line) to switch hwloc to BIND instead of PREFERRED.
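As a minimal sketch of the --strict variant described above, the command from the first post could be wrapped as follows. The wrapper name bind_strict and the HWLOC_BIND override are hypothetical helpers introduced for illustration, not part of hwloc. Only --strict, --cpubind and --membind come from this thread.

```shell
# Run a command with memory strictly bound to the given NUMA locations.
# Usage: bind_strict "numa:1 numa:3" command [args...]
# HWLOC_BIND is a hypothetical override (e.g. HWLOC_BIND=echo) for a dry run.
bind_strict() {
    nodes=$1
    shift
    # $nodes is intentionally unquoted so that "numa:1 numa:3" is split
    # into two separate location arguments.
    # --strict asks hwloc for a strict binding (MPOL_BIND on Linux)
    # instead of the default, which falls back to other nodes.
    ${HWLOC_BIND:-hwloc-bind} --strict --cpubind all --membind $nodes -- "$@"
}
```

For example, bind_strict "numa:1 numa:3" $CMD would launch $CMD with this binding, and prefixing the call with HWLOC_BIND=echo prints the arguments instead of running anything.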

Strictly speaking, the hwloc default isn't wrong: it allocates memory inside the mask you've given. The preferred capacity is indeed more limited than expected, but there is a fallback to other nodes if that capacity is exceeded.

The reason PREFERRED shows a single node is that the old implementation in Linux basically ignores all nodes but the first one in the mask you give. There's a new implementation called MPOL_PREFERRED_MANY in kernel 5.15 which would likely fix the behavior you reported, but I guess it's not available in your Red Hat kernel. If you try "numactl -p 1,3" instead of "numactl --membind 1,3", this tells numactl to use PREFERRED instead of BIND, and I guess it will fail because you're giving multiple nodes and the kernel doesn't support it.
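To get a rough idea of whether a machine is likely to have MPOL_PREFERRED_MANY, one could compare the running kernel release against 5.15. This is a sketch only: the helper name kernel_at_least is hypothetical, and the version number is a heuristic, since distribution kernels may backport or omit features regardless of what they report.

```shell
# Succeed if a kernel release string (default: output of `uname -r`)
# is at least MAJOR.MINOR. Heuristic only: vendors may backport features.
# Usage: kernel_at_least MAJOR MINOR [RELEASE]
kernel_at_least() {
    want_major=$1
    want_minor=$2
    release=${3:-$(uname -r)}
    major=${release%%.*}      # text before the first dot
    rest=${release#*.}        # text after the first dot
    minor=${rest%%[!0-9]*}    # leading digits of the remainder
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

if kernel_at_least 5 15; then
    echo "kernel reports >= 5.15: MPOL_PREFERRED_MANY may be available"
else
    echo "kernel reports < 5.15: expect the old single-node MPOL_PREFERRED behaviour"
fi
```

With the release string from the first post (4.18.0-372.26.1.el8_6.x86_64), the check takes the second branch.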

antoine-morvan commented 1 year ago

Hello, thanks for the details.

The --strict flag indeed fixes the issue described in the first post.

Those binding modes (bind, preferred, interleave) are detailed in some documentation I have read recently (https://www.intel.com/content/www/us/en/content-details/769060/intel-xeon-cpu-max-series-configuration-and-tuning-guide.html?DocID=769060 page 27, section 6.2.1).

This document mentions 4 modes for binding to memories using numactl:

I was expecting hwloc-bind to expose this kind of control via --mempolicy. Why is the --strict flag needed instead of exposing --mempolicy=preferred?

bgoglin commented 1 year ago

Because hwloc is not Linux specific :/ We try to keep the API portable (and simple). Other operating systems expose different policies, finding some sort of common denominator was very difficult. That said, I could try to better document things and/or add a Linux specific option such as hwloc-bind --linux-mempolicy=preferred/bind/interleave.

antoine-morvan commented 1 year ago

I see, thanks again for your time. I definitely agree the doc would greatly benefit from such additions 👍

My last 2 cents: if you are going to expose a Linux-specific flag (e.g., --linux-mempolicy, which breaks the 'common denominator'), why not expose Linux-specific options in the existing flag (--mempolicy)? :)

bgoglin commented 1 year ago

--mempolicy currently uses the hwloc terminology (bind/interleave/firsttouch/nexttouch). I'd need a way to tell whether people are asking for hwloc's "bind" policy or Linux's "bind" policy. It could be --mempolicy linux-bind or something like this.

By the way, if there are places in the doc that you already found unclear, please let me know. Usually these kinds of clarifications go in the hwloc-bind manpage and in the introduction of the "Memory binding" section in hwloc.h. I may add something in the doxygen text too ("CPU and Memory Binding Overview").

bgoglin commented 1 year ago

I pushed several updates to the doc in master, v2.x and v2.9, hopefully that will help avoid the confusion between policies.