openpmix / prrte

PMIx Reference RunTime Environment (PRRTE)
https://pmix.org

Restore code to ignore NUMA nodes with overlapping cpusets if heterogeneous memory #2025

Open bgoglin opened 1 week ago

bgoglin commented 1 week ago

Putting back all the details here so that they don't get lost.

Commit d23ac62339ab2435c30ac8c83c33ce4bcdfb73c8 added the ability to handle heterogeneous memory by looking at which NUMA nodes have the same or overlapping cpusets: we don't want to place one process per NUMA node if two nodes share the same CPUs. This was mostly done for POWER platforms with NVIDIA GPU memory exposed as NUMA nodes, but the code actually worked for all heterogeneous memory platforms known so far (KNL, Xeon Max, GraceHopper, DDR+NVM, DDR+CXL). That code was removed in 4b41c8e0a30befd21298b93c1aa12fcd6ac69b30.
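
For illustration, a minimal standalone hwloc sketch of that overlap check (not the PRRTE code; it only reports whether any two NUMA nodes expose intersecting cpusets):

```c
/* Illustrative sketch: detect whether any two NUMA nodes expose the same
 * or overlapping cpusets (e.g. GPU or HBM memory attached to CPUs that
 * already have a DRAM node). Not the actual PRRTE code. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int nnumas = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
    int overlap = 0;

    for (int i = 0; i < nnumas && !overlap; i++) {
        hwloc_obj_t a = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, i);
        for (int j = i + 1; j < nnumas; j++) {
            hwloc_obj_t b = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, j);
            if (hwloc_bitmap_intersects(a->cpuset, b->cpuset)) {
                /* heterogeneous memory: don't map one proc per NUMA node */
                overlap = 1;
                break;
            }
        }
    }
    printf("overlapping NUMA cpusets: %s\n", overlap ? "yes" : "no");
    hwloc_topology_destroy(topo);
    return 0;
}
```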

In the first case, NVIDIA GPU nodes on POWER are hardwired at NUMA os_index starting from 255 and going down. Hence we may cut off at 150 and only look at nodes before that. Earlier versions (2.0.1 for instance) did this; commits 5bc6079f6127e6bed6378acb22f9558ba198e9a6 and c414df039efd950b776e5aead5bf81602b6dc866 changed the cutoff os_index from 200 to 150.
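
A sketch of that old cutoff heuristic, using the 150 threshold from the commits above (illustrative only; the hardwired high os_index layout is specific to NVIDIA GPU memory on POWER):

```c
/* Old heuristic (POWER + NVIDIA GPU memory only): GPU NUMA nodes are
 * hardwired at high os_index values counting down from 255, so count
 * only the nodes below a cutoff. Illustrative sketch, not PRRTE code. */
#include <hwloc.h>

#define GPU_NUMA_OS_INDEX_CUTOFF 150  /* value from the commits cited above */

static int count_cpu_numa_nodes(hwloc_topology_t topo)
{
    int count = 0;
    hwloc_obj_t node = NULL;
    while ((node = hwloc_get_next_obj_by_type(topo, HWLOC_OBJ_NUMANODE, node)) != NULL) {
        if (node->os_index < GPU_NUMA_OS_INDEX_CUTOFF)
            count++;
    }
    return count;
}
```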

In the second case, "default" memory nodes (e.g. DRAM in the vast majority of cases, HBM if there's no DRAM, etc.) are listed first in ACPI tables and thus appear first by os_index too. Those first default nodes cannot have intersecting cpusets with each other (in ACPI and the Linux kernel). So the idea of d23ac62339ab2435c30ac8c83c33ce4bcdfb73c8 was to iterate over NUMA nodes starting from os_index 0 until we find one with an intersecting cpuset, and to stop before it.
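
Roughly, that approach could be sketched as below, assuming hwloc's logical NUMA ordering follows the ascending os_index/ACPI ordering described above (a sketch of the idea, not the actual commit):

```c
/* Sketch of the "stop at the first overlapping-cpuset node" idea:
 * default (DRAM-like) nodes come first and never intersect each other,
 * so everything before the first intersecting node is usable for
 * per-NUMA placement. Illustrative only. */
#include <hwloc.h>

static int count_default_numa_nodes(hwloc_topology_t topo)
{
    int nnumas = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
    int count = 0;

    for (int i = 0; i < nnumas; i++) {
        hwloc_obj_t node = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, i);
        int intersects = 0;
        for (int j = 0; j < i; j++) {
            hwloc_obj_t prev = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, j);
            if (hwloc_bitmap_intersects(node->cpuset, prev->cpuset)) {
                intersects = 1;
                break;
            }
        }
        if (intersects)
            break;      /* first heterogeneous-memory node: stop here */
        count++;
    }
    return count;       /* number of leading, non-overlapping "default" nodes */
}
```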

Recent hwloc versions may also help since they are able to detect the kind of NUMA node. Starting with 2.8, nodes may have a "DRAM" subtype (or "HBM", etc.). The detection was improved in 2.10, which also provides a "MemoryTier" info attribute to identify similar nodes. Tiers are sorted by performance, so you would get HBM first, then DRAM, then GPU memory, etc. Taking the first tier should be fine since it's unlikely we'll get a machine with HBM only near a subset of the CPUs.
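
A sketch of how those attributes could be queried, assuming hwloc >= 2.10 for the "MemoryTier" info attribute and >= 2.8 for the subtype (the selection logic and output are illustrative, not a proposed PRRTE patch):

```c
/* Sketch: select NUMA nodes belonging to the best (lowest-numbered)
 * memory tier reported by hwloc >= 2.10; the subtype field (hwloc >= 2.8)
 * gives a human-readable kind such as "DRAM" or "HBM". Illustrative only. */
#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int nnumas = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
    long best_tier = -1;

    /* first pass: find the best (lowest) tier value present, if any */
    for (int i = 0; i < nnumas; i++) {
        hwloc_obj_t node = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, i);
        const char *tier = hwloc_obj_get_info_by_name(node, "MemoryTier");
        if (tier) {
            long t = strtol(tier, NULL, 10);
            if (best_tier < 0 || t < best_tier)
                best_tier = t;
        }
    }

    /* second pass: report nodes in that tier (or all nodes if no tier info) */
    for (int i = 0; i < nnumas; i++) {
        hwloc_obj_t node = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, i);
        const char *tier = hwloc_obj_get_info_by_name(node, "MemoryTier");
        if (best_tier < 0 || (tier && strtol(tier, NULL, 10) == best_tier))
            printf("node os_index=%u subtype=%s tier=%s\n",
                   node->os_index,
                   node->subtype ? node->subtype : "(none)",
                   tier ? tier : "(none)");
    }

    hwloc_topology_destroy(topo);
    return 0;
}
```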

rhc54 commented 1 week ago

Might as well add this comment from the Slack thread before they delete it:

this comment is wrong:
+ * Fortunately, the OS index of non-CPU NUMA domains starts
+ * at 255 and counts downward (at least for GPUs) - while
+ * the index of CPU NUMA domains starts at 0 and counts
+ * upward. We can therefore separate the two by excluding
+ * NUMA domains with an OS index above a certain level.
+ * The following constant is based solely on checking a
+ * few systems, but hopefully proves correct in general.
it's only true for NVIDIA GPUs on POWER. not true on GraceHopper,
or on platforms with heterogeneous memory. The original idea was
to cut after the first overlapping-cpuset node because that works in
all known cases (because DDR is always first, followed by everything
else including HBM, NVM or GraceHopper GPU memory, and (I think)
Frontier GPU memory).
rhc54 commented 1 week ago

Like I said, all getting very complicated - may be beyond the time I have left. Perhaps beyond what we can reasonably expect an RTE to do. Could be we are getting to the point where automatic placement algorithms are reaching their limit and users are going to just have to start manually placing things...sigh.