radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

respect numa-domain boundaries in agent scheduler #2990

Closed andre-merzky closed 9 months ago

andre-merzky commented 1 year ago

See https://github.com/radical-cybertools/radical.entk/issues/641

andre-merzky commented 1 year ago

This turns out to be a bit trickier than I thought. Well, what else is new under the sun?

First, as correctly pointed out during the call, we cannot assume that NUMA domains are contiguous segments of hardware resources. Instead, we need to introduce an explicit mapping of resources to NUMA domains.

Second, we also need to recalculate resource offsets for the virtual NUMA nodes. Consider CUDA_VISIBLE_DEVICES for a NUMA domain that contains the second GPU of a node. If we simply treat the NUMA domain as its own virtual node, then the GPU in that domain will have index 0 and CUDA_VISIBLE_DEVICES will be set to that value. From the system's perspective, however, that GPU still has index 1, which is inconsistent. The same holds for pinned core IDs.
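To make the inconsistency concrete, here is a minimal Python sketch; the two-GPU node and the indices are made up for illustration and are not taken from the RP code base:

```python
# Suppose a node has two GPUs and the NUMA domain we schedule into contains
# only the *second* one (physical index 1).
numa_domain_gpus = [1]          # physical GPU indices owned by this domain

# Naive view: treat the domain as its own little node and enumerate its GPUs
# from zero -- the task would then see CUDA_VISIBLE_DEVICES=0 ...
naive_cvd = ','.join(str(i) for i in range(len(numa_domain_gpus)))
assert naive_cvd == '0'

# ... but the driver still identifies that GPU as device 1, so the value we
# actually need to export is the physical index:
correct_cvd = ','.join(str(i) for i in numa_domain_gpus)
assert correct_cvd == '1'
```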

Thus the mapping needs to (a) decide which hardware resources map to which NUMA domain, (b) map the virtual resource IDs within a NUMA node to the physical resource IDs on the compute node, and (c) apply that mapping-back somewhere before the slot information reaches the launch method (likely in the scheduler) -- see the sketch below.
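A hedged sketch of how (a), (b) and (c) could fit together; the structure and the names (`NUMA_MAP`, `to_physical`) are illustrative, not the actual RP implementation:

```python
# (a) decide which physical cores / GPUs belong to which NUMA domain
NUMA_MAP = {
    0: {'cores': [0, 1, 2, 3], 'gpus': [0]},
    1: {'cores': [4, 5, 6, 7], 'gpus': [1]},
}

def to_physical(domain, slot):
    '''
    (b) translate a slot expressed in *virtual* (per-domain) resource IDs
    into *physical* (per-node) resource IDs.
    '''
    dom = NUMA_MAP[domain]
    return {'cores': [dom['cores'][c] for c in slot['cores']],
            'gpus' : [dom['gpus'][g]  for g in slot['gpus']]}

# (c) the scheduler would apply this translation before the slot reaches the
# launch method, so that core pinning and CUDA_VISIBLE_DEVICES end up
# referring to physical IDs.
virtual_slot = {'cores': [0, 1], 'gpus': [0]}     # slot within domain 1
print(to_physical(1, virtual_slot))
# -> {'cores': [4, 5], 'gpus': [1]}
```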

andre-merzky commented 1 year ago

This needs three steps

andre-merzky commented 1 year ago

Copying from https://github.com/radical-cybertools/radical.entk/issues/641

@GKNB :

We are still working on this. One problem we see right now is the lack of rankfile support in Polaris' mpiexec. Without that support we cannot really enforce any layout determined by the scheduler, nor can we enact any specific layout provided by the end user. We are iterating with Polaris support on how to address this issue.

andre-merzky commented 1 year ago

@GKNB : did you get any information from Polaris support about rankfile support in mpiexec?

GKNB commented 11 months ago

Unfortunately no. Last time I contacted the Polaris support team, they didn't mention anything about the rankfile and suggested that I use --ppn / bind to perform the binding.
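For reference, a hedged sketch of what driving that workaround from the scheduler could look like; the `--cpu-bind list:` syntax is my reading of the PALS mpiexec documentation and should be verified against the Polaris docs, and the helper `cpu_bind_arg` is purely illustrative:

```python
def cpu_bind_arg(per_rank_cores):
    '''Build a --cpu-bind value from per-rank physical core lists,
    e.g. [[0, 1], [4, 5]] -> "list:0,1:4,5".'''
    return 'list:' + ':'.join(','.join(str(c) for c in cores)
                              for cores in per_rank_cores)

ranks = 2
cores = [[0, 1], [4, 5]]       # physical cores per rank (illustrative values)

cmd = ['mpiexec',
       '-n',        str(ranks),
       '--ppn',     str(ranks),
       '--cpu-bind', cpu_bind_arg(cores),
       './my_app']
print(' '.join(cmd))
# mpiexec -n 2 --ppn 2 --cpu-bind list:0,1:4,5 ./my_app
```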

andre-merzky commented 9 months ago

This is put on hold for the time being. A draft implementation is available under the tag feature/numa_domains.