For systems that don't have an L3 descriptor, I originally used --map-by socket to map MPI processes, thinking that would be roughly equivalent. That turns out to be wrong under KVM, where each core is also presented as its own socket: all of an MPI process's OpenMP threads would then be mapped to exactly one core, which is not at all what we want. Fortunately, --map-by numa appears to work properly.
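
The fallback described above can be sketched roughly as follows. This is an illustration only, not the actual change: the helper names, the has_l3 flag, and the ./app binary are hypothetical, and it assumes that systems which do have an L3 descriptor are mapped with --map-by l3cache.

```python
from typing import List


def choose_map_by(has_l3: bool) -> str:
    """Pick an Open MPI --map-by policy (hypothetical helper).

    Assumption: systems that expose an L3 descriptor are mapped by
    l3cache. Everything else falls back to numa rather than socket,
    because under KVM each core can also appear as its own socket.
    """
    return "l3cache" if has_l3 else "numa"


def build_mpirun_args(nprocs: int, has_l3: bool, exe: str) -> List[str]:
    """Assemble an mpirun command line as an argument list (hypothetical helper)."""
    return ["mpirun", "-np", str(nprocs), "--map-by", choose_map_by(has_l3), exe]


if __name__ == "__main__":
    # Example: a KVM guest with no L3 descriptor, 4 ranks of a placeholder binary.
    print(" ".join(build_mpirun_args(4, has_l3=False, exe="./app")))
```

How has_l3 gets determined is left open here; in practice it would come from whatever topology query the launcher already performs.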