open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

P vs E cores in Open MPI #11345

Open ggouaillardet opened 1 year ago

ggouaillardet commented 1 year ago

I just saw this question on Stack Overflow:

https://stackoverflow.com/questions/75240988/openmpi-and-ubuntu-22-04-support-for-using-all-e-and-p-cores-on-12th-gen-intel-c

TL;DR: on a system with 8 P cores (2 threads each) and 8 E cores (1 thread each), is there an (ideally user-friendly) way to tell Open MPI to only use the P cores?

@bgoglin what kind of support is provided by hwloc with respect to P vs E cores?

rhc54 commented 1 year ago

I remember raising this while at Intel - IIRC, the answer was "nobody should be using these processors for MPI". Not really designed for that purpose. Best we could devise was to use the "pe-list" option to select only the p-cores as the e-cores are pretty useless for this application. It's a workaround, but probably your best answer if you insist on using such processors for HPC.

My guess is that someone is just trying to run code on a laptop for test purposes - in which case, restricting to the p-cores is probably just fine.

ggouaillardet commented 1 year ago

I am fine with using only the P cores for Open MPI.

I do not have access to such a processor and I do not know how hwloc presents it to Open MPI. Is it seen as an 8 (P) core system, or as a 12 (8P + 4E) core system?

ggouaillardet commented 1 year ago

FWIW, I asked the user to run mpirun --display-map -np 1 true to check whether Open MPI sees the E cores.

rhc54 commented 1 year ago

I honestly don't know how it is presented. I couldn't get the processor team to have any interest in hwloc support back then. The processor was designed in partnership with Microsoft specifically for Windows (which MS custom optimized for it), and MS had no interest in hwloc support.

I'm guessing hwloc should still be able to read something on it anyway. If they have hwloc on that box, then just have them run lstopo and provide the output from it - that's all we get anyway.

ggouaillardet commented 1 year ago
$ mpirun --display-map -n 1 true

 Data for JOB [3978,1] offset 0 Total slots allocated 12

========================   JOB MAP ========================

Data for node: xxxxxxxx Num slots: 12   Max slots: 0    Num procs: 1
   Process OMPI jobid: [3978,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]]:[BB/../../../../../../../../../../..]
============================================================= 

there is something fishy here: according to the description, it should be 16 cores (8+8, unlike 8+4 I wrote earlier) and 24 threads (8*2+8), but Open MPI does not report this.

I am now clarifying this, and I guess I'll then have to wait for @bgoglin's insights.

bgoglin commented 1 year ago

Hello. hwloc reports different "cpukinds" (a cpuset + some info). We don't tell you explicitly which one is P or E (sometimes there are 3 kinds on ARM already), but kinds are reported in an order that goes from power-efficient cores to power-hungry cores. This is in hwloc/cpukinds.h since hwloc 2.4. You likely want to call hwloc_cpukinds_get_nr(topo, 0) to get the number of kinds, and then call hwloc_cpukinds_get_info(topo, nr-1, cpuset, NULL, NULL, NULL, 0) to get your (pre-allocated) cpuset filled with the list of power-hungry cores. This should work on Windows, Mac and Linux on ARM, Intel AlderLake and M1 although the way we detect heterogeneity is completely different in all these cases.
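
Something like the following (just a rough, untested sketch, assuming hwloc >= 2.4 headers; adapt the error handling) enumerates the kinds and prints each kind's efficiency and cpuset:

#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>
#include <hwloc/cpukinds.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* number of CPU kinds: 1 on homogeneous machines, more on hybrid ones */
    int nr = hwloc_cpukinds_get_nr(topo, 0);
    printf("%d cpu kind(s) reported\n", nr);

    hwloc_bitmap_t cpuset = hwloc_bitmap_alloc();
    for (int i = 0; i < nr; i++) {
        int efficiency;
        /* kinds are ordered from power-efficient to power-hungry when ranking is known */
        if (hwloc_cpukinds_get_info(topo, i, cpuset, &efficiency, NULL, NULL, 0) == 0) {
            char *s;
            hwloc_bitmap_asprintf(&s, cpuset);
            printf("kind %d: efficiency=%d cpuset=%s\n", i, efficiency, s);
            free(s);
        }
    }
    hwloc_bitmap_free(cpuset);
    hwloc_topology_destroy(topo);
    return 0;
}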

ggouaillardet commented 1 year ago

Thanks @bgoglin, I will experiment on an M1 (since that is all I have) to see how I can "hide" the E cores from Open MPI.

ggouaillardet commented 1 year ago

@bgoglin just to be clear, does hwloc guarantee the highest cpukind (i.e. hwloc_cpukinds_get_nr(...) - 1) is for the power-hungry (e.g. P) cores?

bgoglin commented 1 year ago
 * If hwloc fails to rank any kind, for instance because the operating
 * system does not expose efficiencies and core frequencies,
 * all kinds will have an unknown efficiency (\c -1),
 * and they are not indexed/ordered in any specific way.

So when you call get_info(), pass an "int efficiency" in hwloc_cpukinds_get_info(topo, nr-1, cpuset, &efficiency, NULL, NULL, 0) and check whether you get -1 in there.
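
For illustration, here is a minimal sketch along those lines (a hypothetical standalone helper, not part of hwloc or Open MPI) that prints the PUs of the highest-ranked kind as a comma-separated list one could feed to the pe-list style option mentioned earlier, falling back to all PUs when the ranking is unknown:

#include <stdio.h>
#include <hwloc.h>
#include <hwloc/cpukinds.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    hwloc_bitmap_t cpuset = hwloc_bitmap_alloc();
    int nr = hwloc_cpukinds_get_nr(topo, 0);
    int efficiency = -1;

    /* the last kind should be the power-hungry (P) cores when ranking worked */
    if (nr > 0)
        hwloc_cpukinds_get_info(topo, nr - 1, cpuset, &efficiency, NULL, NULL, 0);
    if (nr <= 0 || efficiency == -1) {
        /* ranking unknown (-1): fall back to every PU in the machine */
        hwloc_bitmap_copy(cpuset, hwloc_topology_get_topology_cpuset(topo));
    }

    /* print PU indices as a comma-separated list, e.g. "0,1,2,...,15" */
    int first = 1;
    for (int id = hwloc_bitmap_first(cpuset); id != -1; id = hwloc_bitmap_next(cpuset, id)) {
        printf(first ? "%d" : ",%d", id);
        first = 0;
    }
    printf("\n");

    hwloc_bitmap_free(cpuset);
    hwloc_topology_destroy(topo);
    return 0;
}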

rhc54 commented 1 year ago

there is something fishy here: according to the description, it should be 16 cores (8+8, unlike 8+4 I wrote earlier) and 24 threads (8*2+8), but Open MPI does not report this.

You cannot trust those dots, @ggouaillardet - the print statement isn't that accurate (it's actually rather dumb, to be honest).

how I can "hide" the E cores from Open MPI.

I already told you - you just have to list the PEs you want to use. It would take a significant change to PRRTE (or ORTE for an older version) to try and do this internally. I doubt it would be worth the effort - like I said, these chips are not intended for HPC, and won't run well in that environment.

liuzheng-arctic commented 1 year ago

Thanks for helping me post my question here. I didn't intend to run a real HPC job on this laptop, but I want to take advantage of the multiple cores to speed up some data processing (40k+ satellite data files and 200k+ model output files, less than 100M each). The processing is pretty repetitive and is perfect for lazy parallelization. The issue is that OpenMPI does not recognize the cores correctly, so I am not sure how it does the scheduling. OpenMPI complains when I set -np to more than 12. I don't want to have more threads on a single core, especially on an E core.

It would be great if I could use all 16 cores. If not, having some control over which cores to use would be ideal, for example using the P cores for faster processing and the E cores for thermal concerns.

rhc54 commented 1 year ago

Yeah, I kinda figured that was the situation. You have a few simple options:

liuzheng-arctic commented 1 year ago

Thanks for the reply, although I am not sure I can follow. What confuses me is that OpenMPI (and/or Ubuntu 22.04) can only see 12 cores (12x2 threads=24) although there are actually 16 cores (8x2+8x1=24 threads). If it gets the total number of cores wrong, it may mess up the scheduling to the cores too (missing four cores).

If OpenMPI can only see 12 cores, I assume mpirun -np 12 myapp should use the 12 cores with one process per core.

What if I want to use all 16 cores, with one process on each? OpenMPI complains if I use mpirun -n 16 directly because it only sees 12 cores. If I use mpirun -n 16 --map-by hwthread:oversubscribe --bind-to none myapp, will this be one process per core? I am worried that the OS or OpenMPI will only use the 12 cores it can see and have multiple threads on some of them, P cores or E cores.

Another question is why?

rhc54 commented 1 year ago

You are overthinking things 😄

If you simply run mpirun -n 100 --oversubscribe you will launch 100 processes, none of them bound to any particular core. The OS will schedule as many of them at a time as it can fit onto CPUs, cycling time slices across all the procs in some kind of load-balanced manner. It will do this in a way that balances thermal load while providing best possible use of the cpu cycles.

You shouldn't care what hyperthread gets used for any given time slice by whatever process is being executed during that time slice. The OS will manage all that for you. This is what the OS does really well.

Trying to do any better than that is a waste of your energy. It doesn't matter what mpirun "sees" or doesn't "see". Its sole purpose is to start N procs, and then get out of the way and let the OS do its thing. Asking mpirun to try and optimize placement and binding on this kind of processor will only yield worse results.

liuzheng-arctic commented 1 year ago

Thanks, @rhc54. I was worried that the OS is confused too, because Ubuntu 22.04 (kernel 5.15.79.1-microsoft-standard-WSL2) also sees only 12 cores (24 threads), although the host Windows 11 recognizes the CPU correctly. lscpu returns the following:

Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           39 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  24
On-line CPU(s) list:     0-23
Vendor ID:               GenuineIntel
Model name:              12th Gen Intel(R) Core(TM) i7-12800HX
CPU family:              6
Model:                   151
Thread(s) per core:      2
Core(s) per socket:      12
Socket(s):               1
Stepping:                2
BogoMIPS:                4608.01
Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm serialize flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
  Hypervisor vendor:     Microsoft
  Virtualization type:   full
Caches (sum of all):
  L1d:                   576 KiB (12 instances)
  L1i:                   384 KiB (12 instances)
  L2:                    15 MiB (12 instances)
  L3:                    25 MiB (1 instance)
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Mitigation; Enhanced IBRS
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected

rhc54 commented 1 year ago

Understood. The problem is that we cannot do any better than your OS is doing. No matter what options you pass to mpirun, I'm limited to what the OS thinks is present.

What you are seeing is the difference between Windows (being optimized to work with this architecture) and Ubuntu (which isn't). There is nothing anyone can do about that, I'm afraid - unless someone at Ubuntu wants to optimize the OS for this architecture, which I very much doubt.

Your only other option would be to switch to Microsoft's MPI, which operates under Windows. I don't know their licensing structure and it has been a long time since I heard from that team (so this product might not even still exist) - but if you can get a copy, that would support this chip.

Otherwise, the best you can do is like I said - just run it oversubscribed (with however many procs you think can run effectively - probably an experiment) and let the OS do the best it can.

ggouaillardet commented 1 year ago

Are you running native Linux? Or are you running Linux in a virtual machine (or WSL)?

If the latter, that could explain why hwloc believes this is a 12-core / 24-hyperthread system:

Virtualization: VT-x Hypervisor vendor: Microsoft Virtualization type: full

liuzheng-arctic commented 1 year ago

@ggouaillardet I am using Ubuntu 22.04 in WSL2. The kernel version is 5.15.79.1-microsoft-standard-WSL2. I checked again using an Ubuntu 22.04 USB boot drive, and it indeed sees all 16 cores. I thought WSL only limited the amount of RAM, not the number of cores.

bgoglin commented 1 year ago

Last time I saw hwloc running on WSL on Windows, Windows/Linux was reporting correct information in sysfs, hence hwloc did too. But I never tried on a hybrid machine. What's wrong above is lscpu, either because Windows/Linux reports something wrong, or because lscpu isn't hybrid-aware yet. It sees 24 threads in the socket, 2 threads in the first core, and decides that means 24/2 = 12 cores. Running lstopo would clarify this. Or at least: cat /sys/devices/system/cpu/cpu*/topology/thread_siblings

rhc54 commented 1 year ago

I'm not sure I agree with the assertion that lscpu is doing something "wrong". WSL isn't "limiting" the number of cores - it is simply logically grouping the available hyperthreads into two-HT "cores" - i.e., you have 12 "cores", each with 2 HTs. Native Ubuntu is logically grouping them into 8 "cores" each with 2 HTs, and 8 "cores" each with 1 HT. It all just depends on how the OS intends to manage/schedule the HTs. Neither is "correct" or "wrong" - they are just grouped differently.

If you have hyperthreading enabled (which you kinda have to do with this processor), it really won't matter as the kernel scheduling will be done at the HT level - so how they are grouped is irrelevant. What matters is if and how the kernel is scheduling the p-cores differently from the e-cores.

IIRC, Windows was customized to put compute-heavy process threads on the p-cores, and lighter operations on the e-cores. So as your job continued executing, it would migrate the more intense operations to the p-cores (e.g., your computational threads) and the less intense ones to the e-cores (e.g., threads performing disk IO, progress threads that tend to utilize little time, system daemons).

I'm not sure how Ubuntu is optimized - probably not as well customized, so it may just treat everything as equal and schedule a thread for execution on whatever hyperthread is available. Or it may do something similar to what Windows is doing.

Point being: the processor was designed with the expectation that the OS would migrate process threads to the "proper" HT for the kind of operations they perform. In this architecture, the worst thing you can do is to try and preempt that migration. Best to just let the OS do its job. You just need to add the "oversubscribe" qualifier to the --map-by directive so that mpirun won't error out if you launch more procs than there are "cores" (or HTs if you pass the --use-hwthread-cpus option).

liuzheng-arctic commented 1 year ago

@bgoglin I think you are right that it might not be due to WSL limiting the number of available cores. If WSL limited the number of cores, it shouldn't see 24 threads. But lscpu returns the correct number of cores on the Ubuntu 22.04 USB boot drive, so something else is wrong, and it affects the number of cores available to OpenMPI under WSL.

ggouaillardet commented 1 year ago

The number of threads (24) is correct, so WSL is not limiting anything. But the topology might be altered (IIRC I saw that with KVM or VirtualBox): it shows 12x2 instead of 8x2+8.

Here is the info I requested on SO:

$ lstopo-no-graphics --version
$ lstopo-no-graphics --cpukinds
$ lstopo-no-graphics --no-io --of xml
rhc54 commented 1 year ago

it affects the number of cores available to OpenMPI under WSL

We seem to be spending a lot of time chasing ghosts on this thread, so I'll just crawl back under my rock. There is no limitation being set here. OMPI sees the same number of HTs on each system you have tried. mpirun just needs to be told to consider HTs as independent cpus so it can set oversubscription correctly. You don't want to bind your procs - you need to let the OS manage them for you. That is how the processor was designed to be used.

<me pulling the rock over my head>

liuzheng-arctic commented 1 year ago

@bgoglin

$ lstopo-no-graphics --version returns:
lstopo-no-graphics 2.7.0

$ lstopo-no-graphics --cpukinds returns nothing.

$ lstopo-no-graphics --no-io --of xml output is attached in the log file: lstopo_output.log

liuzheng-arctic commented 1 year ago

@rhc54 Thanks a lot for the explanations! I think I am more at ease using Open MPI on this machine now.

ggouaillardet commented 1 year ago

Thanks, I confirm hwloc sees a single socket with 12 cores and 2 hyperthreads per core, so I guess WSL does not "pass through" the actual processor topology.

So I am afraid there is no trivial way to use 8P + 8E cores (e.g. ignore the second hyperthread on the P cores). Bottom line: run mpirun --use-hwthread-cpus --bind-to none and let the OS (Linux via WSL) schedule the MPI tasks.

liuzheng-arctic commented 1 year ago

@ggouaillardet Thanks a ton! This helps a lot.