Open ggouaillardet opened 1 year ago
I remember raising this while at Intel - IIRC, the answer was "nobody should be using these processors for MPI". Not really designed for that purpose. Best we could devise was to use the "pe-list" option to select only the p-cores as the e-cores are pretty useless for this application. It's a workaround, but probably your best answer if you insist on using such processors for HPC.
My guess is that someone is just trying to run code on a laptop for test purposes - in which case, restricting to the p-cores is probably just fine.
I am fine with using only the P cores for Open MPI.
I do not have access to such a processor and I do not know how hwloc presents it to Open MPI. Is it seen as an 8-core (8P) system, or as a 12-core (8P + 4E) system?
FWIW, I asked the user to run mpirun --display-map -np 1 true to check whether Open MPI sees the E cores.
I honestly don't know how it is presented. I couldn't get the processor team to have any interest in hwloc support back then. The processor was designed in partnership with Microsoft specifically for Windows (which MS custom optimized for it), and MS had no interest in hwloc support.
I'm guessing hwloc should still be able to read something on it anyway. If they have hwloc on that box, then just have them run lstopo
and provide the output from it - that's all we get anyway.
$ mpirun --display-map -n 1 true
Data for JOB [3978,1] offset 0 Total slots allocated 12
======================== JOB MAP ========================
Data for node: xxxxxxxx Num slots: 12 Max slots: 0 Num procs: 1
Process OMPI jobid: [3978,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]]:[BB/../../../../../../../../../../..]
=============================================================
there is something fishy here: according to the processor description, it should be 16 cores (8+8, not the 8+4 I wrote earlier) and 24 threads (8x2+8), but Open MPI does not report this.
I am now clarifying this, and I guess I'll then have to wait for @bgoglin's insights.
Hello. hwloc reports different "cpukinds" (a cpuset + some info). We don't tell you explicitly which one is P or E (sometimes there are 3 kinds on ARM already), but kinds are reported in an order that goes from power-efficient cores to power-hungry cores. This is in hwloc/cpukinds.h since hwloc 2.4. You likely want to call hwloc_cpukinds_get_nr(topo, 0) to get the number of kinds, and then call hwloc_cpukinds_get_info(topo, nr-1, cpuset, NULL, NULL, NULL, 0) to get your (pre-allocated) cpuset filled with the list of power-hungry cores. This should work on Windows, Mac and Linux on ARM, Intel AlderLake and M1 although the way we detect heterogeneity is completely different in all these cases.
Thanks @bgoglin, I will experiment on an M1 (since this is all I have) to see how I can "hide" the E cores from Open MPI.
@bgoglin just to be clear, does hwloc guarantee that the highest cpukind (e.g. hwloc_cpukinds_get_nr(...) - 1) is for the power-hungry (e.g. P) cores?
* If hwloc fails to rank any kind, for instance because the operating
* system does not expose efficiencies and core frequencies,
* all kinds will have an unknown efficiency (\c -1),
* and they are not indexed/ordered in any specific way.
So when you call get_info(), pass an "int efficiency" in hwloc_cpukinds_get_info(topo, nr-1, cpuset, &efficiency, NULL, NULL, 0) and check whether you get -1 in there.
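Putting the two calls together, a minimal sketch of the lookup might look like this (assuming hwloc >= 2.4; the printing is just for illustration, and error handling is omitted):

```c
/* sketch.c - query the most power-hungry cpukind via hwloc.
 * Compile with something like: cc sketch.c -lhwloc */
#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int nr = hwloc_cpukinds_get_nr(topo, 0);
    if (nr <= 1) {
        /* homogeneous processor, or the OS exposed no kind information */
        printf("found %d cpukind(s), nothing to filter\n", nr);
    } else {
        hwloc_bitmap_t cpuset = hwloc_bitmap_alloc();
        int efficiency;
        /* kinds are ordered from power-efficient to power-hungry,
         * so the last index should hold the P cores */
        hwloc_cpukinds_get_info(topo, nr - 1, cpuset,
                                &efficiency, NULL, NULL, 0);
        if (efficiency == -1) {
            /* the OS did not expose enough info to rank the kinds */
            printf("kinds exist but could not be ranked\n");
        } else {
            char *str;
            hwloc_bitmap_asprintf(&str, cpuset);
            printf("power-hungry PUs: %s\n", str);
            free(str);
        }
        hwloc_bitmap_free(cpuset);
    }
    hwloc_topology_destroy(topo);
    return 0;
}
```

The output is machine-dependent (on a homogeneous x86 box it will report a single kind), so treat this as a probe rather than something with a fixed expected result.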
there is something fishy here: according to the description, it should be 16 cores (8+8, unlike 8+4 I wrote earlier) and 24 threads (8*2+8), but Open MPI does not report this.
You cannot trust those dots, @ggouaillardet - the print statement isn't that accurate (it's actually rather dumb, to be honest).
how I can "hide" the E cores from Open MPI.
I already told you - you just have to list the PEs you want to use. It would take a significant change to PRRTE (or ORTE for an older version) to try and do this internally. I doubt it would be worth the effort - like I said, these chips are not intended for HPC, and won't run well in that environment.
Thanks for helping me to post my question here. I didn't intend to do a real HPC job on this laptop but want to take advantage of the multiple cores to speed up some data processing (40k+ satellite data files and 200k+ model output files, less than 100M each). The processing is pretty repetitive and is perfect for lazy parallelization. The issue is that OpenMPI does not recognize the cores correctly, so I am not sure how it does the scheduling. OpenMPI complains when I set -np to more than 12. I don't want to have more threads on a single core, especially an e-core.
It would be great if I can use all 16 cores. If not, having some control over which cores to use would be ideal, for example, use p-cores for faster processing and e-cores for thermal concerns.
Yeah, I kinda figured that was the situation. You have a few simple options:
first, add --use-hwthread-cpus to the mpirun cmd line. This will even the playing field between the processor types. By default, you'll bind each proc to a single thread, which means you can run up to 24 procs (the 8 p-cores have 16 threads, plus 8 single-thread e-cores). If you need 2 threads/proc, then tell mpirun to bind 2 cpus/rank: --map-by hwthread:pe=2. This will bind one proc to each p-core, and one proc to each pair of e-cores. It will limit you to 12 procs, though.
if you want to run up to 16 procs, on uneven numbers of threads, then you could try this: mpirun --map-by hwthread:pe=2 -np 8 myapp : --map-by hwthread -np 8 myapp. The first context should use both threads of each p-core, while the second context should use the single thread on each e-core (since the p-cores are all used up). Note that your performance won't be great, as the procs will significantly differ in their behavior.
if you want more than 12 procs, you could just tell us not to bind at all: --map-by hwthread:oversubscribe --bind-to none. You lose a touch of performance due to not binding, but you aren't going after great performance here anyway, and this lets the OS schedule the thread usage. Much simpler, and the OS knows better how to use the different cores than anything we can provide.
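For reference, a sketch of the three options above as full command lines (the -np counts and the myapp binary name are illustrative placeholders, not part of the original suggestions):

```shell
# option 1: treat each hyperthread as a cpu, one thread per proc
mpirun --use-hwthread-cpus -np 24 myapp

# option 1b: same, but bind 2 HTs per rank (limits you to 12 procs)
mpirun --use-hwthread-cpus --map-by hwthread:pe=2 -np 12 myapp

# option 2: 8 two-thread procs on the p-cores, then 8 one-thread procs
mpirun --map-by hwthread:pe=2 -np 8 myapp : --map-by hwthread -np 8 myapp

# option 3: don't bind at all and let the OS schedule everything
mpirun --map-by hwthread:oversubscribe --bind-to none -np 16 myapp
```

These obviously only behave as described on a hybrid 8P+8E machine with the MPI runtime discussed in this thread.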
Thanks for the reply, although I am not sure I can follow. What confuses me is that OpenMPI (and/or Ubuntu 22.04) can only see 12 cores (12x2 threads = 24) although there are actually 16 cores (8x2+8x1 = 24 threads). If it gets the total number of cores wrong, it may mess up the scheduling too (missing four cores).
If OpenMPI can only see 12 cores, I assume mpirun -np 12 myapp should use the 12 cores with one process per core.
What if I want to use all 16 cores, with one process on each? OpenMPI complains if I use mpirun -n 16 directly because it only sees 12 cores. If I use mpirun -n 16 --map-by hwthread:oversubscribe --bind-to none myapp, will this be one process per core? I am worried that the OS or OpenMPI will only use the 12 cores it can see and put multiple threads on some of them, p-cores or e-cores.
Another question is why?
You are overthinking things 😄
If you simply run mpirun -n 100 --oversubscribe you will launch 100 processes, none of them bound to any particular core. The OS will schedule as many of them at a time as it can fit onto CPUs, cycling time slices across all the procs in some kind of load-balanced manner. It will do this in a way that balances thermal load while providing the best possible use of the cpu cycles.
You shouldn't care what hyperthread gets used for any given time slice by whatever process is being executed during that time slice. The OS will manage all that for you. This is what the OS does really well.
Trying to do any better than that is a waste of your energy. It doesn't matter what mpirun "sees" or doesn't "see". Its sole purpose is to start N procs, and then get out of the way and let the OS do its thing. Asking mpirun to try and optimize placement and binding on this kind of processor will only yield worse results.
Thanks, @rhc54 . I was worried that the OS is confused too because the Ubuntu 22.04 (5.15.79.1-microsoft-standard-WSL2) also sees only 12 cores (24 threads), although the host Windows 11 recognizes the CPU correctly.
lscpu returns the following:
Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           39 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  24
On-line CPU(s) list:     0-23
Vendor ID:               GenuineIntel
Model name:              12th Gen Intel(R) Core(TM) i7-12800HX
CPU family:              6
Model:                   151
Thread(s) per core:      2
Core(s) per socket:      12
Socket(s):               1
Stepping:                2
BogoMIPS:                4608.01
Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm serialize flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
  Hypervisor vendor:     Microsoft
  Virtualization type:   full
Caches (sum of all):
  L1d:                   576 KiB (12 instances)
  L1i:                   384 KiB (12 instances)
  L2:                    15 MiB (12 instances)
  L3:                    25 MiB (1 instance)
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Mitigation; Enhanced IBRS
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected
Understood. The problem is that we cannot do any better than your OS is doing. No matter what options you pass to mpirun, I'm limited to what the OS thinks is present.
What you are seeing is the difference between Windows (being optimized to work with this architecture) and Ubuntu (which isn't). There is nothing anyone can do about that, I'm afraid - unless someone at Ubuntu wants to optimize the OS for this architecture, which I very much doubt.
Your only other option would be to switch to Microsoft's MPI, which operates under Windows. I don't know their licensing structure and it has been a long time since I heard from that team (so this product might not even still exist) - but if you can get a copy, that would support this chip.
Otherwise, the best you can do is like I said - just run it oversubscribed (with however many procs you think can run effectively - probably an experiment) and let the OS do the best it can.
Are you running native Linux? Or are you running Linux in a virtual machine (or WSL)?
If the latter, that could explain why hwloc believes this is a 12-core / 24-hyperthread system.
Virtualization: VT-x Hypervisor vendor: Microsoft Virtualization type: full
@ggouaillardet I am using Ubuntu 22.04 in WSL2. The kernel version is 5.15.79.1-microsoft-standard-WSL2. I checked again using an Ubuntu 22.04 USB boot drive, and it indeed sees all 16 cores. I thought WSL only limited the amount of RAM, not the number of cores.
Last time I saw hwloc running on WSL on Windows, Windows/Linux was reporting correct information in sysfs, hence hwloc too. But I never tried on a hybrid machine. What's wrong above is lscpu. Either because Windows/Linux reports something wrong, or because lscpu isn't hybrid-aware yet: it sees 24 threads in the socket, 2 threads in the first core, and decides that means 24/2 = 12 cores. Running lstopo would clarify this. Or at least cat /sys/devices/system/cpu/cpu*/topology/thread_siblings
I'm not sure I agree with the assertion that lscpu is doing something "wrong". WSL isn't "limiting" the number of cores - it is simply logically grouping the available hyperthreads into two-HT "cores" - i.e., you have 12 "cores", each with 2 HTs. Native Ubuntu is logically grouping them into 8 "cores" each with 2 HTs, and 8 "cores" each with 1 HT. It all just depends on how the OS intends to manage/schedule the HTs. Neither is "correct" nor "wrong" - they are just grouped differently.
If you have hyperthreading enabled (which you kinda have to do with this processor), it really won't matter as the kernel scheduling will be done at the HT level - so how they are grouped is irrelevant. What matters is if and how the kernel is scheduling the p-cores differently from the e-cores.
IIRC, Windows was customized to put compute-heavy process threads on the p-cores, and lighter operations on the e-cores. So as your job continued executing, it would migrate the more intense operations to the p-cores (e.g., your computational threads) and the less intense ones to the e-cores (e.g., threads performing disk IO, progress threads that tend to utilize little time, system daemons).
I'm not sure how Ubuntu is optimized - probably not as well customized, so it may just treat everything as equal and schedule a thread for execution on whatever hyperthread is available. Or it may do something similar to what Windows is doing.
Point being: the processor was designed with the expectation that the OS would migrate process threads to the "proper" HT for the kind of operations being performed. In this architecture, the worst thing you can do is try to preempt that migration. Best to just let the OS do its job. You just need to add the "oversubscribe" qualifier to the --map-by directive so that mpirun won't error out if you launch more procs than there are "cores" (or HTs, if you pass the --use-hwthread-cpus option).
@bgoglin I think you are right that it might not be due to WSL limiting the number of available cores. If WSL limited the number of cores, it shouldn't see 24 threads. But lscpu returns the correct number of cores with an Ubuntu 22.04 USB boot drive. So something else is wrong, and it affects the number of cores available to OpenMPI under WSL.
The number of threads (24) is correct, so WSL is not limiting anything. But the topology might be altered (IIRC I saw that with KVM or VirtualBox): it shows 12x2 instead of 8x2+8.
Here is the info I requested on SO:
$ lstopo-no-graphics --version
$ lstopo-no-graphics --cpukinds
$ lstopo-no-graphics --no-io --of xml
it affects the number of cores available to OpenMPI under WSL
We seem to be spending a lot of time chasing ghosts on this thread, so I'll just crawl back under my rock. There is no limitation being set here. OMPI sees the same number of HTs on each system you have tried. mpirun just needs to be told to consider HTs as independent cpus so it can set oversubscription correctly. You don't want to bind your procs - you need to let the OS manage them for you. That is how the processor was designed to be used.
<me pulling the rock over my head>
lstopo_output.log
@bgoglin
$ lstopo-no-graphics --version
returns:
lstopo-no-graphics 2.7.0
$ lstopo-no-graphics --cpukinds
returns nothing.
$ lstopo-no-graphics --no-io --of xml
output is attached in the log file.
@rhc54 Thanks a lot for the explanations! I think I am more at ease when I use the OpenMPI on this machine now.
thanks, I confirm hwloc sees a single socket with 12 cores and 2 hyperthreads per core, so I guess WSL does not "pass through" the actual processor topology.
So I am afraid there is no trivial way to use the 8P + 8E cores (e.g. ignoring the second hyperthread on the P cores).
Bottom line, run mpirun --use-hwthread-cpus --bind-to none and let the OS (Linux via WSL) schedule the MPI tasks.
@ggouaillardet Thanks a ton! This helps a lot.
I just saw this question in Stack Overflow
https://stackoverflow.com/questions/75240988/openmpi-and-ubuntu-22-04-support-for-using-all-e-and-p-cores-on-12th-gen-intel-c
TL;DR: on a system with 8 P cores (2 threads each) and 8 E cores (1 thread each), is there an (ideally user-friendly) way to tell Open MPI to only use the P cores?
@bgoglin what kind of support is provided by hwloc with respect to P vs E cores?