If I run `papi_hardware_avail` under `gdb` on that CPU and kill it after some time, it is here (note that it tries to get data for ID 48 even though the system only has IDs 0 to 47):

```
Program received signal SIGINT, Interrupt.
__GI___access (file=0x5555559025c0 <pathbuf> "/sys/devices/system//cpu/cpu48/node876825", type=0) at ../sysdeps/unix/sysv/linux/access.c:27
27 ../sysdeps/unix/sysv/linux/access.c: No such file or directory.
(gdb) bt
#0 __GI___access (file=0x5555559025c0 <pathbuf> "/sys/devices/system//cpu/cpu48/node876825", type=0) at ../sysdeps/unix/sysv/linux/access.c:27
#1 0x00005555556bd6e4 in path_exist (path=0x5555557dd9f0 "/sys/devices/system//cpu/cpu%d/node%d") at components/sysdetect/linux_cpu_utils.c:859
#2 0x00005555556bd2ba in get_thread_affinity (thread=48, val=0x55555591d380) at components/sysdetect/linux_cpu_utils.c:770
#3 0x00005555556bafbf in linux_cpu_get_attribute_at (attr=CPU_ATTR__HWTHREAD_NUMA_AFFINITY, loc=48, value=0x55555591d380) at components/sysdetect/linux_cpu_utils.c:251
#4 0x00005555556baa24 in os_cpu_get_attribute_at (attr=CPU_ATTR__HWTHREAD_NUMA_AFFINITY, loc=48, value=0x55555591d380) at components/sysdetect/os_cpu_utils.c:45
#5 0x00005555556a0010 in cpuid_get_attribute_at (attr=CPU_ATTR__HWTHREAD_NUMA_AFFINITY, loc=48, value=0x55555591d380) at components/sysdetect/x86_cpu_utils.c:202
#6 0x000055555569fd6c in x86_cpu_get_attribute_at (attr=CPU_ATTR__HWTHREAD_NUMA_AFFINITY, loc=48, value=0x55555591d380) at components/sysdetect/x86_cpu_utils.c:102
#7 0x000055555569f4cd in cpu_get_attribute_at (attr=CPU_ATTR__HWTHREAD_NUMA_AFFINITY, loc=48, value=0x55555591d380) at components/sysdetect/cpu_utils.c:71
#8 0x000055555569efd2 in fill_cpu_info (info=0x55555591d010) at components/sysdetect/cpu.c:128
#9 0x000055555569f406 in open_cpu_dev_type (dev_type_info=0x555555908600 <dev_type_info_arr>) at components/sysdetect/cpu.c:151
#10 0x000055555569bc57 in init_dev_info () at components/sysdetect/sysdetect.c:61
#11 0x000055555569bef3 in _sysdetect_init_private () at components/sysdetect/sysdetect.c:124
#12 0x000055555569bf25 in _sysdetect_user (unused=0, in=0x7fffffff4860, out=0x7fffffff4a18) at components/sysdetect/sysdetect.c:135
#13 0x000055555567992a in _papi_hwi_enum_dev_type (enum_modifier=7, handle=0x7fffffff4a18) at papi_internal.c:2767
#14 0x000055555566ce19 in PAPI_enum_dev_type (enum_modifier=7, handle=0x7fffffff4a18) at papi.c:7333
#15 0x000055555564c164 in main ()
```
This is the output of `lstopo` for this CPU (physical indices):

```
Machine (70GB total)
Package P#0
NUMANode P#0 (47GB)
L3 P#0 (19MB)
L2 P#0 (1024KB) + L1d P#0 (32KB) + L1i P#0 (32KB) + Core P#0
PU P#0
PU P#24
L2 P#1 (1024KB) + L1d P#1 (32KB) + L1i P#1 (32KB) + Core P#1
PU P#1
PU P#25
L2 P#2 (1024KB) + L1d P#2 (32KB) + L1i P#2 (32KB) + Core P#2
PU P#2
PU P#26
L2 P#3 (1024KB) + L1d P#3 (32KB) + L1i P#3 (32KB) + Core P#3
PU P#3
PU P#27
L2 P#4 (1024KB) + L1d P#4 (32KB) + L1i P#4 (32KB) + Core P#4
PU P#4
PU P#28
L2 P#5 (1024KB) + L1d P#5 (32KB) + L1i P#5 (32KB) + Core P#5
PU P#5
PU P#29
L2 P#6 (1024KB) + L1d P#6 (32KB) + L1i P#6 (32KB) + Core P#6
PU P#6
PU P#30
L2 P#8 (1024KB) + L1d P#8 (32KB) + L1i P#8 (32KB) + Core P#8
PU P#7
PU P#31
L2 P#10 (1024KB) + L1d P#10 (32KB) + L1i P#10 (32KB) + Core P#10
PU P#8
PU P#32
L2 P#11 (1024KB) + L1d P#11 (32KB) + L1i P#11 (32KB) + Core P#11
PU P#9
PU P#33
L2 P#12 (1024KB) + L1d P#12 (32KB) + L1i P#12 (32KB) + Core P#12
PU P#10
PU P#34
L2 P#13 (1024KB) + L1d P#13 (32KB) + L1i P#13 (32KB) + Core P#13
PU P#11
PU P#35
HostBridge
PCI 00:17.0 (SATA)
PCIBridge
PCIBridge
PCI 02:00.0 (VGA)
PCIBridge
PCI 03:00.0 (SATA)
PCIBridge
PCI 05:00.0 (Ethernet)
Net "enp5s0"
PCIBridge
PCI 06:00.0 (Ethernet)
Net "enp6s0"
PCIBridge
PCI 07:00.0 (NVMExp)
Block(Disk) "nvme0n1"
HostBridge
PCIBridge
PCI 5e:00.0 (VGA)
CoProc(OpenCL) "opencl0d0"
Package P#1
NUMANode P#1 (24GB)
L3 P#1 (19MB)
L2 P#17 (1024KB) + L1d P#17 (32KB) + L1i P#17 (32KB) + Core P#1
PU P#12
PU P#36
L2 P#18 (1024KB) + L1d P#18 (32KB) + L1i P#18 (32KB) + Core P#2
PU P#13
PU P#37
L2 P#19 (1024KB) + L1d P#19 (32KB) + L1i P#19 (32KB) + Core P#3
PU P#14
PU P#38
L2 P#20 (1024KB) + L1d P#20 (32KB) + L1i P#20 (32KB) + Core P#4
PU P#15
PU P#39
L2 P#21 (1024KB) + L1d P#21 (32KB) + L1i P#21 (32KB) + Core P#5
PU P#16
PU P#40
L2 P#22 (1024KB) + L1d P#22 (32KB) + L1i P#22 (32KB) + Core P#6
PU P#17
PU P#41
L2 P#24 (1024KB) + L1d P#24 (32KB) + L1i P#24 (32KB) + Core P#8
PU P#18
PU P#42
L2 P#25 (1024KB) + L1d P#25 (32KB) + L1i P#25 (32KB) + Core P#9
PU P#19
PU P#43
L2 P#26 (1024KB) + L1d P#26 (32KB) + L1i P#26 (32KB) + Core P#10
PU P#20
PU P#44
L2 P#27 (1024KB) + L1d P#27 (32KB) + L1i P#27 (32KB) + Core P#11
PU P#21
PU P#45
L2 P#28 (1024KB) + L1d P#28 (32KB) + L1i P#28 (32KB) + Core P#12
PU P#22
PU P#46
L2 P#29 (1024KB) + L1d P#29 (32KB) + L1i P#29 (32KB) + Core P#13
PU P#23
PU P#47
```
We are observing an infinite loop at https://github.com/icl-utk-edu/papi/blob/afeb05966e68973a84ed58e80cf5515fcdb2dc0f/src/components/sysdetect/linux_cpu_utils.c#L769-L770 on this CPU.
What happens is that `papi` wrongly identifies the CPU as having 13 cores per socket. After investigation, this is clearly caused by the mismatch between logical and physical core IDs.
The part of the code that computes the socket/core/thread counts cannot handle such a case; see https://github.com/icl-utk-edu/papi/blob/afeb05966e68973a84ed58e80cf5515fcdb2dc0f/src/components/sysdetect/x86_cpu_utils.c#L504-L510. These lines are wrong for this CPU: core ID 13 is filled in by the first socket, which lacks core ID 9, and core ID 9 is filled in by the second socket, which lacks core ID 7. As a result, 13 cores are counted even though only 12 are enabled per socket.