icl-utk-edu / papi

`papi` enters an infinite loop when logical and physical core IDs disagree #241

Open romintomasetti opened 5 hours ago

romintomasetti commented 5 hours ago

We are observing an infinite loop at https://github.com/icl-utk-edu/papi/blob/afeb05966e68973a84ed58e80cf5515fcdb2dc0f/src/components/sysdetect/linux_cpu_utils.c#L769-L770 for the following CPU:

Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
    CPU family:           6
    Model:                85
    Thread(s) per core:   2
    Core(s) per socket:   12
    Socket(s):            2

What happens is that papi wrongly reports 13 cores per socket for this CPU instead of 12.

After investigation, this is caused by the mismatch between logical and physical core IDs.

The code that returns the socket/core/thread counts cannot handle this case. See https://github.com/icl-utk-edu/papi/blob/afeb05966e68973a84ed58e80cf5515fcdb2dc0f/src/components/sysdetect/x86_cpu_utils.c#L504-L510. These lines are wrong for this CPU because the physical core IDs are sparse and differ between the two sockets: in the lstopo output below, the first socket has no core ID 7 or 9, and the second socket has no core ID 0 or 7. Core ID 0 is only provided by the first socket and core ID 9 only by the second, so 13 distinct core IDs (0-6 and 8-13) are seen in total. This leads to 13 cores being counted per socket, though only 12 are enabled.
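To make the miscount concrete, here is a minimal Python sketch (not papi code) that uses the core IDs from the lstopo output in this report. The variable names are illustrative, and the idea that the count comes from pooling core IDs across sockets is an assumption based on the symptom, not something checked against the linked lines.

```python
# Physical core IDs per package, copied from the lstopo output in this report.
core_ids = {
    0: [0, 1, 2, 3, 4, 5, 6, 8, 10, 11, 12, 13],
    1: [1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13],
}

# Counting each package separately gives the number of enabled cores per socket.
cores_per_socket = {pkg: len(set(ids)) for pkg, ids in core_ids.items()}

# Pooling core IDs across packages (the assumed failure mode) overcounts.
distinct_ids = set().union(*core_ids.values())

# Counting (package, core ID) pairs keeps cores from different sockets apart.
total_cores = len({(pkg, cid) for pkg, ids in core_ids.items() for cid in ids})

print(cores_per_socket)   # {0: 12, 1: 12}
print(len(distinct_ids))  # 13
print(total_cores)        # 24
```

With 13 cores per socket, 2 sockets and 2 threads per core, the derived thread count would be 52 rather than 48, which is consistent with thread 48 being queried in the backtrace.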

Running papi_hardware_avail under gdb on that CPU and interrupting it after some time shows where it is stuck. Note that it queries hardware thread 48, although the system only has threads 0 to 47:

Program received signal SIGINT, Interrupt.
__GI___access (file=0x5555559025c0 <pathbuf> "/sys/devices/system//cpu/cpu48/node876825", type=0) at ../sysdeps/unix/sysv/linux/access.c:27
27      ../sysdeps/unix/sysv/linux/access.c: No such file or directory.
(gdb) bt
#0  __GI___access (file=0x5555559025c0 <pathbuf> "/sys/devices/system//cpu/cpu48/node876825", type=0) at ../sysdeps/unix/sysv/linux/access.c:27
#1  0x00005555556bd6e4 in path_exist (path=0x5555557dd9f0 "/sys/devices/system//cpu/cpu%d/node%d") at components/sysdetect/linux_cpu_utils.c:859
#2  0x00005555556bd2ba in get_thread_affinity (thread=48, val=0x55555591d380) at components/sysdetect/linux_cpu_utils.c:770
#3  0x00005555556bafbf in linux_cpu_get_attribute_at (attr=CPU_ATTR__HWTHREAD_NUMA_AFFINITY, loc=48, value=0x55555591d380) at components/sysdetect/linux_cpu_utils.c:251
#4  0x00005555556baa24 in os_cpu_get_attribute_at (attr=CPU_ATTR__HWTHREAD_NUMA_AFFINITY, loc=48, value=0x55555591d380) at components/sysdetect/os_cpu_utils.c:45
#5  0x00005555556a0010 in cpuid_get_attribute_at (attr=CPU_ATTR__HWTHREAD_NUMA_AFFINITY, loc=48, value=0x55555591d380) at components/sysdetect/x86_cpu_utils.c:202
#6  0x000055555569fd6c in x86_cpu_get_attribute_at (attr=CPU_ATTR__HWTHREAD_NUMA_AFFINITY, loc=48, value=0x55555591d380) at components/sysdetect/x86_cpu_utils.c:102
#7  0x000055555569f4cd in cpu_get_attribute_at (attr=CPU_ATTR__HWTHREAD_NUMA_AFFINITY, loc=48, value=0x55555591d380) at components/sysdetect/cpu_utils.c:71
#8  0x000055555569efd2 in fill_cpu_info (info=0x55555591d010) at components/sysdetect/cpu.c:128
#9  0x000055555569f406 in open_cpu_dev_type (dev_type_info=0x555555908600 <dev_type_info_arr>) at components/sysdetect/cpu.c:151
#10 0x000055555569bc57 in init_dev_info () at components/sysdetect/sysdetect.c:61
#11 0x000055555569bef3 in _sysdetect_init_private () at components/sysdetect/sysdetect.c:124
#12 0x000055555569bf25 in _sysdetect_user (unused=0, in=0x7fffffff4860, out=0x7fffffff4a18) at components/sysdetect/sysdetect.c:135
#13 0x000055555567992a in _papi_hwi_enum_dev_type (enum_modifier=7, handle=0x7fffffff4a18) at papi_internal.c:2767
#14 0x000055555566ce19 in PAPI_enum_dev_type (enum_modifier=7, handle=0x7fffffff4a18) at papi.c:7333
#15 0x000055555564c164 in main ()
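The path in frame #0 (cpu48/node876825) suggests that the lookup keeps incrementing the node index until a matching sysfs entry exists, which can never happen when the CPU directory itself is absent. A bounded variant of such a probe could look like the Python sketch below. It is not the papi implementation: find_numa_node, sysfs_cpu_root and max_nodes are hypothetical names, and the limit of 1024 nodes is an arbitrary assumption.

```python
import os


def find_numa_node(sysfs_cpu_root, thread, max_nodes=1024):
    """Return the NUMA node index of a hardware thread, or None if not found."""
    cpu_dir = os.path.join(sysfs_cpu_root, f"cpu{thread}")
    # A missing CPU directory can never contain a nodeN entry, so stop early.
    if not os.path.isdir(cpu_dir):
        return None
    # Probe nodeN entries, but give up after max_nodes (hypothetical bound).
    for node in range(max_nodes):
        if os.path.exists(os.path.join(cpu_dir, f"node{node}")):
            return node
    return None


# Returns None instead of looping forever when cpu48 does not exist.
print(find_numa_node("/sys/devices/system/cpu", 48))
```

Either check alone (the directory test or the upper bound) would be enough to turn the hang into a reportable error.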

This is the output of lstopo for this CPU (physical indices):

Machine (70GB total)
  Package P#0
    NUMANode P#0 (47GB)
    L3 P#0 (19MB)
      L2 P#0 (1024KB) + L1d P#0 (32KB) + L1i P#0 (32KB) + Core P#0
        PU P#0
        PU P#24
      L2 P#1 (1024KB) + L1d P#1 (32KB) + L1i P#1 (32KB) + Core P#1
        PU P#1
        PU P#25
      L2 P#2 (1024KB) + L1d P#2 (32KB) + L1i P#2 (32KB) + Core P#2
        PU P#2
        PU P#26
      L2 P#3 (1024KB) + L1d P#3 (32KB) + L1i P#3 (32KB) + Core P#3
        PU P#3
        PU P#27
      L2 P#4 (1024KB) + L1d P#4 (32KB) + L1i P#4 (32KB) + Core P#4
        PU P#4
        PU P#28
      L2 P#5 (1024KB) + L1d P#5 (32KB) + L1i P#5 (32KB) + Core P#5
        PU P#5
        PU P#29
      L2 P#6 (1024KB) + L1d P#6 (32KB) + L1i P#6 (32KB) + Core P#6
        PU P#6
        PU P#30
      L2 P#8 (1024KB) + L1d P#8 (32KB) + L1i P#8 (32KB) + Core P#8
        PU P#7
        PU P#31
      L2 P#10 (1024KB) + L1d P#10 (32KB) + L1i P#10 (32KB) + Core P#10
        PU P#8
        PU P#32
      L2 P#11 (1024KB) + L1d P#11 (32KB) + L1i P#11 (32KB) + Core P#11
        PU P#9
        PU P#33
      L2 P#12 (1024KB) + L1d P#12 (32KB) + L1i P#12 (32KB) + Core P#12
        PU P#10
        PU P#34
      L2 P#13 (1024KB) + L1d P#13 (32KB) + L1i P#13 (32KB) + Core P#13
        PU P#11
        PU P#35
    HostBridge
      PCI 00:17.0 (SATA)
      PCIBridge
        PCIBridge
          PCI 02:00.0 (VGA)
      PCIBridge
        PCI 03:00.0 (SATA)
      PCIBridge
        PCI 05:00.0 (Ethernet)
          Net "enp5s0"
      PCIBridge
        PCI 06:00.0 (Ethernet)
          Net "enp6s0"
      PCIBridge
        PCI 07:00.0 (NVMExp)
          Block(Disk) "nvme0n1"
    HostBridge
      PCIBridge
        PCI 5e:00.0 (VGA)
          CoProc(OpenCL) "opencl0d0"
  Package P#1
    NUMANode P#1 (24GB)
    L3 P#1 (19MB)
      L2 P#17 (1024KB) + L1d P#17 (32KB) + L1i P#17 (32KB) + Core P#1
        PU P#12
        PU P#36
      L2 P#18 (1024KB) + L1d P#18 (32KB) + L1i P#18 (32KB) + Core P#2
        PU P#13
        PU P#37
      L2 P#19 (1024KB) + L1d P#19 (32KB) + L1i P#19 (32KB) + Core P#3
        PU P#14
        PU P#38
      L2 P#20 (1024KB) + L1d P#20 (32KB) + L1i P#20 (32KB) + Core P#4
        PU P#15
        PU P#39
      L2 P#21 (1024KB) + L1d P#21 (32KB) + L1i P#21 (32KB) + Core P#5
        PU P#16
        PU P#40
      L2 P#22 (1024KB) + L1d P#22 (32KB) + L1i P#22 (32KB) + Core P#6
        PU P#17
        PU P#41
      L2 P#24 (1024KB) + L1d P#24 (32KB) + L1i P#24 (32KB) + Core P#8
        PU P#18
        PU P#42
      L2 P#25 (1024KB) + L1d P#25 (32KB) + L1i P#25 (32KB) + Core P#9
        PU P#19
        PU P#43
      L2 P#26 (1024KB) + L1d P#26 (32KB) + L1i P#26 (32KB) + Core P#10
        PU P#20
        PU P#44
      L2 P#27 (1024KB) + L1d P#27 (32KB) + L1i P#27 (32KB) + Core P#11
        PU P#21
        PU P#45
      L2 P#28 (1024KB) + L1d P#28 (32KB) + L1i P#28 (32KB) + Core P#12
        PU P#22
        PU P#46
      L2 P#29 (1024KB) + L1d P#29 (32KB) + L1i P#29 (32KB) + Core P#13
        PU P#23
        PU P#47
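The package and core IDs printed by lstopo above are also exposed per hardware thread in sysfs, through the topology/physical_package_id and topology/core_id files. As a rough sketch of a count that tolerates sparse core IDs, the Python snippet below groups core IDs by package. The function name cores_per_package is hypothetical, and this is only an illustration of the idea, not a proposed patch.

```python
import glob
import os
from collections import defaultdict


def cores_per_package(sysfs_cpu_root="/sys/devices/system/cpu"):
    """Count distinct physical core IDs per package from sysfs topology files."""
    cores = defaultdict(set)
    # cpu[0-9]* matches cpu0, cpu1, ... and skips entries such as cpufreq.
    for cpu_dir in glob.glob(os.path.join(sysfs_cpu_root, "cpu[0-9]*")):
        topo = os.path.join(cpu_dir, "topology")
        try:
            with open(os.path.join(topo, "physical_package_id")) as f:
                pkg = int(f.read())
            with open(os.path.join(topo, "core_id")) as f:
                core = int(f.read())
        except (OSError, ValueError):
            continue  # offline CPU or unexpected layout: skip it
        # Sibling hyperthreads share (package, core ID) and collapse in the set.
        cores[pkg].add(core)
    return {pkg: len(ids) for pkg, ids in cores.items()}


print(cores_per_package())
```

Because the IDs are stored in a set per package, gaps such as the missing core ID 7 do not affect the count.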

romintomasetti commented 5 hours ago

Tagging @maartenarnst because he will be interested.