pmodels / mpich

Official MPICH Repository
http://www.mpich.org

MPICH cpu-binding possible issue / unexpected behavior #7067

Closed colleeneb closed 1 week ago

colleeneb commented 1 month ago

Summary

This is to report a possible issue / unexpected behavior with AMD-GPU-aware MPICH on a Supermicro AS-4124GQ-TNMI (2x AMD EPYC 7713 with 4 AMD Instinct MI250). @raffenet kindly built MPICH there and it works fine, but we noticed some odd behavior with the CPU binding. I was trying to bind each rank to a consecutive block of 16 cores, i.e. rank 0 to cores 0-15 and rank 1 to cores 16-31. From my understanding this should be mpirun -n 2 -bind-to user:0-15,16-31. When I tried this, however, the output of HYDRA_TOPO_DEBUG looks correct, but a code we’ve used before to check affinity (https://github.com/argonne-lcf/GettingStarted/blob/master/Examples/Theta/affinity/main.cpp) reports that the ranks are not bound correctly, and the result changes from run to run. We use the affinity code at ALCF on many systems, so it's unlikely (but always possible!) that there is an issue with it. I was also able to confirm with htop that the ranks weren't running on their assigned sets of cores.
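To make the intended mapping concrete, here is a minimal Python sketch of the reading described above: commas separate per-rank entries and each entry is an inclusive core range. The helper name parse_user_binding is hypothetical, and this is not Hydra's own parser; it only encodes the interpretation given in this report.

```python
def parse_user_binding(spec):
    """Split a 'user:' binding list into one CPU set per rank.

    Assumption (from the report, not from Hydra's source): entries are
    comma-separated, one per rank, and each entry is either a single
    index ("3") or an inclusive range ("0-15").
    """
    per_rank = []
    for entry in spec.split(","):
        lo, sep, hi = entry.partition("-")
        first = int(lo)
        last = int(hi) if sep else first
        if last < first:
            raise ValueError(f"descending range: {entry!r}")
        per_rank.append(set(range(first, last + 1)))
    return per_rank


if __name__ == "__main__":
    for rank, cpus in enumerate(parse_user_binding("0-15,16-31")):
        print(f"rank {rank}: {min(cpus)}-{max(cpus)} ({len(cpus)} CPUs)")
```

Under that reading, rank 0 gets 16 CPUs starting at 0 and rank 1 gets 16 CPUs starting at 16, which is what the affinity checker is expected to report.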

Reproducer

(the module is specific to our system but it loads MPICH with AMD GPU support)

module load mpich/4.2.2-rocm6.1-gcc

wget https://raw.githubusercontent.com/argonne-lcf/GettingStarted/master/Examples/Theta/affinity/main.cpp

MPICH_CXX=hipcc mpicxx -fopenmp main.cpp

export OMP_NUM_THREADS=1

HYDRA_TOPO_DEBUG=1 mpirun -n 2 -bind-to user:0-15,16-31 ./a.out

HYDRA_TOPO_DEBUG=1 mpirun -n 2 -bind-to user:0-15,16-31 ./a.out

Expected Output

We expect the affinity code to print list_cores= (0-15) for rank 0 and list_cores= (16-31) for rank 1, like:

[proxy:0@amdgpu04] created hwloc xml file /tmp/hydra_hwloc_xmlfile_tDWry3
process 0 binding: 11111111111111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000000000000
process 1 binding: 00000000000000001111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000000000000
To affinity and beyond!! nname= amdgpu04  rnk= 0  tid= 0: list_cores= (0-15)

To affinity and beyond!! nname= amdgpu04  rnk= 1  tid= 0: list_cores= (16-31)

[proxy:0@amdgpu04] removed file /tmp/hydra_hwloc_xmlfile_tDWry3
[proxy:0@amdgpu04] created hwloc xml file /tmp/hydra_hwloc_xmlfile_3uSoR3
process 0 binding: 11111111111111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000000000000
process 1 binding: 00000000000000001111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000000000000
To affinity and beyond!! nname= amdgpu04  rnk= 0  tid= 0: list_cores= (0-15)

To affinity and beyond!! nname= amdgpu04  rnk= 1  tid= 0: list_cores= (16-31)

[proxy:0@amdgpu04] removed file /tmp/hydra_hwloc_xmlfile_3uSoR3
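The "process N binding:" lines are long strings of 0s and 1s, which are hard to compare by eye. The following Python sketch turns such a string into an index range. It assumes position 0 is the leftmost character, and it does not settle whether those positions are hwloc logical indices or OS indices, since the thread does not say. Both function names are hypothetical.

```python
def decode_binding_mask(mask):
    """Return the sorted positions of '1' characters in a binding mask.

    Assumption: position 0 is the leftmost character. Backslashes and line
    breaks left over from terminal wrapping are dropped first.
    """
    cleaned = "".join(ch for ch in mask if ch in "01")
    return [i for i, ch in enumerate(cleaned) if ch == "1"]


def to_ranges(indices):
    """Compress sorted integers into a string such as '0-15,32'."""
    parts = []
    start = prev = None
    for i in indices:
        if start is None:
            start = prev = i
        elif i == prev + 1:
            prev = i
        else:
            parts.append(f"{start}-{prev}" if prev > start else str(start))
            start = prev = i
    if start is not None:
        parts.append(f"{start}-{prev}" if prev > start else str(start))
    return ",".join(parts)


if __name__ == "__main__":
    # Shape of the rank 1 mask from the log: 16 zeros, 16 ones, zeros after.
    rank1 = "0" * 16 + "1" * 16 + "0" * 224
    print(to_ranges(decode_binding_mask(rank1)))
```

Applied to the two masks in the log, this gives positions 0-15 for process 0 and 16-31 for process 1, matching the requested binding.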

Actual Output

As shown below, the first run reports list_cores= (0-15) for rank 0 but list_cores= (0-255) for rank 1. The second run reports list_cores= (0-255) for rank 0 and list_cores= (16-31) for rank 1.

[proxy:0@amdgpu04] created hwloc xml file /tmp/hydra_hwloc_xmlfile_tDWry3
process 0 binding: 11111111111111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000000000000
process 1 binding: 00000000000000001111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000000000000
To affinity and beyond!! nname= amdgpu04  rnk= 0  tid= 0: list_cores= (0-15)

To affinity and beyond!! nname= amdgpu04  rnk= 1  tid= 0: list_cores= (0-255)

[proxy:0@amdgpu04] removed file /tmp/hydra_hwloc_xmlfile_tDWry3
[proxy:0@amdgpu04] created hwloc xml file /tmp/hydra_hwloc_xmlfile_3uSoR3
process 0 binding: 11111111111111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000000000000
process 1 binding: 00000000000000001111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000000000000
To affinity and beyond!! nname= amdgpu04  rnk= 0  tid= 0: list_cores= (0-255)

To affinity and beyond!! nname= amdgpu04  rnk= 1  tid= 0: list_cores= (16-31)

[proxy:0@amdgpu04] removed file /tmp/hydra_hwloc_xmlfile_3uSoR3
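For readers without the linked main.cpp at hand, the kind of check it performs can be approximated in Python. This is only an analogue, not the ALCF code: it reads the calling process's allowed CPU set with os.sched_getaffinity (Linux only) and compares it with an expected set. The affinity_report helper and the expected set for rank 0 are illustrative assumptions.

```python
import os


def affinity_report(rank, actual, expected):
    """Describe whether a rank's CPU set matches the set it should be bound to."""
    if actual == expected:
        return f"rank {rank}: OK ({len(actual)} CPUs)"
    extra = sorted(actual - expected)
    missing = sorted(expected - actual)
    return (f"rank {rank}: MISMATCH, {len(extra)} unexpected CPUs, "
            f"{len(missing)} missing CPUs")


if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    # Linux only: CPUs the calling process is currently allowed to run on.
    current = set(os.sched_getaffinity(0))
    # Hypothetical expectation for rank 0 in this report: CPUs 0-15.
    print(affinity_report(0, current, set(range(16))))
```

A rank that reports 0-255 when 16-31 was requested would show up here as a mismatch with 240 unexpected CPUs.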

hzhou commented 1 month ago

What is the output from lstopo (a utility from hwloc)?

colleeneb commented 1 month ago

Thanks! The lstopo output is:

> lstopo
Machine (502GB total)
  Package L#0
    NUMANode L#0 (P#0 251GB)
    L3 L#0 (32MB)
      L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#128)
      L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#129)
      L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#130)
      L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#131)
      L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#132)
      L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#133)
      L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#134)
      L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#135)
    L3 L#1 (32MB)
      L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#136)
      L2 L#9 (512KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#137)
      L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#138)
      L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#139)
      L2 L#12 (512KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#140)
      L2 L#13 (512KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#141)
      L2 L#14 (512KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#142)
      L2 L#15 (512KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#143)
    L3 L#2 (32MB)
      L2 L#16 (512KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
        PU L#32 (P#16)
        PU L#33 (P#144)
      L2 L#17 (512KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
        PU L#34 (P#17)
        PU L#35 (P#145)
      L2 L#18 (512KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
        PU L#36 (P#18)
        PU L#37 (P#146)
      L2 L#19 (512KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
        PU L#38 (P#19)
        PU L#39 (P#147)
      L2 L#20 (512KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
        PU L#40 (P#20)
        PU L#41 (P#148)
      L2 L#21 (512KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
        PU L#42 (P#21)
        PU L#43 (P#149)
      L2 L#22 (512KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
        PU L#44 (P#22)
        PU L#45 (P#150)
      L2 L#23 (512KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
        PU L#46 (P#23)
        PU L#47 (P#151)
    L3 L#3 (32MB)
      L2 L#24 (512KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
        PU L#48 (P#24)
        PU L#49 (P#152)
      L2 L#25 (512KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
        PU L#50 (P#25)
        PU L#51 (P#153)
      L2 L#26 (512KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
        PU L#52 (P#26)
        PU L#53 (P#154)
      L2 L#27 (512KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
        PU L#54 (P#27)
        PU L#55 (P#155)
      L2 L#28 (512KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
        PU L#56 (P#28)
        PU L#57 (P#156)
      L2 L#29 (512KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
        PU L#58 (P#29)
        PU L#59 (P#157)
      L2 L#30 (512KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
        PU L#60 (P#30)
        PU L#61 (P#158)
      L2 L#31 (512KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
        PU L#62 (P#31)
        PU L#63 (P#159)
    L3 L#4 (32MB)
      L2 L#32 (512KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32
        PU L#64 (P#32)
        PU L#65 (P#160)
      L2 L#33 (512KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33
        PU L#66 (P#33)
        PU L#67 (P#161)
      L2 L#34 (512KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34
        PU L#68 (P#34)
        PU L#69 (P#162)
      L2 L#35 (512KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35
        PU L#70 (P#35)
        PU L#71 (P#163)
      L2 L#36 (512KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36
        PU L#72 (P#36)
        PU L#73 (P#164)
      L2 L#37 (512KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37
        PU L#74 (P#37)
        PU L#75 (P#165)
      L2 L#38 (512KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38
        PU L#76 (P#38)
        PU L#77 (P#166)
      L2 L#39 (512KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39
        PU L#78 (P#39)
        PU L#79 (P#167)
    L3 L#5 (32MB)
      L2 L#40 (512KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40
        PU L#80 (P#40)
        PU L#81 (P#168)
      L2 L#41 (512KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41
        PU L#82 (P#41)
        PU L#83 (P#169)
      L2 L#42 (512KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42
        PU L#84 (P#42)
        PU L#85 (P#170)
      L2 L#43 (512KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43
        PU L#86 (P#43)
        PU L#87 (P#171)
      L2 L#44 (512KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44
        PU L#88 (P#44)
        PU L#89 (P#172)
      L2 L#45 (512KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45
        PU L#90 (P#45)
        PU L#91 (P#173)
      L2 L#46 (512KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46
        PU L#92 (P#46)
        PU L#93 (P#174)
      L2 L#47 (512KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47
        PU L#94 (P#47)
        PU L#95 (P#175)
    L3 L#6 (32MB)
      L2 L#48 (512KB) + L1d L#48 (32KB) + L1i L#48 (32KB) + Core L#48
        PU L#96 (P#48)
        PU L#97 (P#176)
      L2 L#49 (512KB) + L1d L#49 (32KB) + L1i L#49 (32KB) + Core L#49
        PU L#98 (P#49)
        PU L#99 (P#177)
      L2 L#50 (512KB) + L1d L#50 (32KB) + L1i L#50 (32KB) + Core L#50
        PU L#100 (P#50)
        PU L#101 (P#178)
      L2 L#51 (512KB) + L1d L#51 (32KB) + L1i L#51 (32KB) + Core L#51
        PU L#102 (P#51)
        PU L#103 (P#179)
      L2 L#52 (512KB) + L1d L#52 (32KB) + L1i L#52 (32KB) + Core L#52
        PU L#104 (P#52)
        PU L#105 (P#180)
      L2 L#53 (512KB) + L1d L#53 (32KB) + L1i L#53 (32KB) + Core L#53
        PU L#106 (P#53)
        PU L#107 (P#181)
      L2 L#54 (512KB) + L1d L#54 (32KB) + L1i L#54 (32KB) + Core L#54
        PU L#108 (P#54)
        PU L#109 (P#182)
      L2 L#55 (512KB) + L1d L#55 (32KB) + L1i L#55 (32KB) + Core L#55
        PU L#110 (P#55)
        PU L#111 (P#183)
    L3 L#7 (32MB)
      L2 L#56 (512KB) + L1d L#56 (32KB) + L1i L#56 (32KB) + Core L#56
        PU L#112 (P#56)
        PU L#113 (P#184)
      L2 L#57 (512KB) + L1d L#57 (32KB) + L1i L#57 (32KB) + Core L#57
        PU L#114 (P#57)
        PU L#115 (P#185)
      L2 L#58 (512KB) + L1d L#58 (32KB) + L1i L#58 (32KB) + Core L#58
        PU L#116 (P#58)
        PU L#117 (P#186)
      L2 L#59 (512KB) + L1d L#59 (32KB) + L1i L#59 (32KB) + Core L#59
        PU L#118 (P#59)
        PU L#119 (P#187)
      L2 L#60 (512KB) + L1d L#60 (32KB) + L1i L#60 (32KB) + Core L#60
        PU L#120 (P#60)
        PU L#121 (P#188)
      L2 L#61 (512KB) + L1d L#61 (32KB) + L1i L#61 (32KB) + Core L#61
        PU L#122 (P#61)
        PU L#123 (P#189)
      L2 L#62 (512KB) + L1d L#62 (32KB) + L1i L#62 (32KB) + Core L#62
        PU L#124 (P#62)
        PU L#125 (P#190)
      L2 L#63 (512KB) + L1d L#63 (32KB) + L1i L#63 (32KB) + Core L#63
        PU L#126 (P#63)
        PU L#127 (P#191)
    HostBridge
      PCIBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCIBridge
                PCI 05:00.0 (Ethernet)
                  Net "eth0"
                PCI 05:00.1 (Ethernet)
                  Net "eth1"
              PCIBridge
                PCI 06:00.0 (NVMExp)
                  Block(Disk) "nvme0n1"
          PCIBridge
            PCIBridge
              PCIBridge
                PCIBridge
                  PCIBridge
                    PCI 11:00.0 (Display)
              PCIBridge
                PCIBridge
                  PCIBridge
                    PCI 14:00.0 (Display)
          PCIBridge
            PCI 19:00.0 (Storage)
    HostBridge
      PCIBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCIBridge
                PCIBridge
                  PCIBridge
                    PCI 31:00.0 (Display)
              PCIBridge
                PCIBridge
                  PCIBridge
                    PCI 34:00.0 (Display)
          PCIBridge
            PCI 39:00.0 (Storage)
    HostBridge
      PCIBridge
        PCI 43:00.0 (SATA)
    HostBridge
      PCIBridge
        PCIBridge
          PCI 62:00.0 (VGA)
  Package L#1
    NUMANode L#1 (P#1 251GB)
    L3 L#8 (32MB)
      L2 L#64 (512KB) + L1d L#64 (32KB) + L1i L#64 (32KB) + Core L#64
        PU L#128 (P#64)
        PU L#129 (P#192)
      L2 L#65 (512KB) + L1d L#65 (32KB) + L1i L#65 (32KB) + Core L#65
        PU L#130 (P#65)
        PU L#131 (P#193)
      L2 L#66 (512KB) + L1d L#66 (32KB) + L1i L#66 (32KB) + Core L#66
        PU L#132 (P#66)
        PU L#133 (P#194)
      L2 L#67 (512KB) + L1d L#67 (32KB) + L1i L#67 (32KB) + Core L#67
        PU L#134 (P#67)
        PU L#135 (P#195)
      L2 L#68 (512KB) + L1d L#68 (32KB) + L1i L#68 (32KB) + Core L#68
        PU L#136 (P#68)
        PU L#137 (P#196)
      L2 L#69 (512KB) + L1d L#69 (32KB) + L1i L#69 (32KB) + Core L#69
        PU L#138 (P#69)
        PU L#139 (P#197)
      L2 L#70 (512KB) + L1d L#70 (32KB) + L1i L#70 (32KB) + Core L#70
        PU L#140 (P#70)
        PU L#141 (P#198)
      L2 L#71 (512KB) + L1d L#71 (32KB) + L1i L#71 (32KB) + Core L#71
        PU L#142 (P#71)
        PU L#143 (P#199)
    L3 L#9 (32MB)
      L2 L#72 (512KB) + L1d L#72 (32KB) + L1i L#72 (32KB) + Core L#72
        PU L#144 (P#72)
        PU L#145 (P#200)
      L2 L#73 (512KB) + L1d L#73 (32KB) + L1i L#73 (32KB) + Core L#73
        PU L#146 (P#73)
        PU L#147 (P#201)
      L2 L#74 (512KB) + L1d L#74 (32KB) + L1i L#74 (32KB) + Core L#74
        PU L#148 (P#74)
        PU L#149 (P#202)
      L2 L#75 (512KB) + L1d L#75 (32KB) + L1i L#75 (32KB) + Core L#75
        PU L#150 (P#75)
        PU L#151 (P#203)
      L2 L#76 (512KB) + L1d L#76 (32KB) + L1i L#76 (32KB) + Core L#76
        PU L#152 (P#76)
        PU L#153 (P#204)
      L2 L#77 (512KB) + L1d L#77 (32KB) + L1i L#77 (32KB) + Core L#77
        PU L#154 (P#77)
        PU L#155 (P#205)
      L2 L#78 (512KB) + L1d L#78 (32KB) + L1i L#78 (32KB) + Core L#78
        PU L#156 (P#78)
        PU L#157 (P#206)
      L2 L#79 (512KB) + L1d L#79 (32KB) + L1i L#79 (32KB) + Core L#79
        PU L#158 (P#79)
        PU L#159 (P#207)
    L3 L#10 (32MB)
      L2 L#80 (512KB) + L1d L#80 (32KB) + L1i L#80 (32KB) + Core L#80
        PU L#160 (P#80)
        PU L#161 (P#208)
      L2 L#81 (512KB) + L1d L#81 (32KB) + L1i L#81 (32KB) + Core L#81
        PU L#162 (P#81)
        PU L#163 (P#209)
      L2 L#82 (512KB) + L1d L#82 (32KB) + L1i L#82 (32KB) + Core L#82
        PU L#164 (P#82)
        PU L#165 (P#210)
      L2 L#83 (512KB) + L1d L#83 (32KB) + L1i L#83 (32KB) + Core L#83
        PU L#166 (P#83)
        PU L#167 (P#211)
      L2 L#84 (512KB) + L1d L#84 (32KB) + L1i L#84 (32KB) + Core L#84
        PU L#168 (P#84)
        PU L#169 (P#212)
      L2 L#85 (512KB) + L1d L#85 (32KB) + L1i L#85 (32KB) + Core L#85
        PU L#170 (P#85)
        PU L#171 (P#213)
      L2 L#86 (512KB) + L1d L#86 (32KB) + L1i L#86 (32KB) + Core L#86
        PU L#172 (P#86)
        PU L#173 (P#214)
      L2 L#87 (512KB) + L1d L#87 (32KB) + L1i L#87 (32KB) + Core L#87
        PU L#174 (P#87)
        PU L#175 (P#215)
    L3 L#11 (32MB)
      L2 L#88 (512KB) + L1d L#88 (32KB) + L1i L#88 (32KB) + Core L#88
        PU L#176 (P#88)
        PU L#177 (P#216)
      L2 L#89 (512KB) + L1d L#89 (32KB) + L1i L#89 (32KB) + Core L#89
        PU L#178 (P#89)
        PU L#179 (P#217)
      L2 L#90 (512KB) + L1d L#90 (32KB) + L1i L#90 (32KB) + Core L#90
        PU L#180 (P#90)
        PU L#181 (P#218)
      L2 L#91 (512KB) + L1d L#91 (32KB) + L1i L#91 (32KB) + Core L#91
        PU L#182 (P#91)
        PU L#183 (P#219)
      L2 L#92 (512KB) + L1d L#92 (32KB) + L1i L#92 (32KB) + Core L#92
        PU L#184 (P#92)
        PU L#185 (P#220)
      L2 L#93 (512KB) + L1d L#93 (32KB) + L1i L#93 (32KB) + Core L#93
        PU L#186 (P#93)
        PU L#187 (P#221)
      L2 L#94 (512KB) + L1d L#94 (32KB) + L1i L#94 (32KB) + Core L#94
        PU L#188 (P#94)
        PU L#189 (P#222)
      L2 L#95 (512KB) + L1d L#95 (32KB) + L1i L#95 (32KB) + Core L#95
        PU L#190 (P#95)
        PU L#191 (P#223)
    L3 L#12 (32MB)
      L2 L#96 (512KB) + L1d L#96 (32KB) + L1i L#96 (32KB) + Core L#96
        PU L#192 (P#96)
        PU L#193 (P#224)
      L2 L#97 (512KB) + L1d L#97 (32KB) + L1i L#97 (32KB) + Core L#97
        PU L#194 (P#97)
        PU L#195 (P#225)
      L2 L#98 (512KB) + L1d L#98 (32KB) + L1i L#98 (32KB) + Core L#98
        PU L#196 (P#98)
        PU L#197 (P#226)
      L2 L#99 (512KB) + L1d L#99 (32KB) + L1i L#99 (32KB) + Core L#99
        PU L#198 (P#99)
        PU L#199 (P#227)
      L2 L#100 (512KB) + L1d L#100 (32KB) + L1i L#100 (32KB) + Core L#100
        PU L#200 (P#100)
        PU L#201 (P#228)
      L2 L#101 (512KB) + L1d L#101 (32KB) + L1i L#101 (32KB) + Core L#101
        PU L#202 (P#101)
        PU L#203 (P#229)
      L2 L#102 (512KB) + L1d L#102 (32KB) + L1i L#102 (32KB) + Core L#102
        PU L#204 (P#102)
        PU L#205 (P#230)
      L2 L#103 (512KB) + L1d L#103 (32KB) + L1i L#103 (32KB) + Core L#103
        PU L#206 (P#103)
        PU L#207 (P#231)
    L3 L#13 (32MB)
      L2 L#104 (512KB) + L1d L#104 (32KB) + L1i L#104 (32KB) + Core L#104
        PU L#208 (P#104)
        PU L#209 (P#232)
      L2 L#105 (512KB) + L1d L#105 (32KB) + L1i L#105 (32KB) + Core L#105
        PU L#210 (P#105)
        PU L#211 (P#233)
      L2 L#106 (512KB) + L1d L#106 (32KB) + L1i L#106 (32KB) + Core L#106
        PU L#212 (P#106)
        PU L#213 (P#234)
      L2 L#107 (512KB) + L1d L#107 (32KB) + L1i L#107 (32KB) + Core L#107
        PU L#214 (P#107)
        PU L#215 (P#235)
      L2 L#108 (512KB) + L1d L#108 (32KB) + L1i L#108 (32KB) + Core L#108
        PU L#216 (P#108)
        PU L#217 (P#236)
      L2 L#109 (512KB) + L1d L#109 (32KB) + L1i L#109 (32KB) + Core L#109
        PU L#218 (P#109)
        PU L#219 (P#237)
      L2 L#110 (512KB) + L1d L#110 (32KB) + L1i L#110 (32KB) + Core L#110
        PU L#220 (P#110)
        PU L#221 (P#238)
      L2 L#111 (512KB) + L1d L#111 (32KB) + L1i L#111 (32KB) + Core L#111
        PU L#222 (P#111)
        PU L#223 (P#239)
    L3 L#14 (32MB)
      L2 L#112 (512KB) + L1d L#112 (32KB) + L1i L#112 (32KB) + Core L#112
        PU L#224 (P#112)
        PU L#225 (P#240)
      L2 L#113 (512KB) + L1d L#113 (32KB) + L1i L#113 (32KB) + Core L#113
        PU L#226 (P#113)
        PU L#227 (P#241)
      L2 L#114 (512KB) + L1d L#114 (32KB) + L1i L#114 (32KB) + Core L#114
        PU L#228 (P#114)
        PU L#229 (P#242)
      L2 L#115 (512KB) + L1d L#115 (32KB) + L1i L#115 (32KB) + Core L#115
        PU L#230 (P#115)
        PU L#231 (P#243)
      L2 L#116 (512KB) + L1d L#116 (32KB) + L1i L#116 (32KB) + Core L#116
        PU L#232 (P#116)
        PU L#233 (P#244)
      L2 L#117 (512KB) + L1d L#117 (32KB) + L1i L#117 (32KB) + Core L#117
        PU L#234 (P#117)
        PU L#235 (P#245)
      L2 L#118 (512KB) + L1d L#118 (32KB) + L1i L#118 (32KB) + Core L#118
        PU L#236 (P#118)
        PU L#237 (P#246)
      L2 L#119 (512KB) + L1d L#119 (32KB) + L1i L#119 (32KB) + Core L#119
        PU L#238 (P#119)
        PU L#239 (P#247)
    L3 L#15 (32MB)
      L2 L#120 (512KB) + L1d L#120 (32KB) + L1i L#120 (32KB) + Core L#120
        PU L#240 (P#120)
        PU L#241 (P#248)
      L2 L#121 (512KB) + L1d L#121 (32KB) + L1i L#121 (32KB) + Core L#121
        PU L#242 (P#121)
        PU L#243 (P#249)
      L2 L#122 (512KB) + L1d L#122 (32KB) + L1i L#122 (32KB) + Core L#122
        PU L#244 (P#122)
        PU L#245 (P#250)
      L2 L#123 (512KB) + L1d L#123 (32KB) + L1i L#123 (32KB) + Core L#123
        PU L#246 (P#123)
        PU L#247 (P#251)
      L2 L#124 (512KB) + L1d L#124 (32KB) + L1i L#124 (32KB) + Core L#124
        PU L#248 (P#124)
        PU L#249 (P#252)
      L2 L#125 (512KB) + L1d L#125 (32KB) + L1i L#125 (32KB) + Core L#125
        PU L#250 (P#125)
        PU L#251 (P#253)
      L2 L#126 (512KB) + L1d L#126 (32KB) + L1i L#126 (32KB) + Core L#126
        PU L#252 (P#126)
        PU L#253 (P#254)
      L2 L#127 (512KB) + L1d L#127 (32KB) + L1i L#127 (32KB) + Core L#127
        PU L#254 (P#127)
        PU L#255 (P#255)
    HostBridge
      PCIBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCIBridge
                PCIBridge
                  PCIBridge
                    PCI 8e:00.0 (Display)
          PCIBridge
            PCIBridge
              PCIBridge
                PCIBridge
                  PCIBridge
                    PCI 93:00.0 (Display)
          PCIBridge
            PCI 99:00.0 (Storage)
    HostBridge
      PCIBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCIBridge
                PCI ab:00.0 (InfiniBand)
                  Net "ib0"
                  OpenFabrics "mlx5_0"
                PCI ab:00.1 (InfiniBand)
                  Net "ib1"
                  OpenFabrics "mlx5_1"
              PCIBridge
                PCIBridge
                  PCIBridge
                    PCI ae:00.0 (Display)
          PCIBridge
            PCIBridge
              PCIBridge
                PCIBridge
                  PCIBridge
                    PCI b3:00.0 (Display)
          PCIBridge
            PCI b9:00.0 (Storage)
    HostBridge
      PCIBridge
        PCI c3:00.0 (SATA)
      PCIBridge
        PCI c4:00.0 (SATA)
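lstopo labels every PU twice: L# is hwloc's logical index and P# is the OS index. In the listing above the two numberings differ, because each core holds two hardware threads whose OS indices are 128 apart. The sketch below encodes the pattern read off this listing only (Core L#c holds PU L#2c with P#c and PU L#2c+1 with P#c+128). The thread does not establish which numbering the user: list or the affinity checker uses, so treat this purely as an aid for reading the listing. The function name os_index is hypothetical.

```python
def os_index(logical_pu, cores=128):
    """Map a PU logical index (L#) to its OS index (P#) for this listing only.

    Pattern taken from the lstopo output: Core L#c holds PU L#2c (P#c)
    and PU L#2c+1 (P#c+128).
    """
    core, sibling = divmod(logical_pu, 2)
    return core + sibling * cores


if __name__ == "__main__":
    for lo, hi in [(0, 15), (16, 31)]:
        os_ids = sorted(os_index(l) for l in range(lo, hi + 1))
        print(f"L#{lo}-{hi} -> P#{os_ids}")
```

On this machine, logical PUs 0-15 correspond to OS indices 0-7 and 128-135, so "0-15" names a different set of hardware threads depending on which numbering is meant.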

hzhou commented 1 month ago

We call hwloc_set_cpubind to bind the CPUs, but it appears we do not check the return code. So the function may be failing somehow, which would result in no binding at all (i.e. 0-255). I have no clue why so far. Will investigate.
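The concern here is a bind call that fails without anyone noticing. One defensive pattern, sketched in Python rather than in Hydra's C code, is to read the affinity mask back after setting it. This is a different technique from checking the return code of hwloc_set_cpubind, and it is not what the patch does. os.sched_setaffinity and os.sched_getaffinity are Linux-only standard-library calls; bind_and_verify and its injectable parameters are hypothetical.

```python
import os


def bind_and_verify(cpus, set_affinity=None, get_affinity=None):
    """Bind the calling process to `cpus`, then read the mask back.

    Returns (ok, actual). A failed or silently ignored bind shows up as
    ok == False instead of going unnoticed.
    """
    set_affinity = set_affinity or (lambda c: os.sched_setaffinity(0, c))
    get_affinity = get_affinity or (lambda: set(os.sched_getaffinity(0)))
    try:
        set_affinity(set(cpus))
    except OSError as exc:
        # Report the error, then still read back what the kernel has.
        print(f"bind failed: {exc}")
    actual = set(get_affinity())
    return actual == set(cpus), actual
```

The setter and getter are parameters so the logic can be exercised without changing the affinity of a real process.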

hzhou commented 1 month ago

I was not able to reproduce it yet:

zhouh@amdgpu04:~/pull_requests/mpich-main> HYDRA_TOPO_DEBUG=1 mpirun -n 2 -bind-to user:0-15,16-31 ./t
[proxy:0@amdgpu04] created hwloc xml file /tmp/hydra_hwloc_xmlfile_Rvrhg6
process 0 binding: 1111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
process 1 binding: 0000000000000000111111111111111100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
To affinity and beyond!! nname= amdgpu04  rnk= 0  tid= 0: list_cores= (0-15)

To affinity and beyond!! nname= amdgpu04  rnk= 1  tid= 0: list_cores= (16-31)

Anyway, this patch (https://github.com/pmodels/mpich/pull/7069/files) adds an error check for hwloc_set_cpubind. Give it a try and see whether it prints any error message when you reproduce the issue.

colleeneb commented 1 month ago

Oh interesting! I will retest and try out the patch, thanks!

colleeneb commented 1 month ago

I tried rebuilding with:

git clone -b 2407_hydra_bind git@github.com:hzhou/mpich.git
cd mpich
git submodule update --init
./autogen.sh
./configure --disable-option-checking --prefix=$HOME/install --with-hwloc=embedded --with-hip=/soft/compilers/rocm/rocm-6.1.0 --with-device=ch4:ucx --with-rocm=/soft/compilers/rocm/rocm-6.1.0 --with-ucx=embedded --cache-file=/dev/null CC=gcc

And I do see hwloc_set_cpubind reported as failing:

> HYDRA_TOPO_DEBUG=1 mpirun -n 2 -bind-to user:0-15,16-31 ./a.out
[proxy:0@amdgpu04] created hwloc xml file /tmp/hydra_hwloc_xmlfile_j8xay8
process 0 binding: 11111111111111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000
process 1 binding: 00000000000000001111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000
[proxy:0@amdgpu04] HYDT_topo_hwloc_bind (lib/tools/topo/hwloc/topo_hwloc.c:699): hwloc_set_cpubind failed, rc = 0
[proxy:0@amdgpu04] HYDT_topo_bind (lib/tools/topo/topo.c:102): HWLOC failure binding process to core
[proxy:0@amdgpu04] HYDU_create_process (lib/utils/launch.c:66): bind process failed
[proxy:0@amdgpu04] launch_procs (proxy/pmip_cb.c:1008): create process returned error
[proxy:0@amdgpu04] handle_launch_procs (proxy/pmip_cb.c:588): launch_procs returned error
[proxy:0@amdgpu04] HYD_pmcd_pmip_control_cmd_cb (proxy/pmip_cb.c:498): launch_procs returned error
[proxy:0@amdgpu04] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0@amdgpu04] main (proxy/pmip.c:122): demux engine error waiting for event
[proxy:0@amdgpu04] HYDT_topo_hwloc_bind (lib/tools/topo/hwloc/topo_hwloc.c:699): hwloc_set_cpubind failed, rc = 0
[proxy:0@amdgpu04] HYDT_topo_bind (lib/tools/topo/topo.c:102): HWLOC failure binding process to core
[proxy:0@amdgpu04] HYDU_create_process (lib/utils/launch.c:66): bind process failed
[proxy:0@amdgpu04] launch_procs (proxy/pmip_cb.c:1008): create process returned error
[proxy:0@amdgpu04] handle_launch_procs (proxy/pmip_cb.c:588): launch_procs returned error
[proxy:0@amdgpu04] HYD_pmcd_pmip_control_cmd_cb (proxy/pmip_cb.c:498): launch_procs returned error
[proxy:0@amdgpu04] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0@amdgpu04] main (proxy/pmip.c:122): demux engine error waiting for event

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 217154 RUNNING AT amdgpu04
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0@amdgpu04] removed file /tmp/hydra_hwloc_xmlfile_j8xay8
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Since it wasn't failing for you, maybe I did something wrong in the configure line? For example, I used --with-hwloc=embedded, but maybe that's not the right option in this case? Thanks a lot for the help!

hzhou commented 1 month ago

[proxy:0@amdgpu04] HYDT_topo_hwloc_bind (lib/tools/topo/hwloc/topo_hwloc.c:699): hwloc_set_cpubind failed, rc = 0

Oops! I think I reversed the condition. rc = 0 is supposed to mean success ;) Could you rebuild and try it again? If you manually change the file src/pm/hydra/lib/tools/topo/hwloc/topo_hwloc.c (see https://github.com/pmodels/mpich/pull/7069/files), you may skip autogen and configure. It'll be a quick rebuild.
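The "failed, rc = 0" message is what a reversed status check produces. The actual patch is not reproduced here; the Python sketch below only illustrates the C-style convention being discussed, where 0 means success and a nonzero value means failure. Both function names are hypothetical.

```python
def bind_failed(rc):
    """Interpret a C-style status code: 0 is success, anything else is failure."""
    return rc != 0


def bind_failed_inverted(rc):
    """The reversed test: it reports failure exactly when the call succeeded."""
    return rc == 0


if __name__ == "__main__":
    for rc in (0, -1):
        print(f"rc = {rc}: correct -> failed={bind_failed(rc)}, "
              f"inverted -> failed={bind_failed_inverted(rc)}")
```

With the inverted test, a successful call (rc = 0) is treated as an error and the launch is aborted, which is consistent with the crash in the previous log.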

colleeneb commented 1 month ago

Thanks! I rebuilt MPICH and it no longer crashes, but unfortunately the affinity still looks wrong. Both ranks now report list_cores= (0-255):

> HYDRA_TOPO_DEBUG=1 mpirun -n 2 -bind-to user:0-15,16-31 ./a.out
[proxy:0@amdgpu04] created hwloc xml file /tmp/hydra_hwloc_xmlfile_28OnOB
process 0 binding: 1111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
process 1 binding: 0000000000000000111111111111111100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
To affinity and beyond!! nname= amdgpu04  rnk= 0  tid= 0: list_cores= (0-255)

To affinity and beyond!! nname= amdgpu04  rnk= 1  tid= 0: list_cores= (0-255)

[proxy:0@amdgpu04] removed file /tmp/hydra_hwloc_xmlfile_28OnOB