Closed colleeneb closed 1 week ago
What is the output from lstopo
(a utility from hwloc
)?
Thanks! lstopo is
> lstopo
Machine (502GB total)
Package L#0
NUMANode L#0 (P#0 251GB)
L3 L#0 (32MB)
L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#128)
L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#129)
L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#130)
L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#131)
L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#132)
L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#133)
L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#134)
L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#135)
L3 L#1 (32MB)
L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
PU L#16 (P#8)
PU L#17 (P#136)
L2 L#9 (512KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
PU L#18 (P#9)
PU L#19 (P#137)
L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
PU L#20 (P#10)
PU L#21 (P#138)
L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
PU L#22 (P#11)
PU L#23 (P#139)
L2 L#12 (512KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
PU L#24 (P#12)
PU L#25 (P#140)
L2 L#13 (512KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
PU L#26 (P#13)
PU L#27 (P#141)
L2 L#14 (512KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
PU L#28 (P#14)
PU L#29 (P#142)
L2 L#15 (512KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
PU L#30 (P#15)
PU L#31 (P#143)
L3 L#2 (32MB)
L2 L#16 (512KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
PU L#32 (P#16)
PU L#33 (P#144)
L2 L#17 (512KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
PU L#34 (P#17)
PU L#35 (P#145)
L2 L#18 (512KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
PU L#36 (P#18)
PU L#37 (P#146)
L2 L#19 (512KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
PU L#38 (P#19)
PU L#39 (P#147)
L2 L#20 (512KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
PU L#40 (P#20)
PU L#41 (P#148)
L2 L#21 (512KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
PU L#42 (P#21)
PU L#43 (P#149)
L2 L#22 (512KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
PU L#44 (P#22)
PU L#45 (P#150)
L2 L#23 (512KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
PU L#46 (P#23)
PU L#47 (P#151)
L3 L#3 (32MB)
L2 L#24 (512KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
PU L#48 (P#24)
PU L#49 (P#152)
L2 L#25 (512KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
PU L#50 (P#25)
PU L#51 (P#153)
L2 L#26 (512KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
PU L#52 (P#26)
PU L#53 (P#154)
L2 L#27 (512KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
PU L#54 (P#27)
PU L#55 (P#155)
L2 L#28 (512KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
PU L#56 (P#28)
PU L#57 (P#156)
L2 L#29 (512KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
PU L#58 (P#29)
PU L#59 (P#157)
L2 L#30 (512KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
PU L#60 (P#30)
PU L#61 (P#158)
L2 L#31 (512KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
PU L#62 (P#31)
PU L#63 (P#159)
L3 L#4 (32MB)
L2 L#32 (512KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32
PU L#64 (P#32)
PU L#65 (P#160)
L2 L#33 (512KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33
PU L#66 (P#33)
PU L#67 (P#161)
L2 L#34 (512KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34
PU L#68 (P#34)
PU L#69 (P#162)
L2 L#35 (512KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35
PU L#70 (P#35)
PU L#71 (P#163)
L2 L#36 (512KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36
PU L#72 (P#36)
PU L#73 (P#164)
L2 L#37 (512KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37
PU L#74 (P#37)
PU L#75 (P#165)
L2 L#38 (512KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38
PU L#76 (P#38)
PU L#77 (P#166)
L2 L#39 (512KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39
PU L#78 (P#39)
PU L#79 (P#167)
L3 L#5 (32MB)
L2 L#40 (512KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40
PU L#80 (P#40)
PU L#81 (P#168)
L2 L#41 (512KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41
PU L#82 (P#41)
PU L#83 (P#169)
L2 L#42 (512KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42
PU L#84 (P#42)
PU L#85 (P#170)
L2 L#43 (512KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43
PU L#86 (P#43)
PU L#87 (P#171)
L2 L#44 (512KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44
PU L#88 (P#44)
PU L#89 (P#172)
L2 L#45 (512KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45
PU L#90 (P#45)
PU L#91 (P#173)
L2 L#46 (512KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46
PU L#92 (P#46)
PU L#93 (P#174)
L2 L#47 (512KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47
PU L#94 (P#47)
PU L#95 (P#175)
L3 L#6 (32MB)
L2 L#48 (512KB) + L1d L#48 (32KB) + L1i L#48 (32KB) + Core L#48
PU L#96 (P#48)
PU L#97 (P#176)
L2 L#49 (512KB) + L1d L#49 (32KB) + L1i L#49 (32KB) + Core L#49
PU L#98 (P#49)
PU L#99 (P#177)
L2 L#50 (512KB) + L1d L#50 (32KB) + L1i L#50 (32KB) + Core L#50
PU L#100 (P#50)
PU L#101 (P#178)
L2 L#51 (512KB) + L1d L#51 (32KB) + L1i L#51 (32KB) + Core L#51
PU L#102 (P#51)
PU L#103 (P#179)
L2 L#52 (512KB) + L1d L#52 (32KB) + L1i L#52 (32KB) + Core L#52
PU L#104 (P#52)
PU L#105 (P#180)
L2 L#53 (512KB) + L1d L#53 (32KB) + L1i L#53 (32KB) + Core L#53
PU L#106 (P#53)
PU L#107 (P#181)
L2 L#54 (512KB) + L1d L#54 (32KB) + L1i L#54 (32KB) + Core L#54
PU L#108 (P#54)
PU L#109 (P#182)
L2 L#55 (512KB) + L1d L#55 (32KB) + L1i L#55 (32KB) + Core L#55
PU L#110 (P#55)
PU L#111 (P#183)
L3 L#7 (32MB)
L2 L#56 (512KB) + L1d L#56 (32KB) + L1i L#56 (32KB) + Core L#56
PU L#112 (P#56)
PU L#113 (P#184)
L2 L#57 (512KB) + L1d L#57 (32KB) + L1i L#57 (32KB) + Core L#57
PU L#114 (P#57)
PU L#115 (P#185)
L2 L#58 (512KB) + L1d L#58 (32KB) + L1i L#58 (32KB) + Core L#58
PU L#116 (P#58)
PU L#117 (P#186)
L2 L#59 (512KB) + L1d L#59 (32KB) + L1i L#59 (32KB) + Core L#59
PU L#118 (P#59)
PU L#119 (P#187)
L2 L#60 (512KB) + L1d L#60 (32KB) + L1i L#60 (32KB) + Core L#60
PU L#120 (P#60)
PU L#121 (P#188)
L2 L#61 (512KB) + L1d L#61 (32KB) + L1i L#61 (32KB) + Core L#61
PU L#122 (P#61)
PU L#123 (P#189)
L2 L#62 (512KB) + L1d L#62 (32KB) + L1i L#62 (32KB) + Core L#62
PU L#124 (P#62)
PU L#125 (P#190)
L2 L#63 (512KB) + L1d L#63 (32KB) + L1i L#63 (32KB) + Core L#63
PU L#126 (P#63)
PU L#127 (P#191)
HostBridge
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCI 05:00.0 (Ethernet)
Net "eth0"
PCI 05:00.1 (Ethernet)
Net "eth1"
PCIBridge
PCI 06:00.0 (NVMExp)
Block(Disk) "nvme0n1"
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCI 11:00.0 (Display)
PCIBridge
PCIBridge
PCIBridge
PCI 14:00.0 (Display)
PCIBridge
PCI 19:00.0 (Storage)
HostBridge
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCI 31:00.0 (Display)
PCIBridge
PCIBridge
PCIBridge
PCI 34:00.0 (Display)
PCIBridge
PCI 39:00.0 (Storage)
HostBridge
PCIBridge
PCI 43:00.0 (SATA)
HostBridge
PCIBridge
PCIBridge
PCI 62:00.0 (VGA)
Package L#1
NUMANode L#1 (P#1 251GB)
L3 L#8 (32MB)
L2 L#64 (512KB) + L1d L#64 (32KB) + L1i L#64 (32KB) + Core L#64
PU L#128 (P#64)
PU L#129 (P#192)
L2 L#65 (512KB) + L1d L#65 (32KB) + L1i L#65 (32KB) + Core L#65
PU L#130 (P#65)
PU L#131 (P#193)
L2 L#66 (512KB) + L1d L#66 (32KB) + L1i L#66 (32KB) + Core L#66
PU L#132 (P#66)
PU L#133 (P#194)
L2 L#67 (512KB) + L1d L#67 (32KB) + L1i L#67 (32KB) + Core L#67
PU L#134 (P#67)
PU L#135 (P#195)
L2 L#68 (512KB) + L1d L#68 (32KB) + L1i L#68 (32KB) + Core L#68
PU L#136 (P#68)
PU L#137 (P#196)
L2 L#69 (512KB) + L1d L#69 (32KB) + L1i L#69 (32KB) + Core L#69
PU L#138 (P#69)
PU L#139 (P#197)
L2 L#70 (512KB) + L1d L#70 (32KB) + L1i L#70 (32KB) + Core L#70
PU L#140 (P#70)
PU L#141 (P#198)
L2 L#71 (512KB) + L1d L#71 (32KB) + L1i L#71 (32KB) + Core L#71
PU L#142 (P#71)
PU L#143 (P#199)
L3 L#9 (32MB)
L2 L#72 (512KB) + L1d L#72 (32KB) + L1i L#72 (32KB) + Core L#72
PU L#144 (P#72)
PU L#145 (P#200)
L2 L#73 (512KB) + L1d L#73 (32KB) + L1i L#73 (32KB) + Core L#73
PU L#146 (P#73)
PU L#147 (P#201)
L2 L#74 (512KB) + L1d L#74 (32KB) + L1i L#74 (32KB) + Core L#74
PU L#148 (P#74)
PU L#149 (P#202)
L2 L#75 (512KB) + L1d L#75 (32KB) + L1i L#75 (32KB) + Core L#75
PU L#150 (P#75)
PU L#151 (P#203)
L2 L#76 (512KB) + L1d L#76 (32KB) + L1i L#76 (32KB) + Core L#76
PU L#152 (P#76)
PU L#153 (P#204)
L2 L#77 (512KB) + L1d L#77 (32KB) + L1i L#77 (32KB) + Core L#77
PU L#154 (P#77)
PU L#155 (P#205)
L2 L#78 (512KB) + L1d L#78 (32KB) + L1i L#78 (32KB) + Core L#78
PU L#156 (P#78)
PU L#157 (P#206)
L2 L#79 (512KB) + L1d L#79 (32KB) + L1i L#79 (32KB) + Core L#79
PU L#158 (P#79)
PU L#159 (P#207)
L3 L#10 (32MB)
L2 L#80 (512KB) + L1d L#80 (32KB) + L1i L#80 (32KB) + Core L#80
PU L#160 (P#80)
PU L#161 (P#208)
L2 L#81 (512KB) + L1d L#81 (32KB) + L1i L#81 (32KB) + Core L#81
PU L#162 (P#81)
PU L#163 (P#209)
L2 L#82 (512KB) + L1d L#82 (32KB) + L1i L#82 (32KB) + Core L#82
PU L#164 (P#82)
PU L#165 (P#210)
L2 L#83 (512KB) + L1d L#83 (32KB) + L1i L#83 (32KB) + Core L#83
PU L#166 (P#83)
PU L#167 (P#211)
L2 L#84 (512KB) + L1d L#84 (32KB) + L1i L#84 (32KB) + Core L#84
PU L#168 (P#84)
PU L#169 (P#212)
L2 L#85 (512KB) + L1d L#85 (32KB) + L1i L#85 (32KB) + Core L#85
PU L#170 (P#85)
PU L#171 (P#213)
L2 L#86 (512KB) + L1d L#86 (32KB) + L1i L#86 (32KB) + Core L#86
PU L#172 (P#86)
PU L#173 (P#214)
L2 L#87 (512KB) + L1d L#87 (32KB) + L1i L#87 (32KB) + Core L#87
PU L#174 (P#87)
PU L#175 (P#215)
L3 L#11 (32MB)
L2 L#88 (512KB) + L1d L#88 (32KB) + L1i L#88 (32KB) + Core L#88
PU L#176 (P#88)
PU L#177 (P#216)
L2 L#89 (512KB) + L1d L#89 (32KB) + L1i L#89 (32KB) + Core L#89
PU L#178 (P#89)
PU L#179 (P#217)
L2 L#90 (512KB) + L1d L#90 (32KB) + L1i L#90 (32KB) + Core L#90
PU L#180 (P#90)
PU L#181 (P#218)
L2 L#91 (512KB) + L1d L#91 (32KB) + L1i L#91 (32KB) + Core L#91
PU L#182 (P#91)
PU L#183 (P#219)
L2 L#92 (512KB) + L1d L#92 (32KB) + L1i L#92 (32KB) + Core L#92
PU L#184 (P#92)
PU L#185 (P#220)
L2 L#93 (512KB) + L1d L#93 (32KB) + L1i L#93 (32KB) + Core L#93
PU L#186 (P#93)
PU L#187 (P#221)
L2 L#94 (512KB) + L1d L#94 (32KB) + L1i L#94 (32KB) + Core L#94
PU L#188 (P#94)
PU L#189 (P#222)
L2 L#95 (512KB) + L1d L#95 (32KB) + L1i L#95 (32KB) + Core L#95
PU L#190 (P#95)
PU L#191 (P#223)
L3 L#12 (32MB)
L2 L#96 (512KB) + L1d L#96 (32KB) + L1i L#96 (32KB) + Core L#96
PU L#192 (P#96)
PU L#193 (P#224)
L2 L#97 (512KB) + L1d L#97 (32KB) + L1i L#97 (32KB) + Core L#97
PU L#194 (P#97)
PU L#195 (P#225)
L2 L#98 (512KB) + L1d L#98 (32KB) + L1i L#98 (32KB) + Core L#98
PU L#196 (P#98)
PU L#197 (P#226)
L2 L#99 (512KB) + L1d L#99 (32KB) + L1i L#99 (32KB) + Core L#99
PU L#198 (P#99)
PU L#199 (P#227)
L2 L#100 (512KB) + L1d L#100 (32KB) + L1i L#100 (32KB) + Core L#100
PU L#200 (P#100)
PU L#201 (P#228)
L2 L#101 (512KB) + L1d L#101 (32KB) + L1i L#101 (32KB) + Core L#101
PU L#202 (P#101)
PU L#203 (P#229)
L2 L#102 (512KB) + L1d L#102 (32KB) + L1i L#102 (32KB) + Core L#102
PU L#204 (P#102)
PU L#205 (P#230)
L2 L#103 (512KB) + L1d L#103 (32KB) + L1i L#103 (32KB) + Core L#103
PU L#206 (P#103)
PU L#207 (P#231)
L3 L#13 (32MB)
L2 L#104 (512KB) + L1d L#104 (32KB) + L1i L#104 (32KB) + Core L#104
PU L#208 (P#104)
PU L#209 (P#232)
L2 L#105 (512KB) + L1d L#105 (32KB) + L1i L#105 (32KB) + Core L#105
PU L#210 (P#105)
PU L#211 (P#233)
L2 L#106 (512KB) + L1d L#106 (32KB) + L1i L#106 (32KB) + Core L#106
PU L#212 (P#106)
PU L#213 (P#234)
L2 L#107 (512KB) + L1d L#107 (32KB) + L1i L#107 (32KB) + Core L#107
PU L#214 (P#107)
PU L#215 (P#235)
L2 L#108 (512KB) + L1d L#108 (32KB) + L1i L#108 (32KB) + Core L#108
PU L#216 (P#108)
PU L#217 (P#236)
L2 L#109 (512KB) + L1d L#109 (32KB) + L1i L#109 (32KB) + Core L#109
PU L#218 (P#109)
PU L#219 (P#237)
L2 L#110 (512KB) + L1d L#110 (32KB) + L1i L#110 (32KB) + Core L#110
PU L#220 (P#110)
PU L#221 (P#238)
L2 L#111 (512KB) + L1d L#111 (32KB) + L1i L#111 (32KB) + Core L#111
PU L#222 (P#111)
PU L#223 (P#239)
L3 L#14 (32MB)
L2 L#112 (512KB) + L1d L#112 (32KB) + L1i L#112 (32KB) + Core L#112
PU L#224 (P#112)
PU L#225 (P#240)
L2 L#113 (512KB) + L1d L#113 (32KB) + L1i L#113 (32KB) + Core L#113
PU L#226 (P#113)
PU L#227 (P#241)
L2 L#114 (512KB) + L1d L#114 (32KB) + L1i L#114 (32KB) + Core L#114
PU L#228 (P#114)
PU L#229 (P#242)
L2 L#115 (512KB) + L1d L#115 (32KB) + L1i L#115 (32KB) + Core L#115
PU L#230 (P#115)
PU L#231 (P#243)
L2 L#116 (512KB) + L1d L#116 (32KB) + L1i L#116 (32KB) + Core L#116
PU L#232 (P#116)
PU L#233 (P#244)
L2 L#117 (512KB) + L1d L#117 (32KB) + L1i L#117 (32KB) + Core L#117
PU L#234 (P#117)
PU L#235 (P#245)
L2 L#118 (512KB) + L1d L#118 (32KB) + L1i L#118 (32KB) + Core L#118
PU L#236 (P#118)
PU L#237 (P#246)
L2 L#119 (512KB) + L1d L#119 (32KB) + L1i L#119 (32KB) + Core L#119
PU L#238 (P#119)
PU L#239 (P#247)
L3 L#15 (32MB)
L2 L#120 (512KB) + L1d L#120 (32KB) + L1i L#120 (32KB) + Core L#120
PU L#240 (P#120)
PU L#241 (P#248)
L2 L#121 (512KB) + L1d L#121 (32KB) + L1i L#121 (32KB) + Core L#121
PU L#242 (P#121)
PU L#243 (P#249)
L2 L#122 (512KB) + L1d L#122 (32KB) + L1i L#122 (32KB) + Core L#122
PU L#244 (P#122)
PU L#245 (P#250)
L2 L#123 (512KB) + L1d L#123 (32KB) + L1i L#123 (32KB) + Core L#123
PU L#246 (P#123)
PU L#247 (P#251)
L2 L#124 (512KB) + L1d L#124 (32KB) + L1i L#124 (32KB) + Core L#124
PU L#248 (P#124)
PU L#249 (P#252)
L2 L#125 (512KB) + L1d L#125 (32KB) + L1i L#125 (32KB) + Core L#125
PU L#250 (P#125)
PU L#251 (P#253)
L2 L#126 (512KB) + L1d L#126 (32KB) + L1i L#126 (32KB) + Core L#126
PU L#252 (P#126)
PU L#253 (P#254)
L2 L#127 (512KB) + L1d L#127 (32KB) + L1i L#127 (32KB) + Core L#127
PU L#254 (P#127)
PU L#255 (P#255)
HostBridge
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCI 8e:00.0 (Display)
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCI 93:00.0 (Display)
PCIBridge
PCI 99:00.0 (Storage)
HostBridge
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCI ab:00.0 (InfiniBand)
Net "ib0"
OpenFabrics "mlx5_0"
PCI ab:00.1 (InfiniBand)
Net "ib1"
OpenFabrics "mlx5_1"
PCIBridge
PCIBridge
PCIBridge
PCI ae:00.0 (Display)
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCI b3:00.0 (Display)
PCIBridge
PCI b9:00.0 (Storage)
HostBridge
PCIBridge
PCI c3:00.0 (SATA)
PCIBridge
PCI c4:00.0 (SATA)
We call hwloc_set_cpubind
to bind the cpus, but it appears we did not check the return code. So it seems somehow the function may fail -- which result in no binding 0-255
. I have no clue so far. Will investigate.
I was not able to reproduce it yet:
zhouh@amdgpu04:~/pull_requests/mpich-main> HYDRA_TOPO_DEBUG=1 mpirun -n 2 -bind-to user:0-15,16-31 ./t
[proxy:0@amdgpu04] created hwloc xml file /tmp/hydra_hwloc_xmlfile_Rvrhg6
process 0 binding: 1111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
process 1 binding: 0000000000000000111111111111111100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
To affinity and beyond!! nname= amdgpu04 rnk= 0 tid= 0: list_cores= (0-15)
To affinity and beyond!! nname= amdgpu04 rnk= 1 tid= 0: list_cores= (16-31)
Anyway, this patch (https://github.com/pmodels/mpich/pull/7069/files) adds an error check for hwloc_set_cpubind
. Give it try and see if it prints out any error message when you reproduce the issue.
Oh interesting! I will retest and try out the patch, thanks!
I tried rebuilding with:
git clone -b 2407_hydra_bind git@github.com:hzhou/mpich.git
cd mpich
git submodule update --init
./autogen.sh
./configure --disable-option-checking --prefix=$HOME/install --with-hwloc=embedded --with-hip=/soft/compilers/rocm/rocm-6.1.0 --with-device=ch4:ucx --with-rocm=/soft/compilers/rocm/rocm-6.1.0 --with-ucx=embedded --cache-file=/dev/null CC=gcc
And I do see the hwloc_set_cpubind fail:
> HYDRA_TOPO_DEBUG=1 mpirun -n 2 -bind-to user:0-15,16-31 ./a.out
[proxy:0@amdgpu04] created hwloc xml file /tmp/hydra_hwloc_xmlfile_j8xay8
process 0 binding: 11111111111111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000
process 1 binding: 00000000000000001111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000
[proxy:0@amdgpu04] HYDT_topo_hwloc_bind (lib/tools/topo/hwloc/topo_hwloc.c:699): hwloc_set_cpubind failed, rc = 0
[proxy:0@amdgpu04] HYDT_topo_bind (lib/tools/topo/topo.c:102): HWLOC failure binding process to core
[proxy:0@amdgpu04] HYDU_create_process (lib/utils/launch.c:66): bind process failed
[proxy:0@amdgpu04] launch_procs (proxy/pmip_cb.c:1008): create process returned error
[proxy:0@amdgpu04] handle_launch_procs (proxy/pmip_cb.c:588): launch_procs returned error
[proxy:0@amdgpu04] HYD_pmcd_pmip_control_cmd_cb (proxy/pmip_cb.c:498): launch_procs returned error
[proxy:0@amdgpu04] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0@amdgpu04] main (proxy/pmip.c:122): demux engine error waiting for event
[proxy:0@amdgpu04] HYDT_topo_hwloc_bind (lib/tools/topo/hwloc/topo_hwloc.c:699): hwloc_set_cpubind failed, rc = 0
[proxy:0@amdgpu04] HYDT_topo_bind (lib/tools/topo/topo.c:102): HWLOC failure binding process to core
[proxy:0@amdgpu04] HYDU_create_process (lib/utils/launch.c:66): bind process failed
[proxy:0@amdgpu04] launch_procs (proxy/pmip_cb.c:1008): create process returned error
[proxy:0@amdgpu04] handle_launch_procs (proxy/pmip_cb.c:588): launch_procs returned error
[proxy:0@amdgpu04] HYD_pmcd_pmip_control_cmd_cb (proxy/pmip_cb.c:498): launch_procs returned error
[proxy:0@amdgpu04] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0@amdgpu04] main (proxy/pmip.c:122): demux engine error waiting for event
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 217154 RUNNING AT amdgpu04
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0@amdgpu04] removed file /tmp/hydra_hwloc_xmlfile_j8xay8
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Since it wasn't failing for you, did I do something wrong in the build configure line maybe? For example, I was using --with-hwloc=embedded
but maybe that's not something to use in this case? Thanks a lot for the help!
[proxy:0@amdgpu04] HYDT_topo_hwloc_bind (lib/tools/topo/hwloc/topo_hwloc.c:699): hwloc_set_cpubind failed, rc = 0
Oops! I think I reversed the condition. rc = 0
is supposed to mean success ;) Could you rebuild and try it again? If you manually change the file src/pm/hydra/lib/tools/topo/hwloc/topo_hwloc.c:
(https://github.com/pmodels/mpich/pull/7069/files), you may skip autogen and configure. It'll be a quick rebuild.
Thanks! I rebuilt MPICH and it doesn't crash but I still see the affinity looking strange unfortunately:
> HYDRA_TOPO_DEBUG=1 mpirun -n 2 -bind-to user:0-15,16-31 ./a.out
[proxy:0@amdgpu04] created hwloc xml file /tmp/hydra_hwloc_xmlfile_28OnOB
process 0 binding: 1111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
process 1 binding: 0000000000000000111111111111111100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
To affinity and beyond!! nname= amdgpu04 rnk= 0 tid= 0: list_cores= (0-255)
To affinity and beyond!! nname= amdgpu04 rnk= 1 tid= 0: list_cores= (0-255)
[proxy:0@amdgpu04] removed file /tmp/hydra_hwloc_xmlfile_28OnOB
Summary
This is to report a possible issue / unexpected behavior with AMD-GPU-aware MPICH on an Supermicro AS-4124GQ-TNMI (2x AMD EPYC 7713 with 4 AMD Instinct MI250). @raffenet kindly built MPICH there and it works fine but we noticed some odd behavior with the cpu-binding. I was trying to get each rank to bind in a consecutive manner to the 16 cores, i.e. rank 0 to cores 0 to 15 and rank 1 to cores 16-31. From my understanding this should be
mpirun -n 2 -bind-to user:0-15,16-31
. When I tried this however, the output of HYDRA_TOPO_DEBUG looks correct but with a code we’ve used before to check affinity (https://github.com/argonne-lcf/GettingStarted/blob/master/Examples/Theta/affinity/main.cpp) it acts like it's not binding correctly, and it changes when we run multiple times. We use the affinity code at ALCF for many systems, so it's unlikely (but always possible!) that there is an issue with it. I was able to confirm withhtop
that the ranks weren't running on their set of cores as well.Reproducer
(the module is specific to our system but it loads MPICH with AMD GPU support)
Expected Output
We expect the affinity code to print
list_cores= (0-15)
for rank 0 andlist_cores= (16-31)
for rank 1, like:Actual Output
As we can see below, the first run,
list_cores= (0-15)
for rank 0 butlist_cores= (0-255)
for rank 1. For the second run,list_cores= (0-255)
for rank 0 andlist_cores= (16-31)
for rank 1.