Closed asaiacai closed 1 month ago
Hey - sorry for the late reply!
Happy to share, but just remember we're on a pure IB fabric, i.e. no RoCE or IPoIB, so YMMV.
First, let's take a look at all the ConnectX cards on our hosts, with some extra comments added. Note that Up/Down here is slightly misleading, since it reflects the state of the IP / ethernet network interface, not the InfiniBand link itself.
$ ibdev2netdev | column -t
mlx5_0 port 1 ==> ibp26s0 (Down) # CX-7 1/8 infiniband
mlx5_1 port 1 ==> enp27s0f0np0 (Up) # CX-6 2/2 *ethernet* port 1/2
mlx5_10 port 1 ==> ibp204s0 (Down) # CX-7 7/8 infiniband
mlx5_11 port 1 ==> ibp220s0 (Down) # CX-7 8/8 infiniband
mlx5_2 port 1 ==> enp27s0f1np1 (Down) # CX-6 2/2 *ethernet* port 2/2
mlx5_3 port 1 ==> ibp60s0 (Down) # CX-7 2/8 infiniband
mlx5_4 port 1 ==> ibp77s0 (Down) # CX-7 3/8 infiniband
mlx5_5 port 1 ==> ibp94s0 (Down) # CX-7 4/8 infiniband
mlx5_6 port 1 ==> ibp156s0 (Down) # CX-7 5/8 infiniband
mlx5_7 port 1 ==> enp157s0f0np0 (Up) # CX-6 1/2 *ethernet* port 1/2
mlx5_8 port 1 ==> enp157s0f1np1 (Down) # CX-6 1/2 *ethernet* port 2/2
mlx5_9 port 1 ==> ibp188s0 (Down) # CX-7 6/8 infiniband
The status for non-IB cards doesn't really matter here, but this is roughly what ibstat shows for us:
CA 'mlx5_0'
CA type: MT4129
Number of ports: 1
Firmware version: 28.39.1002
Hardware version: 0
Node GUID: 0x946dae0300afd548
System image GUID: 0x946dae0300afd548
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 2176
LMC: 0
SM lid: 1
Capability mask: 0xa751e848
Port GUID: 0x946dae0300afd548
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT4125
Number of ports: 1
Firmware version: 22.36.1010
Hardware version: 0
Node GUID: 0xe8ebd30300f74afe
System image GUID: 0xe8ebd30300f74afe
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0xeaebd3fffef74afe
Link layer: Ethernet
CA 'mlx5_10'
CA type: MT4129
Number of ports: 1
Firmware version: 28.39.1002
Hardware version: 0
Node GUID: 0x946dae0300afd518
System image GUID: 0x946dae0300afd518
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 891
LMC: 0
SM lid: 1
Capability mask: 0xa751e848
Port GUID: 0x946dae0300afd518
Link layer: InfiniBand
CA 'mlx5_11'
CA type: MT4129
Number of ports: 1
Firmware version: 28.39.1002
Hardware version: 0
Node GUID: 0x946dae0300afd278
System image GUID: 0x946dae0300afd278
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 3070
LMC: 0
SM lid: 1
Capability mask: 0xa751e848
Port GUID: 0x946dae0300afd278
Link layer: InfiniBand
CA 'mlx5_2'
CA type: MT4125
Number of ports: 1
Firmware version: 22.36.1010
Hardware version: 0
Node GUID: 0xe8ebd30300f74aff
System image GUID: 0xe8ebd30300f74afe
Port 1:
State: Down
Physical state: Disabled
Rate: 40
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0xeaebd3fffef74aff
Link layer: Ethernet
CA 'mlx5_3'
CA type: MT4129
Number of ports: 1
Firmware version: 28.39.1002
Hardware version: 0
Node GUID: 0x946dae0300afd538
System image GUID: 0x946dae0300afd538
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 1988
LMC: 0
SM lid: 1
Capability mask: 0xa751e848
Port GUID: 0x946dae0300afd538
Link layer: InfiniBand
CA 'mlx5_4'
CA type: MT4129
Number of ports: 1
Firmware version: 28.39.1002
Hardware version: 0
Node GUID: 0x946dae0300cdf85c
System image GUID: 0x946dae0300cdf85c
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 2545
LMC: 0
SM lid: 1
Capability mask: 0xa751e848
Port GUID: 0x946dae0300cdf85c
Link layer: InfiniBand
CA 'mlx5_5'
CA type: MT4129
Number of ports: 1
Firmware version: 28.39.1002
Hardware version: 0
Node GUID: 0x946dae0300afd2fc
System image GUID: 0x946dae0300afd2fc
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 4593
LMC: 0
SM lid: 1
Capability mask: 0xa751e848
Port GUID: 0x946dae0300afd2fc
Link layer: InfiniBand
CA 'mlx5_6'
CA type: MT4129
Number of ports: 1
Firmware version: 28.39.1002
Hardware version: 0
Node GUID: 0x946dae0300afd51c
System image GUID: 0x946dae0300afd51c
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 1174
LMC: 0
SM lid: 1
Capability mask: 0xa751e848
Port GUID: 0x946dae0300afd51c
Link layer: InfiniBand
CA 'mlx5_7'
CA type: MT4125
Number of ports: 1
Firmware version: 22.36.1010
Hardware version: 0
Node GUID: 0xe8ebd30300f73d26
System image GUID: 0xe8ebd30300f73d26
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0xeaebd3fffef73d26
Link layer: Ethernet
CA 'mlx5_8'
CA type: MT4125
Number of ports: 1
Firmware version: 22.36.1010
Hardware version: 0
Node GUID: 0xe8ebd30300f73d27
System image GUID: 0xe8ebd30300f73d26
Port 1:
State: Down
Physical state: Disabled
Rate: 40
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0xeaebd3fffef73d27
Link layer: Ethernet
CA 'mlx5_9'
CA type: MT4129
Number of ports: 1
Firmware version: 28.39.1002
Hardware version: 0
Node GUID: 0x946dae0300afd2e0
System image GUID: 0x946dae0300afd2e0
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 4320
LMC: 0
SM lid: 1
Capability mask: 0xa751e848
Port GUID: 0x946dae0300afd2e0
Link layer: InfiniBand
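If you'd rather check this programmatically than eyeball it, here is a minimal sketch that parses ibstat-style text and lists any InfiniBand devices that are not Active / LinkUp. This is not something from our tooling: the function names `parse_ibstat` and `inactive_ib_devices` are made up for illustration, and it assumes one port per CA, as in the output above.

```python
import re


def parse_ibstat(text):
    """Parse ibstat-style text into {ca_name: {field: value}}.

    Assumption: each CA has a single port, so port fields are stored
    flat alongside the CA fields (a second port would overwrite them).
    """
    cas = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()  # real ibstat output is tab-indented
        match = re.match(r"CA '([^']+)'", line)
        if match:
            current = cas.setdefault(match.group(1), {})
            continue
        if current is None or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if value:  # skip section headers such as "Port 1:"
            current[key.strip()] = value
    return cas


def inactive_ib_devices(cas):
    """Return names of InfiniBand CAs whose port is not Active and LinkUp."""
    return sorted(
        name
        for name, fields in cas.items()
        if fields.get("Link layer") == "InfiniBand"
        and (
            fields.get("State") != "Active"
            or fields.get("Physical state") != "LinkUp"
        )
    )


# Shortened sample taken from the output above.
SAMPLE = """\
CA 'mlx5_0'
CA type: MT4129
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Link layer: InfiniBand
CA 'mlx5_2'
CA type: MT4125
Port 1:
State: Down
Physical state: Disabled
Rate: 40
Link layer: Ethernet
"""

if __name__ == "__main__":
    print(inactive_ib_devices(parse_ibstat(SAMPLE)))
```

On a real host you could feed it the output of `subprocess.run(["ibstat"], capture_output=True, text=True).stdout` instead of the sample string. The Ethernet ports are skipped on purpose, since their state doesn't matter for the IB fabric.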
Hello, thanks for sharing your work on GPU infrastructure. I'm wondering what the expected output of ibstat is on the H100 leaf nodes.