imbue-ai / cluster-health

MIT License

ibstat expected output #4

Closed asaiacai closed 1 month ago

asaiacai commented 2 months ago

Hello, thanks for sharing your work on GPU infrastructure. I'm wondering what the expected output of ibstat is on the H100 leaf nodes.

bawr commented 1 month ago

Hey - sorry for the late reply!

Happy to share, but just remember we're on a pure IB fabric, i.e. no RoCE or IPoIB, so YMMV.

First, let's just take a look at all the ConnectX cards on our hosts, with some extra comments. Note that Up/Down here is slightly misleading, since it only reflects the IP / Ethernet side of things.

$ ibdev2netdev | column -t
mlx5_0   port  1  ==>  ibp26s0        (Down)  # CX-7 1/8 infiniband
mlx5_1   port  1  ==>  enp27s0f0np0   (Up)    # CX-6 2/2 *ethernet* port 1/2
mlx5_10  port  1  ==>  ibp204s0       (Down)  # CX-7 7/8 infiniband
mlx5_11  port  1  ==>  ibp220s0       (Down)  # CX-7 8/8 infiniband
mlx5_2   port  1  ==>  enp27s0f1np1   (Down)  # CX-6 2/2 *ethernet* port 2/2
mlx5_3   port  1  ==>  ibp60s0        (Down)  # CX-7 2/8 infiniband
mlx5_4   port  1  ==>  ibp77s0        (Down)  # CX-7 3/8 infiniband
mlx5_5   port  1  ==>  ibp94s0        (Down)  # CX-7 4/8 infiniband
mlx5_6   port  1  ==>  ibp156s0       (Down)  # CX-7 5/8 infiniband
mlx5_7   port  1  ==>  enp157s0f0np0  (Up)    # CX-6 1/2 *ethernet* port 1/2
mlx5_8   port  1  ==>  enp157s0f1np1  (Down)  # CX-6 1/2 *ethernet* port 2/2
mlx5_9   port  1  ==>  ibp188s0       (Down)  # CX-7 6/8 infiniband
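If you want to pick the InfiniBand devices out of that listing programmatically, here is a minimal Python sketch. It is not something from our tooling: the function names are made up for illustration, and it assumes the netdev naming shown above, where IB interfaces start with "ib" and Ethernet ones do not.

```python
import re

# Matches lines such as "mlx5_0 port 1 ==> ibp26s0 (Down)".
# Anything after the closing parenthesis (e.g. our hand-written comments) is ignored.
LINE_RE = re.compile(r"^(\S+)\s+port\s+(\d+)\s+==>\s+(\S+)\s+\((\w+)\)")


def parse_ibdev2netdev(text):
    """Return a list of (device, port, netdev, state) tuples (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            device, port, netdev, state = match.groups()
            rows.append((device, int(port), netdev, state))
    return rows


def split_by_netdev_prefix(rows):
    """Split device names into (infiniband, other), based on the netdev name.

    Assumption: IB interfaces are named "ib*", as in the listing above.
    """
    infiniband = [row[0] for row in rows if row[2].startswith("ib")]
    other = [row[0] for row in rows if not row[2].startswith("ib")]
    return infiniband, other


SAMPLE = """\
mlx5_0   port  1  ==>  ibp26s0        (Down)  # CX-7 1/8 infiniband
mlx5_1   port  1  ==>  enp27s0f0np0   (Up)    # CX-6 2/2 *ethernet* port 1/2
"""

if __name__ == "__main__":
    ib_devices, other_devices = split_by_netdev_prefix(parse_ibdev2netdev(SAMPLE))
    print("InfiniBand:", ib_devices)
    print("Other:", other_devices)
```

The (Up)/(Down) column is parsed but deliberately not used for the split, since on a pure IB fabric it says nothing about the IB link itself.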

The status of the non-IB cards doesn't really matter here, but ibstat shows something like this:


CA 'mlx5_0'
    CA type: MT4129
    Number of ports: 1
    Firmware version: 28.39.1002
    Hardware version: 0
    Node GUID: 0x946dae0300afd548
    System image GUID: 0x946dae0300afd548
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 400
        Base lid: 2176
        LMC: 0
        SM lid: 1
        Capability mask: 0xa751e848
        Port GUID: 0x946dae0300afd548
        Link layer: InfiniBand
CA 'mlx5_1'
    CA type: MT4125
    Number of ports: 1
    Firmware version: 22.36.1010
    Hardware version: 0
    Node GUID: 0xe8ebd30300f74afe
    System image GUID: 0xe8ebd30300f74afe
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0xeaebd3fffef74afe
        Link layer: Ethernet
CA 'mlx5_10'
    CA type: MT4129
    Number of ports: 1
    Firmware version: 28.39.1002
    Hardware version: 0
    Node GUID: 0x946dae0300afd518
    System image GUID: 0x946dae0300afd518
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 400
        Base lid: 891
        LMC: 0
        SM lid: 1
        Capability mask: 0xa751e848
        Port GUID: 0x946dae0300afd518
        Link layer: InfiniBand
CA 'mlx5_11'
    CA type: MT4129
    Number of ports: 1
    Firmware version: 28.39.1002
    Hardware version: 0
    Node GUID: 0x946dae0300afd278
    System image GUID: 0x946dae0300afd278
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 400
        Base lid: 3070
        LMC: 0
        SM lid: 1
        Capability mask: 0xa751e848
        Port GUID: 0x946dae0300afd278
        Link layer: InfiniBand
CA 'mlx5_2'
    CA type: MT4125
    Number of ports: 1
    Firmware version: 22.36.1010
    Hardware version: 0
    Node GUID: 0xe8ebd30300f74aff
    System image GUID: 0xe8ebd30300f74afe
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0xeaebd3fffef74aff
        Link layer: Ethernet
CA 'mlx5_3'
    CA type: MT4129
    Number of ports: 1
    Firmware version: 28.39.1002
    Hardware version: 0
    Node GUID: 0x946dae0300afd538
    System image GUID: 0x946dae0300afd538
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 400
        Base lid: 1988
        LMC: 0
        SM lid: 1
        Capability mask: 0xa751e848
        Port GUID: 0x946dae0300afd538
        Link layer: InfiniBand
CA 'mlx5_4'
    CA type: MT4129
    Number of ports: 1
    Firmware version: 28.39.1002
    Hardware version: 0
    Node GUID: 0x946dae0300cdf85c
    System image GUID: 0x946dae0300cdf85c
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 400
        Base lid: 2545
        LMC: 0
        SM lid: 1
        Capability mask: 0xa751e848
        Port GUID: 0x946dae0300cdf85c
        Link layer: InfiniBand
CA 'mlx5_5'
    CA type: MT4129
    Number of ports: 1
    Firmware version: 28.39.1002
    Hardware version: 0
    Node GUID: 0x946dae0300afd2fc
    System image GUID: 0x946dae0300afd2fc
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 400
        Base lid: 4593
        LMC: 0
        SM lid: 1
        Capability mask: 0xa751e848
        Port GUID: 0x946dae0300afd2fc
        Link layer: InfiniBand
CA 'mlx5_6'
    CA type: MT4129
    Number of ports: 1
    Firmware version: 28.39.1002
    Hardware version: 0
    Node GUID: 0x946dae0300afd51c
    System image GUID: 0x946dae0300afd51c
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 400
        Base lid: 1174
        LMC: 0
        SM lid: 1
        Capability mask: 0xa751e848
        Port GUID: 0x946dae0300afd51c
        Link layer: InfiniBand
CA 'mlx5_7'
    CA type: MT4125
    Number of ports: 1
    Firmware version: 22.36.1010
    Hardware version: 0
    Node GUID: 0xe8ebd30300f73d26
    System image GUID: 0xe8ebd30300f73d26
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0xeaebd3fffef73d26
        Link layer: Ethernet
CA 'mlx5_8'
    CA type: MT4125
    Number of ports: 1
    Firmware version: 22.36.1010
    Hardware version: 0
    Node GUID: 0xe8ebd30300f73d27
    System image GUID: 0xe8ebd30300f73d26
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0xeaebd3fffef73d27
        Link layer: Ethernet
CA 'mlx5_9'
    CA type: MT4129
    Number of ports: 1
    Firmware version: 28.39.1002
    Hardware version: 0
    Node GUID: 0x946dae0300afd2e0
    System image GUID: 0x946dae0300afd2e0
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 400
        Base lid: 4320
        LMC: 0
        SM lid: 1
        Capability mask: 0xa751e848
        Port GUID: 0x946dae0300afd2e0
        Link layer: InfiniBand
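
For comparing your own hosts against the output above, a rough Python sketch like the following could flag IB ports that differ from it. The helper names are hypothetical. It assumes one port per CA, as in our listing, and takes Active / LinkUp / Rate 400 as the "healthy" baseline simply because that is what our CX-7 ports report. Adjust it for your fabric.

```python
def parse_ibstat(text):
    """Parse ibstat-style text into {ca_name: {field: value}} (hypothetical helper).

    Assumption: each CA has a single port, so port fields are stored flat.
    With multi-port CAs, later ports would overwrite earlier ones.
    """
    cas = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("CA '") and line.endswith("'"):
            current = line[4:-1]
            cas[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            if value.strip():  # skip bare headers such as "Port 1:"
                cas[current][key.strip()] = value.strip()
    return cas


def unhealthy_ib_ports(cas, expected_rate="400"):
    """Return names of InfiniBand CAs that are not Active/LinkUp at expected_rate."""
    bad = []
    for name, fields in cas.items():
        if fields.get("Link layer") != "InfiniBand":
            continue  # Ethernet cards are ignored here
        healthy = (
            fields.get("State") == "Active"
            and fields.get("Physical state") == "LinkUp"
            and fields.get("Rate") == expected_rate
        )
        if not healthy:
            bad.append(name)
    return bad


SAMPLE = """\
CA 'mlx5_0'
    CA type: MT4129
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 400
        Link layer: InfiniBand
CA 'mlx5_2'
    CA type: MT4125
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 40
        Link layer: Ethernet
"""

if __name__ == "__main__":
    print("Unhealthy IB ports:", unhealthy_ib_ports(parse_ibstat(SAMPLE)))
```

In practice you would feed it the text captured from running ibstat on the host rather than the trimmed SAMPLE string.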