k8snetworkplumbingwg / sriov-network-metrics-exporter

Exporter that reads metrics for SR-IOV Virtual Functions and exposes them in the Prometheus format.
Apache License 2.0

Is ConnectX-6 supported? #19

Open jslouisyou opened 1 year ago

jslouisyou commented 1 year ago

Dear,

I'm currently using a Mellanox ConnectX-6 adapter (HPE InfiniBand HDR/Ethernet 200Gb 2-port QSFP56 PCIe4 x16 MCX653106A-HDAT) and trying to use sriov-network-metrics-exporter in a Kubernetes cluster, but the sriov-network-metrics-exporter pods can't get metrics from the InfiniBand Physical and Virtual Functions when I run kubectl exec -it -n monitoring sriov-metrics-exporter-lj8rq -- wget -O- localhost:9808/metrics (sriov-metrics-exporter-lj8rq is the exporter pod deployed in the cluster).

By the way, I dug into the code in collectors/sriovdev.go and found that netClass is only defined for 0x020000:

var (
    sysBusPci             = flag.String("path.sysbuspci", "/sys/bus/pci/devices", "Path to sys/bus/pci on host")
    sysClassNet           = flag.String("path.sysclassnet", "/sys/class/net/", "Path to sys/class/net on host")
    netlinkEnabled        = flag.Bool("collector.netlink", true, "Enable or disable use of netlink for VF stats collection in favor of driver specific collectors.")
    totalVfFile           = "sriov_totalvfs"
    pfNameFile            = "/net"
    netClassFile          = "/class"
    driverFile            = "/driver"
    netClass        int64 = 0x020000
    vfStatSubsystem       = "vf"
    sriovDev              = "vfstats"
    sriovPFs              = make([]string, 0)
)

It seems that on ConnectX-6, 0x020000 is the PCI class of the Ethernet adapters only, while the InfiniBand adapters report class 0x020700.
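For illustration, here is a minimal standalone sketch (not the exporter's actual code; the helper name is made up) of how the sysfs class check could accept both the Ethernet (0x020000) and InfiniBand (0x020700) PCI classes instead of a single netClass value:

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strconv"
    "strings"
)

// Allowed PCI device classes:
// 0x020000 = Ethernet controller, 0x020700 = InfiniBand controller.
var allowedNetClasses = map[int64]bool{
    0x020000: true,
    0x020700: true,
}

// isSupportedNetDevice reads /sys/bus/pci/devices/<addr>/class and reports
// whether the device class is one of the allowed network classes.
func isSupportedNetDevice(sysBusPci, pciAddr string) (bool, error) {
    raw, err := os.ReadFile(filepath.Join(sysBusPci, pciAddr, "class"))
    if err != nil {
        return false, err
    }
    // The file contains a value like "0x020700\n"; base 0 handles the 0x prefix.
    class, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 0, 64)
    if err != nil {
        return false, err
    }
    return allowedNetClasses[class], nil
}

func main() {
    ok, err := isSupportedNetDevice("/sys/bus/pci/devices", "0000:c5:00.0")
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println("supported:", ok)
}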

Here is my environment; ibs* are the InfiniBand interfaces and ens* are the Ethernet interfaces.

$ mst status -v
.....
PCI devices:
------------
DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA  
ConnectX6(rev:0)        /dev/mst/mt4123_pciconf5      c5:00.0   mlx5_2          net-ibs20                 5     
ConnectX6(rev:0)        /dev/mst/mt4123_pciconf4.1    b0:00.1   mlx5_6          net-ibs21f1               6     
ConnectX6(rev:0)        /dev/mst/mt4123_pciconf4      b0:00.0   mlx5_5          net-ens21f0np0            6     
ConnectX6(rev:0)        /dev/mst/mt4123_pciconf3.1    af:00.1   mlx5_4          net-ibs22f1               6     
ConnectX6(rev:0)        /dev/mst/mt4123_pciconf3      af:00.0   mlx5_3          net-ens22f0np0            6     
ConnectX6(rev:0)        /dev/mst/mt4123_pciconf2      85:00.0   mlx5_7          net-ibs19                 7     
ConnectX6(rev:0)        /dev/mst/mt4123_pciconf1      45:00.0   mlx5_0          net-ibs18                 1     
ConnectX6(rev:0)        /dev/mst/mt4123_pciconf0      0e:00.0   mlx5_1          net-ibs17                 3     

The ens* devices have class 0x020000 when I check:

$ cat /sys/bus/pci/devices/0000\:b0\:00.0/class -> for net-ens21f0np0
0x020000

$ cat /sys/bus/pci/devices/0000\:af\:00.0/class -> for net-ens22f0np0
0x020000

but all the InfiniBand adapters have class 0x020700:

$ cat /sys/bus/pci/devices/0000\:c5\:00.0/class -> for net-ibs20
0x020700

$ cat /sys/bus/pci/devices/0000\:b0\:00.1/class -> for net-ibs21f1
0x020700

... and so on

So I changed netClass from 0x020000 to 0x020700, and sriov-network-metrics-exporter could then find all the IB PFs and VFs. Before the change, the sriov-metrics-exporter pod only picked up the Ethernet adapters:

2022/12/13 06:15:08 The kubepoddevice collector is enabled
2022/12/13 06:15:08 The vfstats collector is enabled
2022/12/13 06:15:08 listening on :9808
2022/12/13 06:15:26 using netlink for ens22f0np0
2022/12/13 06:15:26 PerPF called for ens22f0np0
2022/12/13 06:15:26 using netlink for ens21f0np0
2022/12/13 06:15:26 PerPF called for ens21f0np0
2022/12/13 06:15:56 using netlink for ens22f0np0
2022/12/13 06:15:56 PerPF called for ens22f0np0
2022/12/13 06:15:56 using netlink for ens21f0np0
2022/12/13 06:15:56 PerPF called for ens21f0np0
...

After the change, the pod picks up the IB adapters:

2022/12/13 07:39:38 The vfstats collector is enabled
2022/12/13 07:39:38 The kubepoddevice collector is enabled
2022/12/13 07:39:38 listening on :9808
2022/12/13 07:39:39 using netlink for ibs21f1
2022/12/13 07:39:39 PerPF called for ibs21f1
2022/12/13 07:39:39 using netlink for ibs20
2022/12/13 07:39:39 PerPF called for ibs20
2022/12/13 07:39:39 using netlink for ibs17
2022/12/13 07:39:39 PerPF called for ibs17
2022/12/13 07:39:39 using netlink for ibs18
2022/12/13 07:39:39 PerPF called for ibs18
2022/12/13 07:39:39 using netlink for ibs19
2022/12/13 07:39:39 PerPF called for ibs19
2022/12/13 07:39:39 using netlink for ibs22f1
2022/12/13 07:39:39 PerPF called for ibs22f1
...

But all metrics except sriov_kubepoddevice show 0 in Prometheus, even after I attached all the VFs to each of two pods and ran ib_send_bw between them.

# HELP sriov_vf_tx_packets Statistic tx_packets.
# TYPE sriov_vf_tx_packets counter
sriov_vf_tx_packets{numa_node="1",pciAddr="0000:45:00.1",pf="ibs18",vf="0"} 0
sriov_vf_tx_packets{numa_node="1",pciAddr="0000:45:00.2",pf="ibs18",vf="1"} 0
sriov_vf_tx_packets{numa_node="1",pciAddr="0000:45:00.3",pf="ibs18",vf="2"} 0
sriov_vf_tx_packets{numa_node="1",pciAddr="0000:45:00.4",pf="ibs18",vf="3"} 0
sriov_vf_tx_packets{numa_node="1",pciAddr="0000:45:00.5",pf="ibs18",vf="4"} 0
sriov_vf_tx_packets{numa_node="1",pciAddr="0000:45:00.6",pf="ibs18",vf="5"} 0
sriov_vf_tx_packets{numa_node="1",pciAddr="0000:45:00.7",pf="ibs18",vf="6"} 0
sriov_vf_tx_packets{numa_node="1",pciAddr="0000:45:01.0",pf="ibs18",vf="7"} 0
sriov_vf_tx_packets{numa_node="3",pciAddr="0000:0e:00.1",pf="ibs17",vf="0"} 0
sriov_vf_tx_packets{numa_node="3",pciAddr="0000:0e:00.2",pf="ibs17",vf="1"} 0
sriov_vf_tx_packets{numa_node="3",pciAddr="0000:0e:00.3",pf="ibs17",vf="2"} 0
sriov_vf_tx_packets{numa_node="3",pciAddr="0000:0e:00.4",pf="ibs17",vf="3"} 0
sriov_vf_tx_packets{numa_node="3",pciAddr="0000:0e:00.5",pf="ibs17",vf="4"} 0
sriov_vf_tx_packets{numa_node="3",pciAddr="0000:0e:00.6",pf="ibs17",vf="5"} 0
sriov_vf_tx_packets{numa_node="3",pciAddr="0000:0e:00.7",pf="ibs17",vf="6"} 0
sriov_vf_tx_packets{numa_node="3",pciAddr="0000:0e:01.0",pf="ibs17",vf="7"} 0
...

I think these PFs are not recognized by the current pkg/vfstats/netlink.go.
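To double-check this outside the exporter, here is a minimal standalone sketch using the vishvananda/netlink library (assuming a recent version that exposes the per-VF stats fields; this is not the exporter's actual code path) to dump the kernel-reported VF counters for one of the IB PFs above. Note that RDMA traffic such as ib_send_bw may bypass the netdev counters entirely, so zeros here would not necessarily be an exporter bug:

package main

import (
    "fmt"
    "log"

    "github.com/vishvananda/netlink"
)

func main() {
    // One of the IB PFs from the environment above; adjust as needed.
    pfName := "ibs18"

    link, err := netlink.LinkByName(pfName)
    if err != nil {
        log.Fatalf("failed to look up %s: %v", pfName, err)
    }

    // LinkAttrs.Vfs is filled from the kernel's per-VF netlink attributes;
    // if the driver does not report VF stats, these counters stay at zero.
    for _, vf := range link.Attrs().Vfs {
        fmt.Printf("vf %d: tx_packets=%d rx_packets=%d tx_bytes=%d rx_bytes=%d\n",
            vf.ID, vf.TxPackets, vf.RxPackets, vf.TxBytes, vf.RxBytes)
    }
}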

So, is ConnectX-6 supported by the current sriov-network-metrics-exporter? If not, is there any plan to support ConnectX-6 later?

Thanks!

SchSeba commented 1 year ago

@Eoghan1232 I think right now it's not supported, but it should be with the new implementation you are working on, right?

eoghanlawless commented 1 year ago

Hey @jslouisyou,

The current version does not officially support Mellanox InfiniBand interfaces, though with the Netlink collector enabled and your device class change, it might work.

We are planning to support Mellanox cards with the latest Mellanox EN and OFED drivers.

jslouisyou commented 1 year ago

Thanks @eoghanlawless. Are there any specific plans or a release date?

eoghanlawless commented 1 year ago

We have a few changes coming soon, but haven't looked at implementing Mellanox InfiniBand support just yet.

The next release should be in the new year, and the following release should include InfiniBand support.

jslouisyou commented 1 year ago

@eoghanlawless Hello, I recently saw that #24 (the new 1.0 version) is going to be merged soon. Does it include Mellanox InfiniBand (e.g. ConnectX-6) support?

Thanks.

Eoghan1232 commented 1 year ago

Hi @jslouisyou - the 1.0 version focuses on common functionality across vendors and prioritizes Ethernet as the common use case; Mellanox-specific stats for InfiniBand are outside the current scope.

Intel does not currently support InfiniBand and has no way to validate its functionality.

The exporter provides an extensible interface for others to contribute collectors, which could include InfiniBand support.
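As a rough illustration of that extensibility (a generic prometheus/client_golang sketch, not the exporter's actual collector interface; the collector name and values are placeholders), a vendor-specific collector could be wired in like any other Prometheus collector:

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// ibVfCollector is a hypothetical collector that would read InfiniBand
// per-VF counters (e.g. from sysfs) and expose them to Prometheus.
type ibVfCollector struct {
    txPackets *prometheus.Desc
}

func newIbVfCollector() *ibVfCollector {
    return &ibVfCollector{
        txPackets: prometheus.NewDesc(
            "sriov_vf_tx_packets",
            "Statistic tx_packets.",
            []string{"pf", "vf", "pciAddr", "numa_node"},
            nil,
        ),
    }
}

func (c *ibVfCollector) Describe(ch chan<- *prometheus.Desc) {
    ch <- c.txPackets
}

func (c *ibVfCollector) Collect(ch chan<- prometheus.Metric) {
    // Placeholder value; a real implementation would read the counters
    // from the InfiniBand driver or sysfs here.
    ch <- prometheus.MustNewConstMetric(
        c.txPackets, prometheus.CounterValue, 0,
        "ibs18", "0", "0000:45:00.1", "1",
    )
}

func main() {
    prometheus.MustRegister(newIbVfCollector())
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9808", nil))
}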