k8snetworkplumbingwg / sriov-network-operator

Operator for provisioning and configuring SR-IOV CNI plugin and device plugin
Apache License 2.0

[metrics 5/x] Add node label to sriov_* metrics #774

Closed zeeke closed 1 month ago

zeeke commented 2 months ago

Two SR-IOV pods deployed on different nodes can end up using devices with the same PCI address. In that case, the query suggested by the sriov-network-metrics-exporter [1] fails with the following error:


Error loading values found duplicate series for the match group {pciAddr="0000:3b:02.4"} on the right hand-side of the operation:
    [
        {
            __name__="sriov_kubepoddevice",
            container="test",
            dev_type="openshift.io/intelnetdevice",
            endpoint="sriov-network-metrics",
            instance="10.1.98.60:9110",
            job="sriov-network-metrics-exporter-service",
            namespace="cnf-4916",
            pciAddr="0000:3b:02.4",
            pod="pod-cnfdr22.telco5g.eng.rdu2.redhat.com",
            prometheus="openshift-monitoring/k8s",
            service="sriov-network-metrics-exporter-service"
        }, {
            __name__="sriov_kubepoddevice",
            container="test",
            dev_type="openshift.io/intelnetdevice",
            endpoint="sriov-network-metrics",
            instance="10.1.98.230:9110",
            job="sriov-network-metrics-exporter-service",
            namespace="cnf-4916",
            pciAddr="0000:3b:02.4",
            pod="pod-dhcp-98-230.telco5g.eng.rdu2.redhat.com",
            prometheus="openshift-monitoring/k8s",
            service="sriov-network-metrics-exporter-service"
        }
    ];many-to-many matching not allowed: matching labels must be unique on one side
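
The condition behind this error can be reproduced outside Prometheus. The sketch below is only an illustration: it models each series as a plain dict of labels (a simplification, not Prometheus's internal representation), and `duplicate_match_groups` is a hypothetical helper name. It groups the right-hand-side series by the matching labels and reports every group holding more than one series.

```python
from collections import defaultdict

# The two sriov_kubepoddevice series from the error above,
# trimmed to the labels relevant to the match.
kubepoddevice = [
    {"pciAddr": "0000:3b:02.4", "instance": "10.1.98.60:9110",
     "pod": "pod-cnfdr22.telco5g.eng.rdu2.redhat.com", "namespace": "cnf-4916"},
    {"pciAddr": "0000:3b:02.4", "instance": "10.1.98.230:9110",
     "pod": "pod-dhcp-98-230.telco5g.eng.rdu2.redhat.com", "namespace": "cnf-4916"},
]


def duplicate_match_groups(series, on):
    """Return the match groups (keyed by the `on` label values) that
    contain more than one series. Hypothetical helper for illustration."""
    groups = defaultdict(list)
    for labels in series:
        key = tuple(labels.get(name, "") for name in on)
        groups[key].append(labels)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Matching on pciAddr alone puts both series in the same group.
dups = duplicate_match_groups(kubepoddevice, on=("pciAddr",))
for key, members in dups.items():
    print(key, "->", [m["pod"] for m in members])
```

Adding any label that differs between the two exporters to the `on` tuple (for example `instance`, which is present in the series above) leaves no duplicate groups.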

This PR configures the ServiceMonitor resource to add a node label to all metrics. With that label in place, the correct query, as updated in the PrometheusRule, becomes:

sriov_vf_tx_packets * on (pciAddr,node) group_left(pod,namespace,dev_type) sriov_kubepoddevice
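
To see why matching on both pciAddr and node resolves the ambiguity, here is a minimal Python sketch of a many-to-one join in the spirit of `on(...) group_left(...)`. It is a simplification, not Prometheus's implementation: `join_group_left` is a hypothetical helper, the node names `node-a` and `node-b` and the packet counts are made up, and it assumes `sriov_kubepoddevice` reports a constant value of 1.

```python
def join_group_left(left, right, on, extra):
    """Many-to-one join: each left sample picks up the `extra` labels from
    the single right series that shares its `on` labels, and the two
    values are multiplied. Raises if the right side is not unique."""
    index = {}
    for labels, value in right:
        key = tuple(labels[name] for name in on)
        if key in index:
            raise ValueError(f"duplicate series for match group {key}")
        index[key] = (labels, value)

    result = []
    for labels, value in left:
        key = tuple(labels[name] for name in on)
        if key not in index:
            continue  # left samples without a match are dropped
        r_labels, r_value = index[key]
        merged = dict(labels)
        merged.update({name: r_labels[name] for name in extra})
        result.append((merged, value * r_value))
    return result


# Hypothetical samples: same PCI address on two different nodes.
vf_tx_packets = [
    ({"pciAddr": "0000:3b:02.4", "node": "node-a"}, 100),
    ({"pciAddr": "0000:3b:02.4", "node": "node-b"}, 250),
]
kubepoddevice = [
    ({"pciAddr": "0000:3b:02.4", "node": "node-a", "namespace": "cnf-4916",
      "pod": "pod-cnfdr22.telco5g.eng.rdu2.redhat.com",
      "dev_type": "openshift.io/intelnetdevice"}, 1),
    ({"pciAddr": "0000:3b:02.4", "node": "node-b", "namespace": "cnf-4916",
      "pod": "pod-dhcp-98-230.telco5g.eng.rdu2.redhat.com",
      "dev_type": "openshift.io/intelnetdevice"}, 1),
]

joined = join_group_left(vf_tx_packets, kubepoddevice,
                         on=("pciAddr", "node"),
                         extra=("pod", "namespace", "dev_type"))
for labels, value in joined:
    print(labels["node"], labels["pod"], value)
```

Calling the same helper with `on=("pciAddr",)` raises the `ValueError`, which corresponds to the duplicate-series failure described above.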

Also remove the pod, namespace and container labels from the sriov_vf_* metrics, as they were wrongly set to the exporter's own values (sriov-network-metrics-exporter-zj2n9, openshift-sriov-network-operator and kube-rbac-proxy, respectively).

[1] https://github.com/k8snetworkplumbingwg/sriov-network-metrics-exporter/blob/0f6a784f377ede87b95f31e569116ceb9775b5b9/README.md?plain=1#L38

coveralls commented 2 months ago

Pull Request Test Coverage Report for Build 10980223379

Totals Coverage Status
Change from base Build 10979758277: 0.02%
Covered Lines: 6628
Relevant Lines: 14711

💛 - Coveralls

adrianchiris commented 2 months ago

@zeeke, can you rebase this one?