[metrics 5/x] Add node label to sriov_* metrics

zeeke commented 2 months ago

It might happen that two SR-IOV pods, deployed on different node, are using devices with the same PCI address. In such cases, the query suggested [1] by the sriov-network-metrics-exporter produces the error:


Error loading values found duplicate series for the match group {pciAddr="0000:3b:02.4"} on the right hand-side of the operation:
    [
        {
            __name__="sriov_kubepoddevice",
            container="test",
            dev_type="openshift.io/intelnetdevice",
            endpoint="sriov-network-metrics",
            instance="10.1.98.60:9110",
            job="sriov-network-metrics-exporter-service",
            namespace="cnf-4916",
            pciAddr="0000:3b:02.4",
            pod="pod-cnfdr22.telco5g.eng.rdu2.redhat.com",
            prometheus="openshift-monitoring/k8s",
            service="sriov-network-metrics-exporter-service"
        }, {
            __name__="sriov_kubepoddevice",
            container="test",
            dev_type="openshift.io/intelnetdevice",
            endpoint="sriov-network-metrics",
            instance="10.1.98.230:9110",
            job="sriov-network-metrics-exporter-service",
            namespace="cnf-4916",
            pciAddr="0000:3b:02.4",
            pod="pod-dhcp-98-230.telco5g.eng.rdu2.redhat.com",
            prometheus="openshift-monitoring/k8s",
            service="sriov-network-metrics-exporter-service"
        }
    ];many-to-many matching not allowed: matching labels must be unique on one side

Configure the ServiceMonitor resource to add a node label to all metrics. The right query to get metrics, as updated in the PrometheusRule, will be:

sriov_vf_tx_packets * on (pciAddr,node) group_left(pod,namespace,dev_type) sriov_kubepoddevice

Also remove pod, namespace and container label from the sriov_vf_* metrics, as they were wrongly set to sriov-network-metrics-exporter-zj2n9, openshift-sriov-network-operator, kube-rbac-proxy

[1] https://github.com/k8snetworkplumbingwg/sriov-network-metrics-exporter/blob/0f6a784f377ede87b95f31e569116ceb9775b5b9/README.md?plain=1#L38

github-actions[bot] commented 2 months ago

Thanks for your PR, To run vendors CIs, Maintainers can use one of:

/test-all: To run all tests for all vendors.
/test-e2e-all: To run all E2E tests for all vendors.
/test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs, Maintainers can use one of:
/skip-all: To skip all tests for all vendors.
/skip-e2e-all: To skip all E2E tests for all vendors.
/skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor. Best regards.

coveralls commented 2 months ago

Pull Request Test Coverage Report for Build 10980223379

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.02%) to 45.055%

Totals
Change from base Build 10979758277:	0.02%
Covered Lines:	6628
Relevant Lines:	14711

💛 - Coveralls

adrianchiris commented 2 months ago

@zeeke can you rebase this one ?

k8snetworkplumbingwg / sriov-network-operator