guilbaults / infiniband-exporter

Prometheus exporter for an InfiniBand fabric
Apache License 2.0

SymbolErrorCounter with positive value not exported #20

Closed gabrieleiannetti closed 3 years ago

gabrieleiannetti commented 3 years ago

Hi,

It looks like the symbolerrorcounter metric is not exported when its value is positive.

A short example:

  1. Run ibqueryerrors and check its output for SymbolErrorCounter errors:
$ ibqueryerrors > ibqueryerrors.txt

$ grep "SymbolErrorCounter" ibqueryerrors.txt

GUID 0xc42a10300dcf8e2 port 1: [SymbolErrorCounter == 3] [PortXmitWait == 12097916]
GUID 0xc42a10300dd0bea port 1: [SymbolErrorCounter == 4] [PortXmitWait == 224462059]
  2. Run the exporter, fetch the exported metrics, and check them for the errors:
$ ./infiniband-exporter.py --verbose 2>exporter_stderr.txt

$ wget localhost:9683/metrics

$ grep "symbolerrorcounter" metrics | cut -d "}" -f 2 | sort | uniq -c
   2558  0.0
      1 # HELP infiniband_symbolerrorcounter_total Total number of minor link errors detected on one or more physical lanes.
      1 # TYPE infiniband_symbolerrorcounter_total counter

$ grep "0xc42a10300dcf8e2" metrics 
$ grep "0xc42a10300dd0bea" metrics 
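
The value tally from the `grep | cut | sort | uniq -c` pipeline in step 2 can also be approximated in Python once the metrics file is on disk. This is only a minimal sketch and not part of the exporter: the function name `tally_values` and the `example_label` label in the sample data are made up for illustration. Unlike the shell pipeline, it skips the `# HELP` and `# TYPE` comment lines.

```python
from collections import Counter


def tally_values(metrics_text, metric_substring):
    """Count how often each sample value occurs on metric lines containing the substring."""
    counts = Counter()
    for line in metrics_text.splitlines():
        # Skip '# HELP' / '# TYPE' comment lines and unrelated metrics.
        if line.startswith("#") or metric_substring not in line:
            continue
        # The sample value is whatever follows the closing brace of the label set.
        value = line.rsplit("}", 1)[-1].strip()
        counts[value] += 1
    return counts


# Made-up sample in the Prometheus text format; the label name is hypothetical.
sample = """\
# TYPE infiniband_symbolerrorcounter_total counter
infiniband_symbolerrorcounter_total{example_label="a"} 0.0
infiniband_symbolerrorcounter_total{example_label="b"} 0.0
infiniband_symbolerrorcounter_total{example_label="c"} 3.0
"""

print(tally_values(sample, "symbolerrorcounter"))
```

A result that contains only the key `0.0` corresponds to the situation described in this report.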

As you can see, no positive values are exported for the symbolerrorcounter metric, and neither of the two GUIDs appears in the metrics at all.

For completeness I have added the redirected messages to stderr from the exporter:
exporter_stderr.txt

The GUIDs from above are not listed in the exporter_stderr.txt.

I would have expected the metrics for these GUIDs to be exported.
The PortXmitWait metric for them is therefore missing as well.
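
The manual comparison above can be automated with a short script. The following is a rough sketch under stated assumptions: the two regular expressions are hypothetical patterns written for the `GUID ... port N: [Name == value]` lines quoted in this issue (they are not the exporter's own regex), and the helper names `parse_errors` and `missing_guids` as well as the `example_label` label are invented for illustration.

```python
import re

# Hypothetical patterns for the lines quoted in this issue; the exporter's regex may differ.
LINE_RE = re.compile(r"^GUID (0x[0-9a-fA-F]+) port (\d+): (.*)$")
COUNTER_RE = re.compile(r"\[(\w+) == (\d+)\]")


def parse_errors(text):
    """Return {(guid, port): {counter_name: value}} for every 'GUID ... port N:' line."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            guid, port, rest = match.groups()
            result[(guid, int(port))] = {
                name: int(value) for name, value in COUNTER_RE.findall(rest)
            }
    return result


def missing_guids(errors, metrics_text):
    """List GUIDs with a positive SymbolErrorCounter that never appear in the metrics text."""
    return sorted({
        guid
        for (guid, _port), counters in errors.items()
        if counters.get("SymbolErrorCounter", 0) > 0 and guid not in metrics_text
    })


# The two ibqueryerrors lines quoted in this issue.
errors_sample = """\
GUID 0xc42a10300dcf8e2 port 1: [SymbolErrorCounter == 3] [PortXmitWait == 12097916]
GUID 0xc42a10300dd0bea port 1: [SymbolErrorCounter == 4] [PortXmitWait == 224462059]
"""
# Made-up metrics line; the label name and value are hypothetical.
metrics_sample = 'infiniband_symbolerrorcounter_total{example_label="other"} 0.0\n'

print(missing_guids(parse_errors(errors_sample), metrics_sample))
```

With real files, the two sample strings would be replaced by the contents of `ibqueryerrors.txt` and the downloaded `metrics` file.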

Can you please verify?

If I am not mistaken, we should also check whether other metrics are missing from the export.

I would like to test the exporter with a local metrics file, but I cannot get that working. I will create a separate issue for it.

Best
Gabriele

guilbaults commented 3 years ago

On my production system, the error counters are exported correctly. The output of your ibqueryerrors may differ in a way that the existing regex does not handle. I'm using CentOS 7 with infiniband-diags-2.1.0-1.el7.x86_64 to provide ibqueryerrors.
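
One way to test the regex hypothesis is to list the input lines that a given pattern fails to match. The sketch below is an illustration only: `EXPECTED` is a hypothetical pattern, not the exporter's actual regex, and `unmatched_lines` is an invented helper name.

```python
import re

# Hypothetical pattern for illustration only; it is NOT the exporter's actual regex.
EXPECTED = re.compile(r"^GUID 0x[0-9a-fA-F]+ port \d+: (\[\w+ == \d+\]\s*)+$")


def unmatched_lines(text, pattern=EXPECTED):
    """Return (line_number, line) pairs for non-empty lines the pattern does not match."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if line.strip() and not pattern.match(line.strip())
    ]


# First line is quoted from this issue; the second is a made-up non-matching line.
sample = """\
GUID 0xc42a10300dcf8e2 port 1: [SymbolErrorCounter == 3] [PortXmitWait == 12097916]
a line in some other layout
"""

for number, line in unmatched_lines(sample):
    print(f"line {number} not matched: {line}")
```

Running such a check against the full ibqueryerrors output with the pattern the exporter really uses would show quickly whether a format difference is involved.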

I will check #21 in a few days and probably add a dump of our system to make it easier to run automated tests on a known input.

[Screenshots: Screen Shot 2021-04-13 at 10 05 14 AM, Screen Shot 2021-04-13 at 10 06 18 AM]

gabrieleiannetti commented 3 years ago

Interesting, we are also using ibqueryerrors version 2.1.0 on CentOS 7:

yum list installed | grep infiniband-diags
infiniband-diags.x86_64        2.1.0-1.el7            @anaconda   

ibqueryerrors --version
ibqueryerrors BUILD VERSION: 2.1.0 Build date: Aug  9 2019 14:06:43

gabrieleiannetti commented 3 years ago

As figured out in https://github.com/guilbaults/infiniband-exporter/issues/21, the ibqueryerrors program needs to be executed with the proper parameters.

With those, the exporter does export the SymbolErrorCounter errors.

This short file works for a quick test: ibqueryerrors_short.txt