guilbaults / infiniband-exporter

Prometheus exporter for an InfiniBand fabric
Apache License 2.0

query_smp.c:199; Connection timed out #12

Open mglants opened 3 years ago

mglants commented 3 years ago

After commit c6eef51246a91ccf81f2364c45ca7dfc9d266fc7 I got:

src/query_smp.c:199; umad (DR path slid 0; dlid 0; 0,1,10,19 Attr 0x11:0) bad status 110; Connection timed out

and the exporter reports infiniband_scrape_ok 0.0.

mglants commented 3 years ago

@guilbaults

guilbaults commented 3 years ago

Could you manually run ibqueryerrors --verbose --details --suppress-common --data --report-port --switch to see if your host can access the counters on the fabric? From the error message, it looks like the error happens in ibqueryerrors and not in the parsing done by the Python script.

mglants commented 3 years ago


The first line is the error; the data after it is OK. What is the cause of the problem?

src/query_smp.c:199; umad (DR path slid 0; dlid 0; 0,1,10,19 Attr 0x11:0) bad status 110; Connection timed out
Errors for 0xb8599f0300383ec0 "ibswh2 [255-B05] HDR-40"
   GUID 0xb8599f0300383ec0 port ALL: [SymbolErrorCounter == 19817 (19.353K)] [LinkDownedCounter == 37 (37.000)] [PortXmitDiscards == 8 (8.000)] [PortLocalPhysicalErrors == 8] [PortXmitWait == 2417380683166 (2.199T)] [PortXmitData == 5044493407069892 (17.922PB)] [PortRcvData == 5044493406602707 (17.922PB)] [PortXmitPkts == 5101051455966 (4.639T)] [PortRcvPkts == 5101051441855 (4.639T)] [PortUnicastXmitPkts == 5101050102526 (4.639T)] [PortUnicastRcvPkts == 5101050102542 (4.639T)] [PortMulticastXmitPkts == 29380 (28.691K)] [PortMulticastRcvPkts == 15253 (14.896K)]
   GUID 0xb8599f0300383ec0 port 1: [LinkDownedCounter == 1 (1.000)] [PortXmitDiscards == 3 (3.000)] [PortLocalPhysicalErrors == 3] [PortXmitWait == 102856180328 (95.792G)] [PortXmitData == 225495513282581 (820.348TB)] [PortRcvData == 154914464326291 (563.576TB)] [PortXmitPkts == 229378238279 (213.625G)] [PortRcvPkts == 154134492149 (143.549G)] [PortUnicastXmitPkts == 229378226543 (213.625G)] [PortUnicastRcvPkts == 154134491212 (143.549G)] [PortMulticastXmitPkts == 11736 (11.461K)] [PortMulticastRcvPkts == 937 (937.000)]

Also, the output before commit c6eef51246a91ccf81f2364c45ca7dfc9d266fc7 was working just fine.

guilbaults commented 3 years ago

It's weird since c6eef51 did not change anything in the parsing; it simply captures stderr, and since you have something in stderr, it aborts the collection cycle.
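
For reference, the behaviour from c6eef51 boils down to something like this simplified sketch (illustrative only, not the actual exporter code; the function name is made up and the flags are the ones from the ibqueryerrors command mentioned earlier):

import logging
import subprocess

def run_ibqueryerrors():
    # Run ibqueryerrors and capture both output streams.
    process = subprocess.Popen(
        ['ibqueryerrors', '--verbose', '--details', '--suppress-common',
         '--data', '--report-port', '--switch'],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    if stderr:
        # Anything on stderr aborts the collection cycle,
        # which is why infiniband_scrape_ok ends up at 0.
        logging.error(stderr.decode())
        return None
    return stdout.decode()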

I don't yet have access to an HDR switch to verify (I will in a few months), but it seems there is a problem in ibqueryerrors since it only seems to show your local port. Similar errors can happen when the tools in the infiniband-diags RPM are not launched as root, since some packets (umad packets) need to be sent as root. You can try ibhosts and ibswitches; they should list everything connected in your network.

From what I know, two other things could block umad packets: running in a VM with SR-IOV, and having management keys installed on the fabric/subnet manager.

mglants commented 3 years ago

Found something on the internet: Suspect #2: Unresponsive node/s issue

Look for nodes unresponsive to fabric MADs. Nodes can get into this situation if there is any issue with the OS, driver or card firmware. Once identified, it is recommended that the unresponsive nodes not participate in any job in the fabric.

If there are any unresponsive nodes in the fabric, we can find them by invoking one of the direct path commands such as iblinkinfo, ibnetdiscover, ibswitches, ibhosts, ibnodes, etc.

Run one of the direct path commands: iblinkinfo/ibnetdiscover/ibswitches/ibhosts/ibnodes. If there are unresponsive nodes in the fabric, you will get one "Connection timed out" line per unresponsive node at the start of the command output, with the specific direct path to the node.

Example:


root # ibnetdiscover

src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,18 Attr 0xff90:2) bad status 110; Connection timed out

src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,17 Attr 0xff90:2) bad status 110; Connection timed out

Topology file: generated on Mon Mar  2 17:19:19 2016

Initiated from node f4521403008b9a30 port f4521403008b9a31

Identify the unresponsive node/s: from the same node where the direct path command was invoked, run: smpquery nd -D <direct_path_without_last_number>

Example: for direct path "0,1,18" invoke: "smpquery nd -D 0,1"

The unresponsive device is connected to the device output in the last step, on the port given by the last number in the direct path.

Example: for direct path "0,1,18", the unresponsive device will be connected to port 18
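
To make that last step concrete, here is a tiny helper (hypothetical, not part of the exporter) that splits a direct path into the parent path to query with smpquery and the port the unresponsive device hangs off:

def split_direct_path(path):
    # "0,1,18" -> ("0,1", 18): run `smpquery nd -D 0,1` and the
    # unresponsive device is behind port 18 of the device it reports.
    hops = path.split(',')
    return ','.join(hops[:-1]), int(hops[-1])

print(split_direct_path("0,1,18"))  # ('0,1', 18)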

gabrieleiannetti commented 3 years ago

Could you manually run ibqueryerrors --verbose --details --suppress-common --data --report-port --switch to see if your host can access the counters on the fabric?

Yes, I can confirm that. We are also seeing errors caused by ibqueryerrors.

Exporter Output:

2021-04-09 14:07:48,156 - DEBUG - Start of collection cycle
2021-04-09 14:08:17,175 - ERROR - src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,9,15 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,5,14 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,9,29 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,13 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,22 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,23 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,29 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,32 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,13,32 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,14,34 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,15,2 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,15,11 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,18,2 Attr 0x11:0) bad status 110; Connection timed out

STDERR output from ibqueryerrors:

src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,9,15 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,5,14 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,9,29 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,13 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,22 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,23 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,29 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,32 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,13,32 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,19,13,14,34 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,19,13,15,2 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,19,13,15,11 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,19,13,18,2 Attr 0x11:0) bad status 110; Connection timed out

I would say that the exporter works as designed, with the return-on-error behaviour implemented in https://github.com/guilbaults/infiniband-exporter/commit/c6eef51246a91ccf81f2364c45ca7dfc9d266fc7.

Previously, running the exporter before commit https://github.com/guilbaults/infiniband-exporter/commit/c6eef51246a91ccf81f2364c45ca7dfc9d266fc7, we did not see any errors in the metrics exported by the exporter. But I doubt that there are no errors in the fabric, since calling ibqueryerrors directly returns the errors shown above.

guilbaults commented 3 years ago

When that error is printed on STDERR, are there any useful counters on STDOUT? If some counters are collected correctly, then the script should probably not abort, but still report the error in the logs and in the exported stats.
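
Something like this non-aborting variant (just a rough sketch of the idea; the function name is illustrative):

import logging
import subprocess

def run_ibqueryerrors_tolerant(args):
    # Log whatever shows up on stderr, but still hand the counters
    # on stdout to the parser instead of aborting the cycle.
    process = subprocess.Popen(args, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    if stderr:
        logging.error(stderr.decode())
    return stdout.decode()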

gabrieleiannetti commented 3 years ago

Well, I am not familiar with the output yet, but it looks like it is full of errors.

I have attached the output with the host names anonymized as XXX.

ibqueryerrors_nohostnames.txt

As I mentioned before, when the exporter was running without quitting on error, we did not see any positive values in the exported metrics.

I checked every metric. There were a lot of metrics exported with 0 values, and I could identify the nodes and switches in the fabric.

But the exporter should export those errors, right?

gabrieleiannetti commented 3 years ago
2021-04-09 14:07:48,156 - DEBUG - Start of collection cycle
2021-04-09 14:08:17,175 - ERROR - src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,9,15 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,5,14 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,9,29 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,13 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,22 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,23 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,29 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,9,32 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,17,5,13,32 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,14,34 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,15,2 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,15,11 Attr 0x11:0) bad status 110; Connection timed out
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,18,2 Attr 0x11:0) bad status 110; Connection timed out

We have identified those 13 "bad status 110; Connection timed out" errors as machines that are still in the fabric but are switched off. I think the ports of the NICs still have power.

gabrieleiannetti commented 3 years ago

Identify the unresponsive node/s: from the same node where the direct path command was invoked, run: smpquery nd -D <direct_path_without_last_number>

Example: for direct path "0,1,18" invoke: "smpquery nd -D 0,1"

Hi @MGlants, can you verify that the machines are up and running or are they switched off?

guilbaults commented 3 years ago


PortXmitWait only indicates some congestion; most of them seem to be at their maximum values and need to be reset, but it's not really an error. The collector can reset the counters as needed with --can-reset-counter.

The most interesting errors are:

SymbolErrorCounter -> problem with a cable
PortXmitDiscards -> too much congestion; the lossless feature of IB was not working and some packets were dropped
LinkDownedCounter -> can happen with a bad cable that flaps between up/down
PortRcvRemotePhysicalErrors -> problem with a cable

Having some switched-off or rebooting servers is normal. I might remove the early abort from c6eef51 if ibqueryerrors can correctly collect the other counters. I have updated my production server with the latest release; I will see if the problem happens often on our fabric.

mglants commented 3 years ago


Can we also include information about ports in the down state? In my case it was because we shut down servers for maintenance, and those errors include a path to the problematic port.

gabrieleiannetti commented 3 years ago


I would like to propose that the exporter exports a metric for the error "bad status 110; Connection timed out".

It would also be helpful to collect such errors from the fabric.

Thanks for sharing the most interesting errors.

gabrieleiannetti commented 3 years ago


Identify the unresponsive node/s: from the same node where the direct path command was invoked, run: smpquery nd -D <direct_path_without_last_number>

Example: for direct path "0,1,18" invoke: "smpquery nd -D 0,1"

The unresponsive device is connected to the device output in the last step, on the port given by the last number in the direct path.

Example: for direct path "0,1,18", the unresponsive device will be connected to port 18

Just another approach, which gives a good overview of the ports of the switch:

iblinkinfo --node-name-map ibswitches_GUID-Name -D <direct_path_without_last_number>

In your example you should see at port 18 a line that will probably end with something like "Port not available".

gabrieleiannetti commented 3 years ago

@guilbaults

Hi, I would like to contribute a patch for processing an error from STDERR...

I have tried to check which errors can be returned by ibqueryerrors, but it is not clear to me.

My approach would be to check the return code first...

https://docs.oracle.com/cd/E88353_01/html/E72487/ibqueryerrors-8.html

EXIT STATUS
       -1 if scan fails.

       0 if scan succeeds without errors beyond thresholds

       1 if errors are found beyond thresholds or inconsistencies are found in
       check mode.

If the return code is -1, the exporter skips the cycle and sets the metric infiniband_scrape_ok=0.
Otherwise it continues; errors from stderr should always be checked then.
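
A rough sketch of that control flow (simplified; the function and variable names are just placeholders, and stderr is assumed to be decoded text):

def check_ibqueryerrors(returncode, stderr):
    # Documented exit status -1 means the scan failed: skip the cycle.
    # (A C exit(-1) usually shows up as 255 through subprocess.)
    if returncode in (-1, 255):
        return 0, []  # infiniband_scrape_ok = 0, nothing to parse
    # Exit status 0 or 1: keep the counters, but still collect the
    # "bad status" lines from stderr for reporting.
    bad_status_lines = [line for line in stderr.splitlines()
                        if 'bad status' in line]
    return 1, bad_status_lines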

I would first implement processing of a "bad status" error.

As I figured out from the source where this error comes from (https://github.com/linux-rdma/infiniband-diags/blob/b48b4a630f54438ba2b529f40fab99c1abba3763/libibnetdisc/src/query_smp.c#L189), it follows this scheme:

    if ((status = umad_status(umad))) {
        IBND_ERROR("umad (%s Attr 0x%x:%u) bad status %d; %s\n",
               portid2str(&smp->path), smp->rpc.attr.id,
               smp->rpc.attr.mod, status, strerror(status));
        if (smp->rpc.attr.id == IB_ATTR_MLNX_EXT_PORT_INFO)
            rc = mlnx_ext_port_info_err(engine, smp, mad,
                            smp->cb_data);
    } else if ((status = mad_get_field(mad, 0, IB_DRSMP_STATUS_F))) {
        IBND_ERROR("mad (%s Attr 0x%x:%u) bad status 0x%x\n",
               portid2str(&smp->path), smp->rpc.attr.id,
               smp->rpc.attr.mod, status);
        if (smp->rpc.attr.id == IB_ATTR_MLNX_EXT_PORT_INFO)
            rc = mlnx_ext_port_info_err(engine, smp, mad,
                            smp->cb_data);
    }

So I would suggest adding a new metric infiniband_bad_status... (data type?) that should include labels such as the DR path, the status code, and the error text.

As an example of how it would look with this data:

src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,1 Attr 0xff90:1) bad status 110; Connection timed out
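
A sketch of what exporting that could look like with prometheus_client (the metric and label names are only a suggestion, not what the exporter currently does):

from prometheus_client.core import GaugeMetricFamily

def bad_status_metric(errors):
    # errors: list of (path, status, error) tuples parsed from stderr
    family = GaugeMetricFamily(
        'infiniband_bad_status',
        'bad status errors reported by ibqueryerrors on stderr',
        labels=['path', 'status', 'error'])
    for path, status, error in errors:
        family.add_metric([path, status, error], 1)
    return family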

If the exporter detects an unknown error that is not a bad status, it could log a warning and signal that something is wrong by setting infiniband_scrape_ok=0.

What do you think? Should we go that way?

gabrieleiannetti commented 3 years ago

I have built an example of how to parse and process a bad status error from STDERR:

import re

# Matches lines like:
#   src/query_smp.c:195; umad (DR path ...) bad status 110; Connection timed out
# and captures the DR path, the numeric status and the error text.
pattern = r"src/query_smp\.c:\d+; (?:mad|umad) \((DR path .*) Attr .*\) bad status (\d+); (.*)"

string = "src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1,13,22,13,18,2 Attr 0x11:0) bad status 110; Connection timed out"

prog = re.compile(pattern)
result = prog.match(string)

if result:
    path = result.group(1)    # "DR path slid 0; dlid 0; 0,1,13,22,13,18,2"
    status = result.group(2)  # "110"
    error = result.group(3)   # "Connection timed out"

else:
    pass  # not a "bad status" line, handle as unknown error...

I think this should do the job.

I will probably create a pull request in the next few days...