treydock / eseries_exporter


BUG - InstanceDown #17

Closed · SckyzO closed this issue 3 years ago

SckyzO commented 3 years ago

Hello @treydock,

I have a new issue with your exporter. Today a disk failed on one of my E-Series arrays, and since then collection no longer works; I don't know why.

I get this message:

An error has occurred while serving metrics:

12 error(s) occurred:
* [from Gatherer #1] collected metric "eseries_drive_status" { label:<name:"slot" value:"33" > label:<name:"status" value:"optimal" > label:<name:"tray" value:"1" > gauge:<value:0 > } was collected before with the same name and label values
* [from Gatherer #1] collected metric "eseries_drive_status" { label:<name:"slot" value:"33" > label:<name:"status" value:"failed" > label:<name:"tray" value:"1" > gauge:<value:0 > } was collected before with the same name and label values
* [from Gatherer #1] collected metric "eseries_drive_status" { label:<name:"slot" value:"33" > label:<name:"status" value:"replaced" > label:<name:"tray" value:"1" > gauge:<value:0 > } was collected before with the same name and label values
* [from Gatherer #1] collected metric "eseries_drive_status" { label:<name:"slot" value:"33" > label:<name:"status" value:"bypassed" > label:<name:"tray" value:"1" > gauge:<value:0 > } was collected before with the same name and label values
* [from Gatherer #1] collected metric "eseries_drive_status" { label:<name:"slot" value:"33" > label:<name:"status" value:"unresponsive" > label:<name:"tray" value:"1" > gauge:<value:1 > } was collected before with the same name and label values
* [from Gatherer #1] collected metric "eseries_drive_status" { label:<name:"slot" value:"33" > label:<name:"status" value:"removed" > label:<name:"tray" value:"1" > gauge:<value:0 > } was collected before with the same name and label values
* [from Gatherer #1] collected metric "eseries_drive_status" { label:<name:"slot" value:"33" > label:<name:"status" value:"incompatible" > label:<name:"tray" value:"1" > gauge:<value:0 > } was collected before with the same name and label values
* [from Gatherer #1] collected metric "eseries_drive_status" { label:<name:"slot" value:"33" > label:<name:"status" value:"dataRelocation" > label:<name:"tray" value:"1" > gauge:<value:0 > } was collected before with the same name and label values
* [from Gatherer #1] collected metric "eseries_drive_status" { label:<name:"slot" value:"33" > label:<name:"status" value:"preFailCopy" > label:<name:"tray" value:"1" > gauge:<value:0 > } was collected before with the same name and label values
* [from Gatherer #1] collected metric "eseries_drive_status" { label:<name:"slot" value:"33" > label:<name:"status" value:"preFailCopyPending" > label:<name:"tray" value:"1" > gauge:<value:0 > } was collected before with the same name and label values
* [from Gatherer #1] collected metric "eseries_drive_status" { label:<name:"slot" value:"33" > label:<name:"status" value:"__UNDEFINED" > label:<name:"tray" value:"1" > gauge:<value:0 > } was collected before with the same name and label values
* [from Gatherer #1] collected metric "eseries_drive_status" { label:<name:"slot" value:"33" > label:<name:"status" value:"unknown" > label:<name:"tray" value:"1" > gauge:<value:0 > } was collected before with the same name and label values

For information, I can still access the API:

curl --noproxy "*" -X GET -u user:password "http://localhost:8081/devmgr/v2/storage-systems/eseries_id1" | jq . 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2556  100  2556    0     0   832k      0 --:--:-- --:--:-- --:--:--  832k
{
  "id": "eseries_id1",
  "name": "da21",
  "wwn": "xxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "passwordStatus": "valid",
  "passwordSet": true,
  "status": "needsAttn",
  "certificateStatus": "trusted",
  "ip1": "xxx.xxx.xxx.xxx",
  "ip2": "xxx.xxx.xxx.xxx",
  "managementPaths": [
    "xxx.xxx.xxx.xxx",
    "xxx.xxx.xxx.xxx"
  ],
  "controllers": [
    {
      "controllerId": "070000000000000000000001",
      "ipAddresses": [
        "xxx.xxx.xxx.xxx"
      ],
      "certificateStatus": "trusted"
    },
    {
      "controllerId": "070000000000000000000002",
      "ipAddresses": [
        "xxx.xxx.xxx.xxx"
      ],
      "certificateStatus": "trusted"
    }
[...]

I tried with eseries_exporter 1.1.0 and 1.2.0; same result.

Regards,

treydock commented 3 years ago

Based on the errors, it looks like the E-Series API is reporting duplicate disks for slot 33. Try something like this and share the output:

# curl -u user:password http://localhost:8080/devmgr/v2/storage-systems/eseries_id1/hardware-inventory 2>/dev/null| jq -r '.drives[] | "SLOT: \(.physicalLocation.slot)\tTRAY: \(.physicalLocation.trayRef)"' | grep "SLOT: 33"
SLOT: 33        TRAY: 0E50080E520BA7D0000000000000000000000000
SLOT: 33        TRAY: 0E00000000000000000000000000000000000000
SLOT: 33        TRAY: 0E50080E5209C1A0000000000000000000000000

I am curious if maybe you have duplicate trays for the same slot number or something.
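
To see whether any tray/slot pair is reported more than once (not just slot 33), something like this should work against the same hardware-inventory endpoint (adjust host, port, and credentials to your setup); every line it prints is a pair that appears more than once:

# curl -s -u user:password http://localhost:8080/devmgr/v2/storage-systems/eseries_id1/hardware-inventory \
    | jq -r '.drives[] | "\(.physicalLocation.trayRef) \(.physicalLocation.slot)"' \
    | sort | uniq -d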

SckyzO commented 3 years ago

Wow, perfect! Now I see my problem: I have a drive that is hung :) I need to replace it quickly, and I think things will work better after that :)

ESERIES_01

SLOT: 33        TRAY: 0E500A09800E36E5880000000000000000000000
SLOT: 33        TRAY: 0E500A09800E3757D10000000000000000000000
SLOT: 33        TRAY: 0E500A09800E33F4470000000000000000000000
SLOT: 33        TRAY: 0E500A09800E36E5A20000000000000000000000
SLOT: 33        TRAY: 0E500A09800E3759F60000000000000000000000
SLOT: 33        TRAY: 0E500A09800E3759FA0000000000000000000000
SLOT: 33        TRAY: 0E00000000000000000000000000000000000000
SLOT: 33        TRAY: 0E500A09800E3156800000000000000000000000
SLOT: 33        TRAY: 0E500A09800E3156800000000000000000000000
ESERIES_02

SLOT: 33        TRAY: 0E500A09800E33F60D0000000000000000000000
SLOT: 33        TRAY: 0E500A09800E39D8E60000000000000000000000
SLOT: 33        TRAY: 0E500A09800E33F5120000000000000000000000
SLOT: 33        TRAY: 0E500A09800E3A369E0000000000000000000000
SLOT: 33        TRAY: 0E00000000000000000000000000000000000000
SLOT: 33        TRAY: 0E500A09800E39F14F0000000000000000000000
SLOT: 33        TRAY: 0E500A09800E39D9590000000000000000000000
SLOT: 33        TRAY: 0E500A09800E39F10B0000000000000000000000

Same configuration on both, and as you can see, I have 8 entries for slot 33 on the healthy E-Series and 9 on the one that is in error... I don't know why.

treydock commented 3 years ago

That is very strange, but for ESERIES_01 the last 2 entries appear to be duplicates, which would explain the errors you saw. I might be able to code a solution that avoids the duplicates and just logs an error instead; the exporter would then still return all metrics rather than appearing to be down.

Out of my own curiosity, and because it might help me code a workaround, what does the same curl command produce with this jq filter?

jq -r '.drives[] | "SLOT: \(.physicalLocation.slot)\tTRAY: \(.physicalLocation.trayRef)\tSTATUS: \(.status)"' | grep "SLOT: 33"

I get something like this:

SLOT: 33        TRAY: 0E50080E520BA7D0000000000000000000000000  STATUS: optimal
SLOT: 33        TRAY: 0E00000000000000000000000000000000000000  STATUS: optimal
SLOT: 33        TRAY: 0E50080E5209C1A0000000000000000000000000  STATUS: optimal

I am curious what TRAY 0E500A09800E3156800000000000000000000000 produces for the 2 entries.

SckyzO commented 3 years ago

Yes, I know... I think it's a bug (release 11.70). I will open a case on Monday. For now, I have disabled alerting for this E-Series.

Thank you for your help.

Regards

treydock commented 3 years ago

I pushed v1.2.1, which skips creating metrics for duplicate drives. Instead, you will get the eseries_exporter_collect_error metric set to 1 and an error-level log message.
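
Roughly, the effect is the same as de-duplicating on (trayRef, slot) before building metrics. As a quick illustration with jq (not the exporter's actual code; same endpoint and credentials as above), the following prints the total drive count and the count after keeping one entry per tray/slot pair; if the two numbers differ, duplicates are present and the error metric will be set:

# curl -s -u user:password http://localhost:8080/devmgr/v2/storage-systems/eseries_id1/hardware-inventory \
    | jq '.drives | length, (unique_by([.physicalLocation.trayRef, .physicalLocation.slot]) | length)'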

SckyzO commented 3 years ago

Hello @treydock

Thank you very much, the update is working perfectly. I have opened a case with NetApp, but at least I can monitor my systems in the meantime :)

For information, I can share my Grafana dashboard if you want:

[dashboard screenshots]

Used with the Status Panel Grafana plugin.