jenningsloy318 / redfish_exporter

exporter to get metrics from redfish based hardware such as lenovo/dell/superc servers
Apache License 2.0

collected metric "redfish_system_pcie_function_state" ... was collected before with the same name and label values on PERC H730 Mini #71

Open fschlich opened 11 months ago

fschlich commented 11 months ago

We have a few older Dell systems that have a PERC H730 Mini integrated RAID controller. On these systems, redfish_exporter (latest git: e28371ddb) throws a fatal error, while it used to work ok prior to the collection of more detailed PCIe metrics:

An error has occurred while serving metrics:

2 error(s) occurred:
* [from Gatherer #2] collected metric "redfish_system_pcie_function_state" { label:<name:"hostname" value:"" > label:<name:"pci_function_deviceclass" value:"UnclassifiedDevice" > label:<name:"pci_function_type" value:"Physical" > label:<name:"pcie_function_id" value:"0-0-0" > label:<name:"pcie_function_name" value:"PERC H730 Mini" > label:<name:"resource" value:"pcie_function" > gauge:<value:1 > } was collected before with the same name and label values
* [from Gatherer #2] collected metric "redfish_system_pcie_function_health_state" { label:<name:"hostname" value:"" > label:<name:"pci_function_deviceclass" value:"UnclassifiedDevice" > label:<name:"pci_function_type" value:"Physical" > label:<name:"pcie_function_id" value:"0-0-0" > label:<name:"pcie_function_name" value:"PERC H730 Mini" > label:<name:"resource" value:"pcie_function" > gauge:<value:1 > } was collected before with the same name and label values

I suspect these adapters don't report a "state" the way the exporter expects. This is the data from /redfish/v1/Systems/System.Embedded.1/Storage/RAID.Integrated.1-1:


{
  "@odata.context": "/redfish/v1/$metadata#Storage.Storage",
  "@odata.id": "/redfish/v1/Systems/System.Embedded.1/Storage/RAID.Integrated.1-1",
  "@odata.type": "#Storage.v1_4_0.Storage",
  "Description": "PERC H730 Mini",
  "Drives": [
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/Storage/Drives/Disk.Bay.0:Enclosure.Internal.0-1:RAID.Integrated.1-1"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/Storage/Drives/Disk.Bay.1:Enclosure.Internal.0-1:RAID.Integrated.1-1"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/Storage/Drives/Disk.Bay.2:Enclosure.Internal.0-1:RAID.Integrated.1-1"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/Storage/Drives/Disk.Bay.3:Enclosure.Internal.0-1:RAID.Integrated.1-1"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/Storage/Drives/Disk.Bay.4:Enclosure.Internal.0-1:RAID.Integrated.1-1"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/Storage/Drives/Disk.Bay.5:Enclosure.Internal.0-1:RAID.Integrated.1-1"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/Storage/Drives/Disk.Bay.6:Enclosure.Internal.0-1:RAID.Integrated.1-1"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/Storage/Drives/Disk.Bay.7:Enclosure.Internal.0-1:RAID.Integrated.1-1"
    }
  ],
  "Drives@odata.count": 8,
  "Id": "RAID.Integrated.1-1",
  "Links": {
    "Enclosures": [
      {
        "@odata.id": "/redfish/v1/Chassis/Enclosure.Internal.0-1:RAID.Integrated.1-1"
      },
      {
        "@odata.id": "/redfish/v1/Chassis/System.Embedded.1"
      }
    ],
    "Enclosures@odata.count": 2
  },
  "Name": "PERC H730 Mini",
  "Status": {
    "Health": "OK",
    "HealthRollup": "OK",
    "State": "Enabled"
  },
  "StorageControllers": [
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/StorageControllers/RAID.Integrated.1-1",
      "Assembly": {
        "@odata.id": "/redfish/v1/Chassis/System.Embedded.1/Assembly"
      },
      "FirmwareVersion": "25.5.6.0009",
      "Identifiers": [
        {
          "DurableName": "544A842006943000",
          "DurableNameFormat": "NAA"
        }
      ],
      "Links": {},
      "Manufacturer": "DELL",
      "MemberId": "RAID.Integrated.1-1",
      "Model": "PERC H730 Mini",
      "Name": "PERC H730 Mini",
      "SpeedGbps": 12,
      "Status": {
        "Health": "OK",
        "HealthRollup": "OK",
        "State": "Enabled"
      },
      "SupportedControllerProtocols": [
        "PCIe"
      ],
      "SupportedDeviceProtocols": [
        "SAS",
        "SATA"
      ]
    }
  ],
  "StorageControllers@odata.count": 1,
  "Volumes": {
    "@odata.id": "/redfish/v1/Systems/System.Embedded.1/Storage/RAID.Integrated.1-1/Volumes"
  }
}

jenningsloy318 commented 9 months ago

Hi, your error shows that it occurred while scraping pcie_function, but you didn't post that output; you only posted the Storage/RAID output. Can you please confirm?

fschlich commented 9 months ago

ok, so /redfish/v1/Systems/System.Embedded.1 has a few PCIeFunctions:

  "PCIeFunctions": [
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/130-0-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/130-0-1"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/9-0-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-0-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-23-4"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-29-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-31-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/2-0-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/1-0-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/1-0-1"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/1-0-2"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/1-0-3"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-26-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-49-2"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-3-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-28-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-28-7"
    }
  ],
  "PCIeFunctions@odata.count": 17,

and I read from the error message that it is 0-0-0 which we're interested in, so this is /redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-0-0:

{
  "@odata.context": "/redfish/v1/$metadata#PCIeFunction.PCIeFunction",
  "@odata.etag": "1693376981",
  "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-0-0",
  "@odata.type": "#PCIeFunction.v1_1_1.PCIeFunction",
  "ClassCode": "0x000006",
  "Description": "Xeon E7 v3/Xeon E5 v3/Core i7 DMI2",
  "DeviceClass": "Bridge",
  "DeviceId": "0x2f00",
  "FunctionId": 0,
  "FunctionType": "Physical",
  "Id": "0-0-0",
  "Links": {
    "Drives": [],
    "Drives@odata.count": 0,
    "EthernetInterfaces": [],
    "EthernetInterfaces@odata.count": 0,
    "PCIeDevice": {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevice/0-0"
    },
    "StorageControllers": [],
    "StorageControllers@odata.count": 0
  },
  "Name": "Xeon E7 v3/Xeon E5 v3/Core i7 DMI2",
  "RevisionId": "0x02",
  "Status": {
    "Health": "OK",
    "HealthRollup": "OK",
    "State": "Enabled"
  },
  "SubsystemId": "0x0000",
  "SubsystemVendorId": "0x8086",
  "VendorId": "0x8086"
}

Is that helpful? I'm happy to post more, please explain in detail what you might need

jenningsloy318 commented 9 months ago

Not exactly. Your error message is:

* [from Gatherer #2] collected metric "redfish_system_pcie_function_state" { label:<name:"hostname" value:"" > label:<name:"pci_function_deviceclass" value:"UnclassifiedDevice" > label:<name:"pci_function_type" value:"Physical" > label:<name:"pcie_function_id" value:"0-0-0" > label:<name:"pcie_function_name" value:"PERC H730 Mini" > label:<name:"resource" value:"pcie_function" > gauge:<value:1 > } was collected before with the same name and label values
* [from Gatherer #2] collected metric "redfish_system_pcie_function_health_state" { label:<name:"hostname" value:"" > label:<name:"pci_function_deviceclass" value:"UnclassifiedDevice" > label:<name:"pci_function_type" value:"Physical" > label:<name:"pcie_function_id" value:"0-0-0" > label:<name:"pcie_function_name" value:"PERC H730 Mini" > label:<name:"resource" value:"pcie_function" > gauge:<value:1 > } was collected before with the same name and label values

This means there must be some extra attribute that distinguishes these metrics, so please upload all the API responses that correspond exactly to the errors.

From your single PCIeFunction response, I can't tell which label I could add to differentiate them.
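To illustrate why the two samples in the error collide, here is a minimal Go sketch. It is a simplified model of the uniqueness rule (metric name plus the full label set must be unique within one scrape), not the Prometheus client's actual implementation; `seriesKey`, `sample`, and `findDuplicates` are hypothetical names introduced only for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sample is a hypothetical, simplified stand-in for one collected metric.
type sample struct {
	name   string
	labels map[string]string
}

// seriesKey builds an identity from the metric name and its label pairs in
// sorted order. Assumption: this only models the uniqueness rule behind the
// error message; it is not how the Prometheus client computes identity.
func seriesKey(name string, labels map[string]string) string {
	pairs := make([]string, 0, len(labels))
	for k, v := range labels {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, v))
	}
	sort.Strings(pairs)
	return name + "{" + strings.Join(pairs, ",") + "}"
}

// findDuplicates returns the key of every sample whose identity was already
// seen earlier in the same batch.
func findDuplicates(samples []sample) []string {
	seen := map[string]bool{}
	var dups []string
	for _, s := range samples {
		k := seriesKey(s.name, s.labels)
		if seen[k] {
			dups = append(dups, k)
			continue
		}
		seen[k] = true
	}
	return dups
}

func main() {
	// Label values taken from the error message in this issue.
	labels := map[string]string{
		"pcie_function_id":   "0-0-0",
		"pcie_function_name": "PERC H730 Mini",
		"resource":           "pcie_function",
	}
	samples := []sample{
		{"redfish_system_pcie_function_state", labels},
		{"redfish_system_pcie_function_state", labels},
	}
	for _, k := range findDuplicates(samples) {
		fmt.Println("duplicate series:", k)
	}
}
```

If every label value is identical, no amount of relabeling inside the exporter can separate the two samples; either a distinguishing label is needed or the second sample has to be dropped.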

fschlich commented 9 months ago

ok, so three weeks ago I was confused, because what I was seeing didn't match my memories and I had a hard time reproducing the original issue. Today I took some more time and a systematic approach, and I am now certain that some servers which displayed this issue no longer do. On those servers, we have done firmware updates, among other things updating the "PowerEdge Server BIOS" from version 2.15 to 2.17.

On several boxes that still have a 2.15 or 2.13 BIOS and display the error, the output of /redfish/v1/Systems/System.Embedded.1 actually looks different to what I wrote three weeks ago: As you can see below, the PCIeFunction/0-0-0 is listed twice, and I guess that's the reason the exporter is scraping it twice, and unsurprisingly finds the same data twice.

Given that this is fixed in current firmware versions, I'm not sure if you want to change the exporter to guard against duplicate IDs, or just write it off as Dell's problem and close this issue?
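For reference, a guard against duplicate IDs could look roughly like the Go sketch below. It is only an illustration under my own assumptions: `dedupeLinks` is a hypothetical helper, not a function from the exporter or from any Redfish library, and it works on plain `@odata.id` strings.

```go
package main

import "fmt"

// dedupeLinks returns the links with repeated entries removed, keeping the
// first occurrence of each and preserving the original order.
// Hypothetical helper for illustration; not part of redfish_exporter.
func dedupeLinks(links []string) []string {
	seen := make(map[string]struct{}, len(links))
	out := make([]string, 0, len(links))
	for _, l := range links {
		if _, ok := seen[l]; ok {
			continue // already scheduled for scraping, skip the repeat
		}
		seen[l] = struct{}{}
		out = append(out, l)
	}
	return out
}

func main() {
	// A shortened version of the PCIeFunctions list from an affected server.
	links := []string{
		"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/10-0-0",
		"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-0-0",
		"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-23-4",
		"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-0-0",
	}
	for _, l := range dedupeLinks(links) {
		fmt.Println(l)
	}
}
```

Running the list through such a filter before scraping would mean each function is fetched and reported once, regardless of how often the firmware lists it.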

$ curl 'https://..../redfish/v1/Systems/System.Embedded.1' | jq
...
  "PCIeFunctions": [
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/10-0-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-0-0"                      <==
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-23-4"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-29-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-31-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-0-0"                     <==
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/1-0-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/1-0-1"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/1-0-2"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/1-0-3"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-26-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-49-2"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-2-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-3-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-28-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-28-7"
    }
  ],
  "PCIeFunctions@odata.count": 16,

hanchao131415 commented 8 months ago

Accessing http://172.100.70.202:9610/redfish?target=172.100.70.52 in a browser gives:

An error has occurred while serving metrics:

8 error(s) occurred:

==========================================================================

I used Postman to request /redfish/v1/Systems/System.Embedded.1. The result is:

"PCIeDevices": [ { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/177-0" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/177-0" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/0-31" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/0-23" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/0-28" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/4-0" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/202-0" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/49-0" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/3-0" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/0-17" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/0-31" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/0-28" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/4-0" } ], "PCIeDevices@odata.count": 13,

The returned result contains the same @odata.id more than once, for example: "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/177-0"

===================================================================

My server is a Dell PowerEdge R750 with iDRAC9.
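The manual check described above (looking for a repeated `@odata.id` in the System resource) can be sketched in Go as follows. This is a diagnostic sketch under stated assumptions: `link`, `systemDoc`, `duplicateIDs`, and `reportDuplicates` are hypothetical names, and the struct covers only the two arrays discussed in this thread, not the full Redfish schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// link models a single Redfish reference object.
type link struct {
	ODataID string `json:"@odata.id"`
}

// systemDoc holds only the fields of the System resource that matter here.
type systemDoc struct {
	PCIeDevices   []link `json:"PCIeDevices"`
	PCIeFunctions []link `json:"PCIeFunctions"`
}

// duplicateIDs returns each @odata.id that occurs more than once, reported
// a single time, in order of its second occurrence.
func duplicateIDs(links []link) []string {
	count := map[string]int{}
	var dups []string
	for _, l := range links {
		count[l.ODataID]++
		if count[l.ODataID] == 2 {
			dups = append(dups, l.ODataID)
		}
	}
	return dups
}

// reportDuplicates parses a System resource body and lists the repeated
// references found in PCIeDevices and PCIeFunctions.
func reportDuplicates(body []byte) ([]string, error) {
	var doc systemDoc
	if err := json.Unmarshal(body, &doc); err != nil {
		return nil, err
	}
	return append(duplicateIDs(doc.PCIeDevices), duplicateIDs(doc.PCIeFunctions)...), nil
}

func main() {
	// A shortened excerpt of the PCIeDevices array posted above.
	body := []byte(`{
	  "PCIeDevices": [
	    {"@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/177-0"},
	    {"@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/177-0"},
	    {"@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/0-31"}
	  ]
	}`)
	dups, err := reportDuplicates(body)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, id := range dups {
		fmt.Println("listed more than once:", id)
	}
}
```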

fschlich commented 8 months ago

@hanchao131415 what is your BiosVersion value from /redfish/v1/Systems/System.Embedded.1? If it is less than 2.17.0, does the issue persist when you upgrade to the current server firmware?
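When checking many hosts, the "less than 2.17.0" comparison has to be numeric per component, since a plain string comparison would rank 2.9.0 above 2.17.0. A minimal Go sketch, assuming BiosVersion values are purely dotted integers as in the examples in this thread (`compareVersions` and `component` are hypothetical helpers):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// component returns the i-th numeric part of a split version, treating
// missing parts as 0 so that "2.17" and "2.17.0" compare as equal.
func component(parts []string, i int) (int, error) {
	if i >= len(parts) {
		return 0, nil
	}
	return strconv.Atoi(parts[i])
}

// compareVersions compares two dotted numeric versions and returns -1, 0,
// or 1. It returns an error if any component is not an integer.
func compareVersions(a, b string) (int, error) {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	n := len(as)
	if len(bs) > n {
		n = len(bs)
	}
	for i := 0; i < n; i++ {
		x, err := component(as, i)
		if err != nil {
			return 0, err
		}
		y, err := component(bs, i)
		if err != nil {
			return 0, err
		}
		if x != y {
			if x < y {
				return -1, nil
			}
			return 1, nil
		}
	}
	return 0, nil
}

func main() {
	// BiosVersion values mentioned in this thread, compared to 2.17.0.
	for _, v := range []string{"1.8.2", "2.9.0", "2.15", "2.18.1"} {
		c, err := compareVersions(v, "2.17.0")
		if err != nil {
			fmt.Println(v, "is not a dotted numeric version:", err)
			continue
		}
		fmt.Printf("%s vs 2.17.0: %d\n", v, c)
	}
}
```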

hanchao131415 commented 8 months ago

@fschlich

"AssetTag": "", "Bios": { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/Bios" }, "BiosVersion": "1.8.2",

==================================

My BIOS version is 1.8.2, and I have not upgraded it.

burdorff commented 6 months ago

"AssetTag":"","Bios":{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/Bios"},"BiosVersion":"2.9.0",

I see the issue here despite a bios of 2.9.0.

2 error(s) occurred:
* [from Gatherer #2] collected metric "redfish_system_pcie_function_state" { label:<name:"hostname" value:"--removed--" > label:<name:"pci_function_deviceclass" value:"UnclassifiedDevice" > label:<name:"pci_function_type" value:"Physical" > label:<name:"pcie_function_id" value:"0-0-0" > label:<name:"pcie_function_name" value:"PERC H710P Mini (for monolithics)" > label:<name:"resource" value:"pcie_function" > gauge:<value:1 > } was collected before with the same name and label values
* [from Gatherer #2] collected metric "redfish_system_pcie_function_health_state" { label:<name:"hostname" value:"--removed--" > label:<name:"pci_function_deviceclass" value:"UnclassifiedDevice" > label:<name:"pci_function_type" value:"Physical" > label:<name:"pcie_function_id" value:"0-0-0" > label:<name:"pcie_function_name" value:"PERC H710P Mini (for monolithics)" > label:<name:"resource" value:"pcie_function" > gauge:<value:1 > } was collected before with the same name and label values

~In my case it's possible that some examples (such as this one) have IDRAC7 (which still supports Redfish API).~ edit: confirmed on a 2.18.1 BIOS for IDRAC8

However the pcie_function 0-0-0 still appears twice despite the bios version: https://removed/redfish/v1/Systems/System.Embedded.1

"PCIeFunctions":[{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/6-0-0"},{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/2-0-0"},{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-0-0"},{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-29-0"},{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-31-0"},{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-31-2"},{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-0-0"},{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-1-0"},{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-28-4"},{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/8-0-0"},{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/8-0-1"},{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/8-0-2"},{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/8-0-3"},{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/2-0-1"},{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-26-0"},{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-3-0"},{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-28-0"},{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-28-7"}], "PCIeFunctions@odata.count":18,

burdorff commented 6 months ago

https://removed/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-0-0

{"@odata.context":"/redfish/v1/$metadata#PCIeFunction.PCIeFunction","@odata.etag":"1705552257","@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeFunction/0-0-0","@odata.type":"#PCIeFunction.v1_1_1.PCIeFunction","ClassCode":"0x000000","Description":"PERC H830 Adapter","DeviceClass":"UnclassifiedDevice","DeviceId":"0x005d","FunctionId":0,"FunctionType":"Physical","Id":"0-0-0","Links":{"Drives":[],"Drives@odata.count":0,"EthernetInterfaces":[],"EthernetInterfaces@odata.count":0,"PCIeDevice":{"@odata.id":"/redfish/v1/Systems/System.Embedded.1/PCIeDevice/0-0"},"StorageControllers":[],"StorageControllers@odata.count":0},"Name":"PERC H830 Adapter","RevisionId":"0x00","Status":{"Health":"OK","HealthRollup":"OK","State":"Enabled"},"SubsystemId":"0x1f41","SubsystemVendorId":"0x1028","VendorId":"0x1000"}

GregWhiteyBialas commented 5 months ago

Hi, I submitted a PR that works around this problem. Any feedback is welcome.