jenningsloy318 / redfish_exporter

exporter to get metrics from redfish based hardware such as lenovo/dell/superc servers
Apache License 2.0
70 stars 61 forks source link

fix pci devices duplicates #82

Open GregWhiteyBialas opened 5 months ago

GregWhiteyBialas commented 5 months ago

Hi, on some systems scraping fails and crashes exporter because of duplicated PCI devices in redfish output.

This patch aims to workaround this.

stmcginnis commented 5 months ago

Formatted to make it a little easier to read:

PCIeDevices:
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-7
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-4
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-8
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-7
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-2
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-8
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-1
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-3
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-0
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/160-8
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/160-7
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/160-7
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/160-3
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/160-8
GregWhiteyBialas commented 5 months ago

@stmcginnis Thanks!

I shouldn't be so laconic in description of this PR. So I am correcting myself. I will truncate all outputs for redability:

Curl of curl -s -k -u $REDFISH_USER:$REDFISH_PASS https://$REDFISH_HOST/redfish/v1/Systems/System.Embedded.1/ gives a lot of output in which we can find:

"PCIeDevices": [
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0"

and:

"PCIeFunctions": [
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-1"
    },

As you can see PCIeDevices/129-0 is listed twice without any data that allow us to differentiate in PCIeDevices but it is easy distinguishable in PCIeFunctions

Curl-ing this endpoint gives:

$ curl -s -k -u $REDFISH_USER:$REDFISH_PASS https://$REDFISH_HOST/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0| jq .
{
  "@odata.context": "/redfish/v1/$metadata#PCIeDevice.PCIeDevice",
  "@odata.etag": "1705495544",
  "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0",
  "@odata.type": "#PCIeDevice.v1_4_0.PCIeDevice",
  "Assembly": {
    "@odata.id": "/redfish/v1/Chassis/System.Embedded.1/Assembly"
  },
  "AssetTag": null,
  "Description": "BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller",
  "DeviceType": "MultiFunction",
  "FirmwareVersion": "21.65.33.33",
  "Id": "129-0",
  "Links": {
    "Chassis": [
      {
        "@odata.id": "/redfish/v1/Chassis/System.Embedded.1"
      }
    ],
    "Chassis@odata.count": 1,
    "PCIeFunctions": [
      {
        "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-0"
      },
      {
        "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-1"
      }
    ],
    "PCIeFunctions@odata.count": 2
  },
  "Manufacturer": "Broadcom Inc. and subsidiaries",
  "Model": null,
  "Name": "BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller",
  "PartNumber": "03T135449",
  "SKU": null,
  "SerialNumber": "WSEDERP01TX",
  "Status": {
    "Health": "OK",
    "HealthRollup": "OK",
    "State": "Enabled"
  }
}

So we see that this device has two functions. Which can be accessed by curl:

$ curl -s -k -u $REDFISH_USER:$REDFISH_PASS https://$REDFISH_HOST/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-1 | jq .
{
  "@odata.context": "/redfish/v1/$metadata#PCIeFunction.PCIeFunction",
  "@odata.etag": "1705676",
  "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-1",
  "@odata.type": "#PCIeFunction.v1_2_3.PCIeFunction",
  "ClassCode": "0x000002",
  "Description": "BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller",
  "DeviceClass": "NetworkController",
  "DeviceId": "0x16d8",
  "FunctionId": 1,
  "FunctionType": "Physical",
  "Id": "129-0-1",
  "Links": {
    "Drives": [],
    "Drives@odata.count": 0,
    "EthernetInterfaces": [
      {
        "@odata.id": "/redfish/v1/Systems/System.Embedded.1/EthernetInterfaces/NIC.Slot.5-2-1"
      }
    ],
    "EthernetInterfaces@odata.count": 1,
    "PCIeDevice": {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0"
    },
    "StorageControllers": [],
    "StorageControllers@odata.count": 0
  },
  "Name": "BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller",
  "Oem": {
    "Dell": {
      "@odata.type": "#DellOem.v1_2_0.DellOemResources",
      "DellPCIeFunction": {
        "@odata.context": "/redfish/v1/$metadata#DellPCIeFunction.DellPCIeFunction",
        "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-1/Oem/Dell/DellPCIeFunctions/NIC.Slot.5-2-1",
        "@odata.type": "#DellPCIeFunction.v1_3_0.DellPCIeFunction",
        "DataBusWidth": "16XOrX16",
        "Description": "An instance of DellPCIeFunction will have PCI device specific data.",
        "Id": "NIC.Slot.5-2-1",
        "LastSystemInventoryTime": "2024-01-17T12:45:34+00:00",
        "LastUpdateTime": "2023-01-24T10:02:15+00:00",
        "Name": "DellPCIeFunction",
        "SlotLength": "LongLength",
        "SlotType": "PCIExpressGen4"
      }
    }
  },
  "RevisionId": "0x01",
  "Status": {
    "Health": "OK",
    "HealthRollup": "OK",
    "State": "Enabled"
  },
  "SubsystemId": "0x4161",
  "SubsystemVendorId": "0x14e4",
  "VendorId": "0x14e4"
}
$ curl -s -k -u $REDFISH_USER:$REDFISH_PASS https://$REDFISH_HOST/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-0 | jq .
{
  "@odata.context": "/redfish/v1/$metadata#PCIeFunction.PCIeFunction",
  "@odata.etag": "17046744",
  "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-0",
  "@odata.type": "#PCIeFunction.v1_2_3.PCIeFunction",
  "ClassCode": "0x000002",
  "Description": "BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller",
  "DeviceClass": "NetworkController",
  "DeviceId": "0x16d8",
  "FunctionId": 0,
  "FunctionType": "Physical",
  "Id": "129-0-0",
  "Links": {
    "Drives": [],
    "Drives@odata.count": 0,
    "EthernetInterfaces": [
      {
        "@odata.id": "/redfish/v1/Systems/System.Embedded.1/EthernetInterfaces/NIC.Slot.5-1-1"
      }
    ],
    "EthernetInterfaces@odata.count": 1,
    "PCIeDevice": {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0"
    },
    "StorageControllers": [],
    "StorageControllers@odata.count": 0
  },
  "Name": "BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller",
  "Oem": {
    "Dell": {
      "@odata.type": "#DellOem.v1_2_0.DellOemResources",
      "DellPCIeFunction": {
        "@odata.context": "/redfish/v1/$metadata#DellPCIeFunction.DellPCIeFunction",
        "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-0/Oem/Dell/DellPCIeFunctions/NIC.Slot.5-1-1",
        "@odata.type": "#DellPCIeFunction.v1_3_0.DellPCIeFunction",
        "DataBusWidth": "16XOrX16",
        "Description": "An instance of DellPCIeFunction will have PCI device specific data.",
        "Id": "NIC.Slot.5-1-1",
        "LastSystemInventoryTime": "2024-01-17T12:45:34+00:00",
        "LastUpdateTime": "2023-01-24T10:02:15+00:00",
        "Name": "DellPCIeFunction",
        "SlotLength": "LongLength",
        "SlotType": "PCIExpressGen4"
      }
    }
  },
  "RevisionId": "0x01",
  "Status": {
    "Health": "OK",
    "HealthRollup": "OK",
    "State": "Enabled"
  },
  "SubsystemId": "0x4161",
  "SubsystemVendorId": "0x14e4",
  "VendorId": "0x14e4"
}

I see in code, that list of PCI devices isn't used to get PCI devices functions, which is pulled from other API endpoint, so this fix shouldn't have side effect in PCIFunction metrics not scraped.

I am not sure if this is bug in this version of firmware or specific hardware configuration which results in duplicated items.

stmcginnis commented 5 months ago

I think I've seen this bug in a vendors firmware implementation. It would be good to raise it with your vendor so they can fix it. Redfish definitely should not be reporting the same device twice.

Still good if there is a way to make handling of that condition more robust though.

GregWhiteyBialas commented 5 months ago

@jenningsloy318 is it possible to move this forward somehow?