ibm-openbmc / dev

Product Development Project Mgmt and Tracking
16 stars 2 forks source link

PCIe Slot LocationIndicatorActive PATCH is not working correctly all the time #3631

Closed gtmills closed 1 year ago

gtmills commented 1 year ago

@jinuthomas FYI Internal defects are:

548223 545871 548225

Below is an Everest PATCH of all LEDs to true. You will notice not all go true. This has also been reported on Rainier. This has also been reported when moving the LocationIndicatorActive to false. This has been reported on the GUI.

curl -k  -X PATCH https://$bmc/redfish/v1/Chassis/chassis/PCIeSlots -d '{"Slots":[{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true},{"LocationIndicatorActive":true}]}'

curl -s -k https://$bmc/redfish/v1/Chassis/chassis/PCIeSlots | grep LocationIndicatorActive
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": true,
      "LocationIndicatorActive": false,

21 slots:

curl -k https://$bmc/redfish/v1/Chassis/chassis/PCIeSlots | grep LocationIndicatorActive | wc -l
21

Focus on the 21st

curl -k https://$bmc/redfish/v1/Chassis/chassis/PCIeSlots
{
  "[@odata.id](https://jazz07.rchland.ibm.com:13443/jazz/users/odata.id)": "/redfish/v1/Chassis/chassis/PCIeSlots",
  "[@odata.type](https://jazz07.rchland.ibm.com:13443/jazz/users/odata.type)": "#PCIeSlots.v1_5_0.PCIeSlots",
  "Id": "1",
  "Name": "PCIe Slot Information",
  "Slots": [
....   <- I cut out a bunch here.. to make this fit
    {
      "HotPluggable": false,
      "Lanes": 0,
      "Links": {
        "PCIeDevice": [
          {
            "[@odata.id](https://jazz07.rchland.ibm.com:13443/jazz/users/odata.id)": "/redfish/v1/Systems/system/PCIeDevices/chassis_motherboard_pcieslot8_pcie_card8"
          }
        ],
        "Processors": [
          {
            "[@odata.id](https://jazz07.rchland.ibm.com:13443/jazz/users/odata.id)": "/redfish/v1/Systems/system/Processors/dcm1-cpu1"
          }
        ],
        "[Processors@odata.count](mailto:Processors@odata.count)": 1
      },
      "Location": {
        "PartLocation": {
          "ServiceLabel": "U78D4.ND0.WZS0014-P0-C8"
        }
      },
      "LocationIndicatorActive": true,
      "Oem": {
        "[@odata.type](https://jazz07.rchland.ibm.com:13443/jazz/users/odata.type)": "#OemPCIeSlots.Oem",
        "IBM": {
          "[@odata.type](https://jazz07.rchland.ibm.com:13443/jazz/users/odata.type)": "#OemPCIeSlots.IBM",
          "LinkId": 0
        }
      },
      "SlotType": "FullLength"
    },
    {
      "HotPluggable": false,
      "Lanes": 0,
      "Links": {
        "PCIeDevice": [
          {
            "[@odata.id](https://jazz07.rchland.ibm.com:13443/jazz/users/odata.id)": "/redfish/v1/Systems/system/PCIeDevices/chassis_motherboard_pcieslot9_pcie_card9"
          }
        ],
        "Processors": [
          {
            "[@odata.id](https://jazz07.rchland.ibm.com:13443/jazz/users/odata.id)": "/redfish/v1/Systems/system/Processors/dcm1-cpu0"
          }
        ],
        "[Processors@odata.count](mailto:Processors@odata.count)": 1
      },
      "Location": {
        "PartLocation": {
          "ServiceLabel": "U78D4.ND0.WZS0014-P0-C9"
        }
      },
      "LocationIndicatorActive": false,   <--- Focus on this one, the 21st
      "Oem": {
        "[@odata.type](https://jazz07.rchland.ibm.com:13443/jazz/users/odata.type)": "#OemPCIeSlots.Oem",
        "IBM": {
          "[@odata.type](https://jazz07.rchland.ibm.com:13443/jazz/users/odata.type)": "#OemPCIeSlots.IBM",
          "LinkId": 0
        }
      },
      "SlotType": "FullLength"
    }
  ]
}
curl -k  -X PATCH https://$bmc/redfish/v1/Chassis/chassis/PCIeSlots -d '{"Slots":[{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{"LocationIndicatorActive":true}]}'
curl -s -k https://$bmc/redfish/v1/Chassis/chassis/PCIeSlots | grep -A 8 -B 10 WZS0014-P0-C9
        ],
        "Processors": [
          {
            "[@odata.id](https://jazz07.rchland.ibm.com:13443/jazz/users/odata.id)": "/redfish/v1/Systems/system/Processors/dcm1-cpu0"
          }
        ],
        "[Processors@odata.count](mailto:Processors@odata.count)": 1
      },
      "Location": {
        "PartLocation": {
          "ServiceLabel": "U78D4.ND0.WZS0014-P0-C9"
        }
      },
      "LocationIndicatorActive": false,
      "Oem": {
        "[@odata.type](https://jazz07.rchland.ibm.com:13443/jazz/users/odata.type)": "#OemPCIeSlots.Oem",
        "IBM": {
          "[@odata.type](https://jazz07.rchland.ibm.com:13443/jazz/users/odata.type)": "#OemPCIeSlots.IBM",
          "LinkId": 0
jinuthomas commented 1 year ago

https://github.com/ibm-openbmc/bmcweb/pull/543. has caused the regression. At least try out all the tests or have the SME approve the changes before it is merged. The location code indicator was tested for two slots only in upstream and I think the same for downstream if I am not wrong, the issue is that the PCI slots C0 and C1 works fine , C2 is missing and C3 onwards will set the previous slot led, means if I trigger C3 then C2 gets enabled and if I trigger C4 then C3 gets enabled and so on , so the last Slot cannot be enabled , and that is why in the curl command output you see that the last is not set , ideally the other set ones are also wrong.

jinuthomas commented 1 year ago

The ObjPath that is got from GUI and passed to the set location indicator is the issue here.

ChicagoDuan commented 1 year ago

Hi @jinuthomas ,
My discord id is "Chicago Duan#3383". Can we communicate through discord? I am unable to reproduce this bug at the moment, but I have some ideas about it and I need your help with some testing.

ChicagoDuan commented 1 year ago

Can you do the following two tests?

  1. Use "busctl set-propert” to set LocationIndicatorActive to true for the 21st PCIe slot and check if it was successfully set to true.
  2. Move "index++" from line 281(https://github.com/ibm-openbmc/bmcweb/blob/1050/redfish-core/lib/pcie_slots.hpp#L281) to line 321. Please See:
                for (const auto& endpoint : endpoints)
                {
                    if (endpoint == validChassisPath)
                    {
                        index++;
                        updatePcieSlotsMaps[index] = path;
                    }
                }  

Then do the same test again to see if this bug will still appear.

jinuthomas commented 1 year ago

How was it initially tested when the upstream patch was implemented. @baemyung since you had pulled this from upstream and tested it earlier, could you please help @ChicagoDuan to fix the issue.

baemyung commented 1 year ago

Isn't this issue related to 1030 PR#363 -- which seems determined as not needed?

a4990c6d66 Santosh Puranik Fix PCIe Slot Count Check (#363)

https://ibm.ent.box.com/file/1108642405190?s=meet9iczbqeuv7cmkrw0bl3ac2w8me1h

a4990c6d66 Alpana Kumari Ravindra Fix PCIe Slot Count Check (#363) This commit is for slot count correction, and 1050 code handled it different way so this correction is not required there.

baemyung commented 1 year ago

It looks like that the pcie_slots code (for patch) has some issues. I'll investigate further and work on it.

baemyung commented 1 year ago

There are a few issues in the code.

1) Dealing with slotmaps. std::map<unsigned int, std::string> updatePcieSlotsMaps{};

// Global variable
std::map<unsigned int, std::string> updatePcieSlotsMaps{};
static void checkPCIeSlotsCount(...)
{
    dbus::utility::getSubTreePaths(...[](...) {
…
        unsigned int index = 0; 
        unsigned int count = 0;
        auto slotNum = pcieSlotsPaths.size();
        for (const auto& path : pcieSlotsPaths)
        {
            index++;
           dbus::utility::getAssociationEndPoints(....
                [count{++count}, slotNum,index](...) {
A)              for (const auto& endpoint : endpoints)
                {
                    if (endpoint == validChassisPath)
                    {
                        updatePcieSlotsMaps[index] = path; <==
                    }
                }
B)
                if (count == slotNum)
                {
                    // Last time DBus has returned
                    if (updatePcieSlotsMaps.size() == total)   <=== This may not be consitent.
                    {
                        callback(updatePcieSlotsMaps);
                    }

2) The comparison of ‘count’ and ‘slotNum’ seems incorrect.

However, ‘slotNum’ may not be the same as the input slot count. It is because, not all pcieslots in GetSubtree is valid.

Pcieslot12 is not a valid slot for this purpose in “association definitions”. As the result, the logic needs to filter out the invalid ones.

$ busctl get-property \
 xyz.openbmc_project.Inventory.Manager \
  /xyz/openbmc_project/inventory/system/chassis/motherboard/pcieslot12 \
 xyz.openbmc_project.Association.Definitions Associations
Failed to get property Associations on interface xyz.openbmc_project.Association.Definitions: Unknown interface xyz.openbmc_project.Association.Definitions or property Associations.

On the other hand, the other pcieslot gives the good result

$ busctl get-property \
  xyz.openbmc_project.Inventory.Manager \
  /xyz/openbmc_project/inventory/system/chassis/motherboard/pcieslot10 \
  xyz.openbmc_project.Association.Definitions Associations
a(sss) 4 "fault_identifying" "fault_identified_by" "/xyz/openbmc_project/led/groups/pcieslot10_fault" "identifying" "identified_by" "/xyz/openbmc_project/led/groups/pcieslot10_identify" "chassis" "inventory" "/xyz/openbmc_project/inventory/system/chassis" "upstream_processor" "pcie_slot" "/xyz/openbmc_project/inventory/system/chassis/motherboard/dcm0/cpu0"

In fact, this pcieslot12 is at the index of 21st in the slots. That’s why we’re seeing the issue at 21st entry – esp on rainer systems.

3) In addition, a few more weak things;

TODO:

This can be addressed by getting the valid pcieSlots first, and then validate it. Once it is validated, we can perform PATCH.

This approach will make the code cleaner and easier to read.

baemyung commented 1 year ago

I'll work on this.

jinuthomas commented 1 year ago

Thanks @baemyung this was the issue seen , thanks for working on this, not sure why we merge things in without testing it.

baemyung commented 1 year ago

Thanks @baemyung this was the issue seen , thanks for working on this, not sure why we merge things in without testing it.

It was tested (though it may be a limited test) as a part of https://github.com/ibm-openbmc/bmcweb/pull/543 but apparently the logic may not give the consistent result.

baemyung commented 1 year ago

PR https://github.com/ibm-openbmc/bmcweb/pull/712 is created to resolve this issue.

gtmills commented 1 year ago

Fixed by https://github.com/ibm-openbmc/bmcweb/pull/712

lxwinspur commented 1 year ago

@gtmills Close this issue?

gtmills commented 1 year ago

https://github.com/ibm-openbmc/bmcweb/pull/712 fixed. Closing