ibm-openbmc / openbmc

https://github.com
Other
19 stars 51 forks source link

1030.ips: The Health status of Powersupply shows Critical after AC #267

Closed lxwinspur closed 1 year ago

lxwinspur commented 1 year ago

On a 4U Ranier machine, after manual AC (4 power supplies are inserted in any order), every time the power supply that is inserted first will report a PSU_Kill fault

The following are the test steps

  1. Update BMC Fw or do a factory reset
  2. AC and Host does not power on
  3. check the Health status of the Power supply via GUI

Event Log: issues.tar.gz

image

lxwinspur commented 1 year ago

@mzipse @spinler Could you please take a look at this issue, thanks.

spinler commented 1 year ago

@lxwinspur Where is the PEL for the PSKILL error?

spinler commented 1 year ago

A few more questions:

  1. Where does plugging in the AC fall with the other steps you have listed?
  2. Can you provide the journal
  3. Does the status turn green again when you power on?
lxwinspur commented 1 year ago

@spinler https://github.com/ibm-openbmc/phosphor-power/blob/1030/phosphor-power-supply/power_supply.cpp#L1154 I trace here and found the hasPSKillFault() method returns true

And I checked /sys/kernel/debug/pmbus/hwmon8/status0_mfr file and the read value is 0x10, Other files(hwmon7/status0_mfr, hwmon9/status0_mfr, etc) are all 0

lxwinspur commented 1 year ago

Where does plugging in the AC fall with the other steps you have listed?

No, that's just my test steps, and we tested IBM's Fw, same issue.

Can you provide the journal

I just added a bit of printing information, and then checked the hasPSKillFault method through journal to return true

From afedb1e843ead7cfb86be39a21c08620f11bd31f Mon Sep 17 00:00:00 2001
From: George Liu <liuxiwei@inspur.com>
Date: Thu, 16 Feb 2023 08:51:11 +0800
Subject: [PATCH] Trace PS_Kill fault

Signed-off-by: George Liu <liuxiwei@inspur.com>
---
 phosphor-power-supply/power_supply.cpp | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/phosphor-power-supply/power_supply.cpp b/phosphor-power-supply/power_supply.cpp
index 8321a8a..39460dd 100644
--- a/phosphor-power-supply/power_supply.cpp
+++ b/phosphor-power-supply/power_supply.cpp
@@ -13,6 +13,7 @@
 #include <cmath>
 #include <cstdint> // uint8_t...
 #include <fstream>
+#include <iostream>
 #include <thread> // sleep_for()

 namespace phosphor::power::psu
@@ -1145,10 +1146,25 @@ void PowerSupply::getInputVoltage(double& actualInputVoltage,

 void PowerSupply::checkAvailability()
 {
+    std::cerr << "george: checkAvailability 0: available = " << available
+              << std::endl;
     bool origAvailability = available;
     available = present && !hasInputFault() && !hasVINUVFault() &&
                 !hasPSKillFault() && !hasIoutOCFault();

+    std::cerr << "george: checkAvailability 1: inventoryPath = "
+              << inventoryPath << ", origAvailability = " << origAvailability
+              << ", available = " << available << ", present = " << present
+              << ", hasInputFault = " << hasInputFault()
+              << ", hasVINUVFault = " << hasVINUVFault()
+              << ", hasPSKillFault = " << hasPSKillFault()
+              << ", hasIoutOCFault = " << hasIoutOCFault() << std::endl;
+
+    std::cerr << "george: checkAvailability 2: inputFault = " << inputFault
+              << ", vinUVFault = " << vinUVFault
+              << ", psKillFault =" << psKillFault
+              << ", ioutOCFault = " << ioutOCFault << std::endl;
+
     if (origAvailability != available)
     {
         auto invpath = inventoryPath.substr(strlen(INVENTORY_OBJ_PATH));
-- 
2.34.1

Does the status turn green again when you power on?

Yes, After the host power on, everything returns to normal, and all status turned green

mzipse commented 1 year ago

Per feedback from George, I understand this occurs only on the 4U system config.

lxwinspur commented 1 year ago

IPS lab tested several 4U rainier machines(rainier S1024 1600W PSU), and all have the same problem (IBM FW and IPS FW).

spinler commented 1 year ago

There's not much we can do about a power supply emitting a PSKill fault when it is plugged in. That would be an issue with the power supply firmware, for which there's no code update.

Probably IBM didn't notice it because nobody looked at that page after an AC cycle and before a power on.

The real reason for that availability stuff is because testers complained when a PS had its cord unplugged and it still showed green on the web UI, and they were very insistent when we tried to reject the defect.

I agree this looks bad though. It seems like there are two things that could be done:

  1. Not check for PSKill when power is off
  2. Not include PSKill in the availability check.

@derekhoward55 and our team will discuss.

lxwinspur commented 1 year ago

@spinler Thanks Matt So do you have a plan to upstream? should we remove the logic of PSU_KILL signal? I have a patch local, I hope you check whether it is correct?

From 889b9d3a1d3e5380038bd2d7e70c213755c73f90 Mon Sep 17 00:00:00 2001
From: George Liu <liuxiwei@inspur.com>
Date: Thu, 16 Feb 2023 08:51:11 +0800
Subject: [PATCH] Remove the logic of PSU_KILL signal

Reslove: https://github.com/ibm-openbmc/openbmc/issues/267

Signed-off-by: George Liu <liuxiwei@inspur.com>
---
 phosphor-power-supply/power_supply.cpp | 4 ++--
 phosphor-power-supply/psu_manager.cpp  | 7 -------
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/phosphor-power-supply/power_supply.cpp b/phosphor-power-supply/power_supply.cpp
index 8321a8a..7d77627 100644
--- a/phosphor-power-supply/power_supply.cpp
+++ b/phosphor-power-supply/power_supply.cpp
@@ -1146,8 +1146,8 @@ void PowerSupply::getInputVoltage(double& actualInputVoltage,
 void PowerSupply::checkAvailability()
 {
     bool origAvailability = available;
-    available = present && !hasInputFault() && !hasVINUVFault() &&
-                !hasPSKillFault() && !hasIoutOCFault();
+    available =
+        present && !hasInputFault() && !hasVINUVFault() && !hasIoutOCFault();

     if (origAvailability != available)
     {
diff --git a/phosphor-power-supply/psu_manager.cpp b/phosphor-power-supply/psu_manager.cpp
index cb83a2a..499ebe1 100644
--- a/phosphor-power-supply/psu_manager.cpp
+++ b/phosphor-power-supply/psu_manager.cpp
@@ -677,13 +677,6 @@ void PSUManager::analyze()
                                 additionalData);
                     psu->setFaultLogged();
                 }
-                else if (psu->hasPSKillFault())
-                {
-                    createError(
-                        "xyz.openbmc_project.Power.PowerSupply.Error.PSKillFault",
-                        additionalData);
-                    psu->setFaultLogged();
-                }
                 else if (psu->hasVoutOVFault())
                 {
                     // Include STATUS_VOUT for Vout faults.
-- 
2.34.1

Also, Will this Issue be resolved in the next release version of IPS branch? Thanks :)

lxwinspur commented 1 year ago

@spinler We tested this issue with this patch, it works fine. So should I send a PR into 1030.ips branch, or push a new patch to Gerrit upstream?

spinler commented 1 year ago

PSKill is still a valid fault as far as I know, so the PEL for it can't be removed from the code completely as your patch did above.

Since there was no PSKill PEL, do you know when it turned off? We'd only log a PEL for it if it was active with power on, so apparently before then it turned off.

We talked a bit about this the other day, and the reason PSKill was involved in that check is because it means the PS isn't operating, but the PEL for it doesn't call out the PS as high (it calls out a procedure first) so it won't get a FRU fault LED turned on by default, which would also set the health status on the web UI.

We plan on getting the PS probed when this occurs, but there is some higher priority PS debug going on at the moment.

It's possible we can wrap the PSKill and other faults in that availability check with a power on check...

lxwinspur commented 1 year ago

@spinler Thanks Matt But Will these changes be included in our next release(ips branch)? Because this is a serious problem for IPS.

spinler commented 1 year ago

Could you test a change like the following, where it only considers those faults if power is on:

bool faulted = isPowerOn &&  (hasVINUVFault() || hasPSKillFault()  || hasIoutOCFault());
available = present && !hasInputFault() && !faulted

The downside is the PowerSupply class doesn't currently have a power state, so it would have to get it from the manager class somehow.

If that works, you can just put it into gerrit and we'll pull it in.

lxwinspur commented 1 year ago

Thanks a lot We will test tomorrow.

lxwinspur commented 1 year ago

@spinler Thanks for you patch.

We tested it and works fine, We sent a PR to ibm-openbmc/phosphor-power/1050 branch, could you please take a look at this patch, and then please evaluate whether it can be merged into the ips branch?

lxwinspur commented 1 year ago

@spinler https://gerrit.openbmc.org/c/openbmc/phosphor-power/+/61181

Can we pull this patch to the 1030.ips branch today? Because the version provided by IBM for IPS needs to include this function this week, thanks.

lxwinspur commented 1 year ago

Merged by: https://gerrit.openbmc.org/c/openbmc/phosphor-power/+/61181