open-power / occ


[POWER9] OCC locks up over time if power saving modes enabled #26

Open madscientist159 opened 5 years ago

madscientist159 commented 5 years ago

After a random amount of time, the OCC locks up and downclocks the host if the default Linux power saving modes are enabled. When this happens, not only is the machine slowed down, but fan controls are completely disabled and a full host power-off reboot is required to restore functionality. opal-prd occ reset is sometimes able to restore proper clocks without a reboot, but the fan controls remain offline until a full reboot.

Host dmesg:

[32809.485411] powernv-cpufreq: PMSR = 6363630040000000
[32809.520296] powernv-cpufreq: CPU Frequency could be throttled
[32844.323314] powernv-cpufreq: PMSR =  12d630000000000
[32844.356229] powernv-cpufreq: CPU Frequency could be throttled

BMC dmesg:

[390435.513335] sbefifo 00:00:00:06: SBE FFDC package len 9 words but only 6 remaining
[390435.521174] occ sbefifo1-dev0: SRAM attn returned failure status: 00fe000a
[390436.569249] sbefifo 00:00:00:06: SBE error cmd a4:04 status=00fe:000a
[390436.575975] sbefifo 00:00:00:06: SBE FFDC package len 9 words but only 6 remaining
[390436.583663] occ sbefifo1-dev0: SRAM attn returned failure status: 00fe000a
[390436.706403] sbefifo 01:01:00:06: SBE error cmd a4:04 status=00fe:000a
[390436.713044] sbefifo 01:01:00:06: SBE FFDC package len 9 words but only 6 remaining
[390436.720885] occ sbefifo2-dev0: SRAM attn returned failure status: 00fe000a
[390437.780615] sbefifo 01:01:00:06: SBE error cmd a4:04 status=00fe:000a
[390437.787314] sbefifo 01:01:00:06: SBE FFDC package len 9 words but only 6 remaining
[390437.795153] occ sbefifo2-dev0: SRAM attn returned failure status: 00fe000a
[390441.406828] sbefifo 01:01:00:06: SBE error cmd a4:04 status=00fe:000a
[390441.413412] sbefifo 01:01:00:06: SBE FFDC package len 9 words but only 6 remaining
[390441.421240] occ sbefifo2-dev0: SRAM attn returned failure status: 00fe000a
[390442.895970] sbefifo 01:01:00:06: SBE error cmd a4:04 status=00fe:000a
[390442.902534] sbefifo 01:01:00:06: SBE FFDC package len 9 words but only 6 remaining
[390442.910341] occ sbefifo2-dev0: SRAM attn returned failure status: 00fe000a

The following command, run at host start, completely prevents the issue from occurring, but is undesirable as the machines use significantly more power at idle:

echo 1 | tee /sys/devices/system/cpu/cpu*/cpuidle/state?/disable
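To confirm that the states were actually disabled (and to re-enable them later), the same cpuidle sysfs attributes can be read back. A minimal sketch, assuming the standard cpuidle sysfs layout:

# show each idle state's name and whether it is disabled (1 = disabled)
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name /sys/devices/system/cpu/cpu0/cpuidle/state*/disable
# re-enable all idle states on all CPUs
echo 0 | tee /sys/devices/system/cpu/cpu*/cpuidle/state?/disable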

Host Linux version: 4.19.0-5-powerpc64le
OCC GIT hash: 58e422de

cjcain commented 5 years ago

Were any eSELs generated on the BMC? There should be HTMGT traces in /var/log/syslog or /var/log/messages on the host (collect SOS data). You should also confirm that the opal-prd service is running on the host (service opal-prd status).

madscientist159 commented 5 years ago

The only thing in /var/log/messages is:

Jul 10 20:52:03 host kernel: [32809.485411] powernv-cpufreq: PMSR = 6363630040000000
Jul 10 20:52:03 host kernel: [32809.520296] powernv-cpufreq: CPU Frequency could be throttled
Jul 10 20:52:38 host kernel: [32844.323314] powernv-cpufreq: PMSR =  12d630000000000
Jul 10 20:52:38 host kernel: [32844.356229] powernv-cpufreq: CPU Frequency could be throttled

opal-prd did, however, record the error and a large amount of debugging information in /var/log/opal-prd.log. See attached: opal-prd-occ-fail.log

cjcain commented 5 years ago

Looks like OCC generated a log:

Jul 10 20:52:03 host opal-prd: HBRT: HTMGT:>>processOccAttn(0x0x7fffb94241d0)
...
Jul 10 20:52:03 host opal-prd: HBRT: HTMGT:E>elogProcessActions: OCC1 requested a WOF reset
Jul 10 20:52:03 host opal-prd: HBRT: HTMGT:E>elogProcessActions: WOF Reset Reasons for OCC1 = 0x00000040
Jul 10 20:52:03 host opal-prd: HBRT: HTMGT:bldErrLog(mod: 0x03, rc: 0x2ADF=WOF_DISABLED, data: 0x00000040 00000040 00000000 11030206, sev: 0x00, fw:n

This indicates that there was a VFRT timeout (WOF_RC_VFRT_REQ_TIMEOUT 0x00000040) on the second processor, and the OCC requested a reset to try to recover. There were some SCOM errors during the reset, which was not expected. After the reset, the OCC in the first processor was not responsive:

Jul 10 20:52:09 host opal-prd: HBRT: HTMGT:processOccStartStatus(Start Success=y, failedOcc=0x00000000)
Jul 10 20:52:24 host opal-prd: HBRT: HTMGT:E>_waitForOccCheckpoint OCC0 still NOT ready! (last checkpoint=0x0000)
Jul 10 20:52:24 host opal-prd: HBRT: HTMGT:bldErrLog(mod: 0x0B, rc: 0x2605, data: 0x00000000 00000000 00000000 00000000, sev: 0x20, fw:n
Jul 10 20:52:24 host opal-prd: HBRT: ERRL:E>Error created : PLID=8900004A, RC=2605, Mod=0B, Userdata=0000000000000000 0000000000000000, Sev=PREDICTIVE
Jul 10 20:52:25 host opal-prd: HBRT: HTMGT:E>OCCs not all active (rc=0x2605).  Attempting OCC Reset

After this next reset, the OCCs did seem to recover:

Jul 10 20:52:29 host opal-prd: HBRT: HTMGT:processOccStartStatus(Start Success=y, failedOcc=0x00000000)
...
Jul 10 20:52:38 host opal-prd: HBRT: HTMGT:OCC0 Poll change: Status:C300 Occs:03 Cfg:00 State:03 Error:00000000/00000000
Jul 10 20:52:38 host opal-prd: HBRT: HTMGT:pollRspHandler: updating OCC0 state to ACTIVE
...
Jul 10 20:52:38 host opal-prd: HBRT: HTMGT:OCC1 Poll change: Status:0300 Occs:02 Cfg:00 State:03 Error:00000000/00000000
Jul 10 20:52:38 host opal-prd: HBRT: HTMGT:pollRspHandler: updating OCC1 state to ACTIVE
...
Jul 10 20:52:38 host opal-prd: HBRT: HTMGT:_setOccState: All OCCs have reached state 0x03
Jul 10 20:52:38 host opal-prd: HBRT: HTMGT:setOccActiveSensors: active

At this point both OCCs are active and running, and I do not see any further errors. From this trace, the frequency should only have been impacted for a few minutes (while the OCCs were in recovery). I believe eSELs should have been generated.

You can confirm the current state of the OCCs with:

occtoolp9 -I

If states are active you can send poll commands to the OCCs:

occtoolp9 -p
occtoolp9 -p -o 1

So you are still seeing performance impacts? What is the governor set to?
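
For reference, the active governor and current frequency can be read back from cpufreq sysfs on the host (paths assume the standard powernv-cpufreq layout):

# check the governor and current frequency on one CPU
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq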

madscientist159 commented 5 years ago

@cjcain The system in question was a critical production system. It has since been restarted with the stop modes disabled so I can't gather current diagnostic information about the fault.

That being said, our rack management systems confirm that the OCC did /not/ fully recover. At minimum the BMC was unable to reconnect to the OCC for fan management, and our experience is that once the server enters this state it tends to fail with a hard CPU lockup within a week.

cjcain commented 5 years ago

Was BMC data collected after the failure? (that should have the eSEL data)

madscientist159 commented 5 years ago

No, we tend not to gather eSEL data since we have no way of decoding / parsing it with the open source tools. My understanding was that opal-prd gathers basically the same diagnostic information; is this not the case?

cjcain commented 5 years ago

opal-prd gathers some error data and sends it to the BMC. The system should be handling the recovery attempts, but if there is a persistent error or need for service, then the data will be sent to the BMC in the form of an eSEL. There is also other data on the BMC that is not known by the host. The BMC also maintains some data across power cycles which the host does not always do. You can run dreport on the BMC to collect data.
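
A minimal invocation, assuming the stock OpenBMC debug-collector tooling is present on the BMC (exact options and output location vary by OpenBMC level):

# run on the BMC; writes a compressed debug data archive
dreport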

madscientist159 commented 5 years ago

@cjcain Thanks for the info. I'm of the opinion that the OCC shouldn't be crashing at all when stop states are enabled -- we've gone well over a year without any OCC problems with the stop states disabled, but as soon as they are enabled OCC goes down within a few days.

The eSEL data may not be too useful (see the following event list). You can see that the failure isn't isolated to a single event; just trying to run the server with stop states enabled generated two faults in four days:

Entry  | ID            | Timestamp            | Serviceable  | Severity  | Message
1      | FQPSPCR0063F  | 2019-07-08 21:23:24  | No           | Warning   | Failed to read from the OCC on CPU 0
2      | FQPSPCR0063F  | 2019-07-08 21:23:26  | No           | Warning   | Failed to read from the OCC on CPU 1
3      | FQPSPCR0063F  | 2019-07-11 01:52:11  | No           | Warning   | Failed to read from the OCC on CPU 0
4      | FQPSPCR0063F  | 2019-07-11 01:52:14  | No           | Warning   | Failed to read from the OCC on CPU 1
5      | FQPSPCR0063F  | 2019-07-11 01:52:36  | No           | Warning   | Failed to read from the OCC on CPU 0

cjcain commented 5 years ago

What level/commit are you using for hostboot and hostboot-binaries?

madscientist159 commented 5 years ago

Hostboot 1e1ea69f, binaries (ring files) 2b4eedbd, HCODE 003286ea4

rbatraAustinIBM commented 5 years ago

@madscientist159 There was a PGPE HCode commit merged back in Feb/Mar that exacerbated a long-standing bug. The fail signature of this issue matches that one. That issue was discovered at the end of May and a fix was put out in early June. Specifically, this commit fixes the issue and was pulled into OpenPOWER on June 6th: https://github.com/open-power/hcode/commit/cade83363c256a1f7810e7ec1b2937700ee7c0f9

Please pull that in and re-try. If the problem persists, then let me know.

madscientist159 commented 5 years ago

@rbatraAustinIBM Unfortunately the problem persists -- in fact it's gotten worse: when it disabled the BMC fan controls, it locked one of the CPU fans to low speed, so we had a box with a significantly overheating CPU until it was caught. Notably, we have two machines that seem far more susceptible to this than the others; I'm not sure if this is because of workload or potentially marginal / faulty CPUs.

The SEL recorded a single instance of the same failure code as before on CPU1. opal-prd shows the crash / recovery (attached), but the BMC was again unable to reestablish sensor reads, so it locked the fan to a setting that could damage hardware.

BMC dmesg:

[310226.808382] sbefifo 00:00:00:06: SBE FFDC package len 9 words but only 6 remaining
[310226.816063] occ sbefifo1-dev0: SRAM attn returned failure status: 00fe000a
[310227.862235] sbefifo 00:00:00:06: SBE error cmd a4:04 status=00fe:000a
[310227.868921] sbefifo 00:00:00:06: SBE FFDC package len 9 words but only 6 remaining
[310227.876745] occ sbefifo1-dev0: SRAM attn returned failure status: 00fe000a
[310249.641800] sbefifo 00:00:00:06: SBE error cmd a4:04 status=00fe:000a
[310249.648524] sbefifo 00:00:00:06: SBE FFDC package len 9 words but only 6 remaining
[310249.656352] occ sbefifo1-dev0: SRAM attn returned failure status: 00fe000a

opal-prd-fail.log

madscientist159 commented 5 years ago

Digging a little further, the BMC was eventually able to restore communication, but it doesn't look like the OpenBMC userspace is written to handle the OCC dropping off and reappearing. Our watchdog (added because the IBM thermal monitoring would lock up and drop off for a variety of reasons) looks for occ_error to be non-zero -- during and after this event occ_error remained zero, so the userspace processes that read CPU1 temperatures on the BMC remained hung.
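
For illustration, a simplified version of the kind of watchdog check described above; the occ_error sysfs location is an assumption and will vary with the BMC kernel level:

#!/bin/sh
# hypothetical sketch: flag a problem if any OCC hwmon device reports a non-zero occ_error
for f in /sys/bus/platform/devices/occ-hwmon.*/occ_error; do
    [ -e "$f" ] || continue
    err=$(cat "$f" 2>/dev/null)
    [ -n "$err" ] && [ "$err" != "0" ] && echo "OCC error reported in $f: $err"
done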

Until the BMC software stack can reliably handle the OCC disappearing and reappearing, I'm considering this a serious bug -- one of the main concerns is who has liability right now for CPUs that are destroyed due to overheating once the OCC crashes. If there's anything I can do to assist with debugging this please let me know ASAP.

Thanks!

marthabroyles commented 5 years ago

There are two different issues here. One is the OCC going through a reset, which the BMC must handle; this should not be causing fans to go low -- fans should go high when the BMC detects that it can no longer get temperatures from the OCC. The BMC team should be looking at the fan issue. Did you gather the eSELs from this last occurrence so we can see why the OCC was reset?

madscientist159 commented 5 years ago

@marthabroyles What's happening is that the BMC userspace isn't recovering from the OCC dropping off / reappearing. The working theory right now is that the hwmon D-Bus proxy isn't detecting the device node going away, so it keeps trying to read from a stale handle; the process is still present, and the BMC kernel is able to read the temps after OCC recovery has finished. This means the fan controls are using stale data from when the CPU was in throttled mode, allowing the reactivated CPU to overheat.

What I still don't understand is why the OCC resets are hitting only two machines out of a large number.

There was only one eSEL generated, for the OCC read failure on the BMC side. Since the OCC recovered this time and the BMC fan controls could be manually reset, the machine is still online, so I can gather additional info if needed.

sharkcz commented 5 years ago

@madscientist159 asked me to comment here, because I have also seen the

úno 05 17:32:30 talos.danny.cz kernel: powernv-cpufreq: Pstate set to safe frequency
úno 05 17:32:30 talos.danny.cz kernel: powernv-cpufreq: PMSR = 6363630040000000
úno 05 17:32:30 talos.danny.cz kernel: powernv-cpufreq: CPU Frequency could be throttled

messages. They happened twice in roughly one year on my Talos with their last stable firmware. In both cases it was when I started a KVM-based VM. I don't think I can provide more info except kernel versions for host & guest.

rbatraAustinIBM commented 5 years ago

@madscientist159 Chris Cain pointed out the following information from the HBRT log about the image build level you are running with.

Jul 14 13:00:49 host opal-prd: HBRT: ISTEPS_TRACE:pullTraceBuildInfo: image build date = 20190713, build time = 0011, build tag = hw042519a.940

As you can see, the build tag is hw042519a.940, which is April 25th. However, the fix that I mentioned earlier went in on June 6th, so you will need to run with an HW image tag at the June 5th level or higher. It should be named hw060519a.940.

madscientist159 commented 5 years ago

@rbatraAustinIBM We build our own HCODE images; the linked patch was applied to our existing tree and the HCODE was rebuilt. That is why the tag still shows the older string value.

Was there an additional patch that fixed this problem as well? We don't really want to have to go through the entire months-long re-qualification process starting with beta firmware again, which would be needed if we pulled the newest HCODE at this point.

cjcain commented 5 years ago

FYI, the symptoms in that last opal-prd-fail.log (from yesterday) were the same as the original failure, which is why I had @rbatraAustinIBM confirm the code level.

rbatraAustinIBM commented 5 years ago

@madscientist159 You mentioned you are seeing this problem on only a couple of machines. Have you ever seen the problem on these machines before, even with older code levels? Also, what was the previous working code level, in terms of HW image tag?

madscientist159 commented 5 years ago

@cjcain Understood, and that's why I'm wondering if there was a different patch that is also needed.

@rbatraAustinIBM Yes, we saw this problem basically from day 1. It has persisted across parts swaps (the first thought was a defective CPU / mainboard) and multiple firmware levels, including revisions from back in 2018. That particular machine is a heavily loaded production VM host, which is also why it is not a trivial matter to get approval to test a firmware level against that particular system.

The other system still needs to have a parts swap done as a matter of routine to rule out hardware problems, but it may be a better test bed if that pans out, as it does not handle the same level of critical workloads.

marthabroyles commented 5 years ago

The handling of OCC resets should be looked at from the BMC side. The BMC is told whenever the OCC is being reset so that it will stop trying to communicate with the OCC and adjust the fans accordingly (higher) while it does not have temperatures from the OCC. Once the OCC reset is complete and the OCCs are running again, the BMC is informed so it can resume communicating with the OCCs. We use an OCC reset as a recovery method which must work; the design allows 3 OCC resets before we consider there to be a real problem. In this case the OCC reset did recover the OCC, but it sounds like the BMC did not recover, which is a big problem. OCC resets are a known, architected recovery mechanism that the BMC should be handling.

msbarth commented 5 years ago

To handle the OCC reset (active status), there is an OpenPOWER OCC control application (https://github.com/openbmc/openpower-occ-control) that was created to set an associated OCC active property on D-Bus and bind/unbind the driver per OCC. The active property controlled by this application is the signal that can be configured within fan control to set fans to full speed and to unload any events that use this active property as a precondition. Is this application being used, and is fan control configured with a precondition against all OCC active properties that must pass before fan speed changes are allowed (otherwise the fans run at a set speed, i.e. full speed)?

Once all OCCs become active again, the fan control's configured precondition would pass, allowing the events configured to be loaded, subscribing to the OCC sensors and enabling fan speed changes. (https://github.com/openbmc/phosphor-fan-presence/blob/master/control/preconditions.cpp#L17-L76)
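
For what it's worth, the current value of that active property can be inspected from the BMC shell; a sketch assuming the service, object path, and interface names used by openpower-occ-control:

# query the OccActive property for the first OCC (run on the BMC)
busctl get-property org.open_power.OCC.Control \
    /org/open_power/control/occ0 \
    org.open_power.OCC.Status OccActive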

madscientist159 commented 5 years ago

@msbarth Yes, it is being used, and it did work as intended in every OCC failure up until the last one. The BMC revision level did not change; for some reason this latest OCC failure was not properly detected by the BMC, and therefore the fans did not ramp up as they normally do (the ramped-up fans cause a massive temperature drop, which shows on our rack management systems as an alert; this is normally the first indication we get that the OCC has failed).

msbarth commented 5 years ago

@madscientist159 Ok, I've mentioned this issue to the maintainer of the OCC control application to take a look into that OCC-to-BMC communication path.

eddiejames commented 5 years ago

@madscientist159 I see from the BMC dmesg that there are errors communicating with the OCC. Can you tell if this is before, during, or after the OCC was reset?

What kernel is on your BMC?

You also said "the BMC kernel is able to read the temps after OCC recovery has finished." Does this mean the BMC recovered fine after the OCC reset? I don't think it's possible there is stale temperature data coming from the BMC OCC driver. If it can't get the poll response, it will return an error when the hwmon files are read.
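
One quick way to check that behavior from the BMC shell (the hwmon index depends on probe order, so the paths below are illustrative):

# read every hwmon temperature input; a failed OCC poll should show up as a
# read error rather than a stale value
for f in /sys/class/hwmon/hwmon*/temp*_input; do
    printf '%s: ' "$f"
    cat "$f" || echo "read failed"
done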

madscientist159 commented 5 years ago

@eddiejames The dmesg failure entries on the host and the BMC start at exactly the same time, down to the second. I'm not comfortable trying to infer ordering below 1-second resolution, since the BMC and host dmesg stamps can drift slightly relative to one another.

Yes, the BMC kernel recovered after the OCC reset this time. The BMC userspace for whatever reason did not. Normally the OCC reset knocks the BMC kernel out too, requiring a reboot of host and BMC.

Kernel 5.0.7 is on the BMC.

eddiejames commented 5 years ago

Normally the OCC reset breaks the BMC? That seems odd.

But there were a couple of fixes put into 5.1 for the OCC driver that could be helpful here. Specifically:

https://github.com/openbmc/linux/commit/b5c46a53ee6b1f97104883d93847b2b2fdddc289

Without this, it was possible to get stale data.

rbatraAustinIBM commented 5 years ago

@madscientist159 Ok, so we still need to understand why the failure occurred in the first place. If you have the commit applied, then we will need to debug this further. Can you please gather the BMC FFDC at the point of failure?

cjcain commented 5 years ago

Let us first confirm that the PM Complex can get reset and still maintain communication with the BMC. Copy the occtoolp9 script to the host. Boot your system so that the OCCs are active and running. You can verify by running the following on the host:

#  occtoolp9 -I
==> opal-prd --expert-mode htmgt-passthru 0x01
HTMGT State: ACTIVE (0x03), 1 OCC (master:OCC0), resetCount:0
OCC0: Master ACTIVE (0x03) resetCount:0 wofResets:0  pollRsp:0xc3000100...

Then you can manually issue a reset (takes ~15 seconds) and then you should see the OCCs go active again and the reset count increment:

# opal-prd occ reset 
# ./occtoolp9 -I
==> opal-prd --expert-mode htmgt-passthru 0x01
HTMGT State: ACTIVE (0x03), 1 OCC (master:OCC0), resetCount:0
           : PM Complex Resets since boot: 1
OCC0: Master ACTIVE (0x03) resetCount:1 wofResets:0  pollRsp:0xc3000100...

If you continue to issue reset commands (total of 4), you can verify the system goes to safe mode:

# occtoolp9 -I
==> opal-prd --expert-mode htmgt-passthru 0x01
HTMGT State: ACTIVE (0x03), 1 OCC (master:OCC0), resetCount:0
           : PM Complex Resets since boot: 3
OCC0: Master ACTIVE (0x03) resetCount:3 wofResets:0  pollRsp:0xc3000100...
# opal-prd occ reset
# ./occtoolp9 -I
==> opal-prd --expert-mode htmgt-passthru 0x01
HTMGT State: SAFE   (0x04), 1 OCC (master:OCC0), resetCount:0,  (SAFE MODE: rc=0x2616/OCC0)
           : PM Complex Resets since boot: 4
OCC0: Master RESET  (0x85) resetCount:4 wofResets:0 FAILED  pollRsp:0xc3000100...

The OCCs/PM Complex should be held in reset and the system is running in safe mode. At this point, an eSEL should have been generated and sent to the BMC. This test should verify the full reset/recovery process as well as the sending of eSELs to the BMC. You can send me the BMC FFDC and opal_prd.log if you need help debugging. To recover, you can just reboot the system.
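
If it helps to script the above, a rough host-side loop (assuming occtoolp9 is in the current directory and the opal-prd service is running) might look like:

# sketch: force four PM Complex resets, checking OCC state after each one;
# after the 4th reset the system is expected to stay in SAFE mode
for i in 1 2 3 4; do
    opal-prd occ reset
    sleep 30
    ./occtoolp9 -I
done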

cjcain commented 5 years ago

In an email from @madscientist159 (regarding testing the manual OCC resets - see my last comment):

Yes, it was able to recover, though if it kept having to be reset eventually the host would hang (some weeks / months of OCC faults and resets).

Did the traces shed any light on why the OCC was going into safe mode in the first place?

First I want to confirm:

  1. You forced 3 reset requests and saw successful recovery (all OCCs went back to the Active state) after each reset.
  2. After the 4th reset you should have seen the system stay in safe mode, and an eSEL should have been generated and sent to the BMC indicating that fact.

Regarding the traces from this issue:

The initial failure you hit did have a fix. If you have applied Rahul's patch to your driver (open-power/hcode@cade833) and continue to see the system enter safe mode, then we will need additional data to debug the new failure. As Rahul requested: we need both the BMC FFDC data and the opal traces from the host after you notice the system enter safe mode.

If you are seeing the host hang, you can still reboot the host and then collect the data (since it should not be lost across reboots).

There should always be an eSEL that is generated and sent to the BMC anytime the system is left in safe mode. This is what we use to determine what triggered safe mode. I did not see any eSELs listed in this issue.

Since the build being used on your systems is missing many fixes/commits, we need additional data to narrow down what you are seeing and to determine whether there are additional commits you need to pick up.

marthabroyles commented 5 years ago

@madscientist159 Let's make sure you have all the known fixes for the reason the reset is being requested in the first place (PGPE VFRT timeout). There has been another PGPE fix for this issue since the previous level @rbatraAustinIBM pointed you to on July 11th. @rbatraAustinIBM, is the latest VFRT timeout fix available? Please note here the fix they need to pull in.
If resets are still happening with this additional fix, then we will need the additional info requested: both the BMC FFDC data and the opal traces from the host after you notice the system enter safe mode.

rbatraAustinIBM commented 5 years ago

We recently fixed a problem with a similar fail signature. However, that commit is not yet available on the openpower GitHub. Once it's available, I will notify you.

Also, hold off on trying to pull any more FFDC until then.

madscientist159 commented 4 years ago

@rbatraAustinIBM Is the commit available yet? We're seeing this failure at a customer site post DD2.3 upgrade; it's knocking the fan controls offline and providing a poor customer experience.

rbatraAustinIBM commented 4 years ago

@madscientist159 Yes, the commit should be available now. Can you tell me what op driver level this machine is on?

madscientist159 commented 4 years ago

We build the firmware from source, so our driver level usually doesn't match the IBM driver levels. Can you point to the commit so that we can pull that revision and test?

madscientist159 commented 4 years ago

@rbatraAustinIBM Were you able to locate this information? We have successfully replicated the customer-site failure in our lab on DD2.3 CPUs; it appears to be triggered by compilation-heavy workloads.

rbatraAustinIBM commented 4 years ago

Hi, sorry for the delayed response. I didn't see your previous response until now. Here are the commits.

https://rchgit01.rchland.ibm.com/gerrit1/#/c/84277/ (PGPE: Multiple Droop Handling Fix)
https://rchgit01.rchland.ibm.com/gerrit1/#/c/85148/ (PGPE: Fix bad volt updt upon nack)

rbatraAustinIBM commented 4 years ago

Also note that these commits may or may not fix your problem. If not, then we have a different problem.