Closed shenki closed 5 years ago
uint32_t RsvdTraceBuffer::getAvailableSpace(uint32_t i_spaceNeeded,
30b0 000110b0: e9 0a 00 10 ld r8,16(r10)
getAlignedSizeOfEntry(i_entry->size) -1); };
bool RsvdTraceBuffer::removeOldestEntry()
3250 00011250: 4b ff fe 41 bl 11090 <TRACE::RsvdTraceBuffer::getAvailableSpace(unsigned int, char*&)>
3254 00011254: 60 00 00 00 nop
while (l_spaceAvailable < i_spaceNeeded)
I can't see anything glaringly obvious in the code so we'll probably need to recreate it with some more logging to see where things go wrong. It is odd to see this crash since it is working under PHYP and has also been running fine for a while in other environments. I have no idea why it is crashing in this specific environment.
Is opal-prd v6.1 newish?
I can't see anything glaringly obvious in the code so we'll probably need to recreate it with some more logging to see where things go wrong.
Do you have instructions for how to turn on more logging?
Is opal-prd v6.1 newish?
It's from July, but the prd codebase has not seen much action so I would consider it newish. I was able to recreate the crash with master too.
Do you have instructions for how to turn on more logging?
It would involve modifying code, there isn't anything more useful to enable. Just your basic debug-via-printf strategy.
I was able to recreate the crash with master too.
If you back-level opal-prd, does it still fail? It seems like that might be the trigger more than the firmware level.
I went back to one of the earliest releases, 5.1.0 from August 2015, and it still happens.
I hit this issue on witherspoon today with upstream master.
I'm using the latest opal-prd code. This is failing deep inside HBRT code. I think it's an HBRT issue.
-Vasant
@dcrowell77 any update?
-Vasant
Opened bugzilla -> https://bugzilla.linux.ibm.com/show_bug.cgi?id=175251
This blocks op-test-framework from running default suite for CI.
From Deb's data, I see this:
Feb 4 14:03:38 bstn007p1 opal-prd: IMAGE: hbrt map at 0x7321fc540000, size 0x4c0000
Feb 4 14:03:38 bstn007p1 kernel: [ 637.267558] opal-prd[5792]: unhandled signal 11 at 00007cc782c3a018 nip 00007321fc5510b0 lr 00007321fc551254 code 1
// If the list is empty, then the full buffer is available
if (isListEmpty())
30a8 000110a8: 2f aa 00 00 cmpdi cr7,r10,0
30ac 000110ac: 41 9e 00 84 beq cr7,11130 <TRACE::RsvdTraceBuffer::getAvailableSpace(unsigned int, char*&)+0xa0>
// Cache some useful data for easy calculations later on
uintptr_t l_bufferBeginningBoundary = getAddressOfPtr(iv_bufferBeginningBoundary);
uintptr_t l_bufferEndingBoundary = getAddressOfPtr(iv_bufferEndingBoundary);
Entry* l_head = getListHead();
uintptr_t l_headAddr = getAddressOfPtr(l_head);
uintptr_t l_tailAddrEnd = getEndingAddressOfEntry(l_head->prev);
30b0 000110b0: e9 0a 00 10 ld r8,16(r10)
getAlignedSizeOfEntry(i_entry->size) -1); };
This code is trying to grab a chunk of memory that we left around on the previous execution of hbrt so that we can see traces and possibly figure out why we crashed. This code path is fairly new so that would explain why it started showing up recently.
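As a cross-check, the offsets in the listing can be recovered from the two log lines: subtracting the hbrt map base from the faulting nip and lr should give the addresses shown in the disassembly. The following is a minimal sketch of that arithmetic. The helper names `offsetInImage` and `inImage` are made up for illustration and are not part of opal-prd or HBRT.

```cpp
#include <cstdint>

// Hypothetical helpers (not part of opal-prd or HBRT).

// Offset of a runtime address from the start of the mapped image.
constexpr uint64_t offsetInImage(uint64_t addr, uint64_t mapBase)
{
    return addr - mapBase;
}

// True if the address lies inside [mapBase, mapBase + mapSize).
constexpr bool inImage(uint64_t addr, uint64_t mapBase, uint64_t mapSize)
{
    return addr >= mapBase && addr - mapBase < mapSize;
}

// Values copied from the opal-prd and kernel log lines quoted above.
constexpr uint64_t HBRT_MAP_BASE = 0x7321fc540000ULL;
constexpr uint64_t HBRT_MAP_SIZE = 0x4c0000ULL;
constexpr uint64_t FAULT_NIP     = 0x00007321fc5510b0ULL;
constexpr uint64_t FAULT_LR      = 0x00007321fc551254ULL;
constexpr uint64_t FAULT_ADDR    = 0x00007cc782c3a018ULL;

// nip lands on the "ld r8,16(r10)" at 000110b0 in the listing.
static_assert(offsetInImage(FAULT_NIP, HBRT_MAP_BASE) == 0x110b0,
              "nip offset should match the disassembly address");

// lr lands on the nop at 00011254, right after the bl to getAvailableSpace.
static_assert(offsetInImage(FAULT_LR, HBRT_MAP_BASE) == 0x11254,
              "lr offset should match the return address after the bl");

// The data address that faulted is not inside the HBRT image at all.
static_assert(!inImage(FAULT_ADDR, HBRT_MAP_BASE, HBRT_MAP_SIZE),
              "faulting data address is outside the mapped image");
```

The checks are compile-time only, so the block builds cleanly exactly when the numbers in the log and the listing agree.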
Feb 4 14:03:38 bstn007p1 opal-prd: HBRT: TARG:<<hb_get_rt_rsvd_mem(0x5452414345425546, 0, 65536) -> 0x00007321FC52A000
Feb 4 14:03:38 bstn007p1 opal-prd: HBRT: >> RsvdTraceBufService::retrieveDataFromLastCrash
Feb 4 14:03:38 bstn007p1 opal-prd: HBRT: ERRL:>>ErrlManager::ErrlManager constructor.
Above shows us getting a pointer to the section so that seems valid. We have checks to avoid looking at null data, but clearly something is going wrong. This code is not in the op920/op910 branches so it hasn't been tested by that work, but it has been in master for 6+ months now running under PHYP with no issues.
I came across this commit, which may be associated:
commit ff576aa8187b47f61f902b6a097693d00c937d4c
Author: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Date: Mon Jul 30 15:28:46 2018 +0530
opal-prd: Fix opal-prd crash
Presently callback function from HBRT uses r11 to point to target function
pointer. r12 is garbage. This works fine when we compile with "-no-pie" option
(as we don't use r12 to calculate TOC).
As per ABIv2 : "r12 : Function entry address at global entry point"
With "-pie" compilation option, we have to set r12 to point to global function
entry point. So that we can calculate TOC properly.
Crash log without this patch:
opal-prd[2864]: unhandled signal 11 at 0000000000029320 nip 0000000102012830 lr 0000000102016890 code 1
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
CC: Jeremy Kerr <jk@ozlabs.org>
CC: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Jeremy Kerr <jk@ozlabs.org>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
@debmc the above opal-prd commit is already in master, and I believe most recent distros also have this fix, so we are good from the opal-prd side.
-Vasant
@dcrowell77 I'm not sure why you are not hitting this issue on PHYP, but we have been hitting it with hostboot master for some time. It's easy to reproduce: just install recent firmware and start opal-prd. It crashes.
-Vasant
Dan notifications@github.com writes:
From Deb's data, I see this:
Feb 4 14:03:38 bstn007p1 opal-prd: IMAGE: hbrt map at 0x7321fc540000, size 0x4c0000
Feb 4 14:03:38 bstn007p1 kernel: [ 637.267558] opal-prd[5792]: unhandled signal 11 at 00007cc782c3a018 nip 00007321fc5510b0 lr 00007321fc551254 code 1
// If the list is empty, then the full buffer is available
if (isListEmpty())
30a8 000110a8: 2f aa 00 00 cmpdi cr7,r10,0
30ac 000110ac: 41 9e 00 84 beq cr7,11130 <TRACE::RsvdTraceBuffer::getAvailableSpace(unsigned int, char*&)+0xa0>
// Cache some useful data for easy calculations later on
uintptr_t l_bufferBeginningBoundary = getAddressOfPtr(iv_bufferBeginningBoundary);
uintptr_t l_bufferEndingBoundary = getAddressOfPtr(iv_bufferEndingBoundary);
Entry* l_head = getListHead();
uintptr_t l_headAddr = getAddressOfPtr(l_head);
uintptr_t l_tailAddrEnd = getEndingAddressOfEntry(l_head->prev);
30b0 000110b0: e9 0a 00 10 ld r8,16(r10)
getAlignedSizeOfEntry(i_entry->size) -1); };
This code is trying to grab a chunk of memory that we left around on the previous execution of hbrt so that we can see traces and possibly figure out why we crashed. This code path is fairly new so that would explain why it started showing up recently.
Feb 4 14:03:38 bstn007p1 opal-prd: HBRT: TARG:<<hb_get_rt_rsvd_mem(0x5452414345425546, 0, 65536) -> 0x00007321FC52A000
Feb 4 14:03:38 bstn007p1 opal-prd: HBRT: >> RsvdTraceBufService::retrieveDataFromLastCrash
Feb 4 14:03:38 bstn007p1 opal-prd: HBRT: ERRL:>>ErrlManager::ErrlManager constructor.
Above shows us getting a pointer to the section so that seems valid. We have checks to avoid looking at null data, but clearly something is going wrong. This code is not in the op920/op910 branches so it hasn't been tested by that work, but it has been in master for 6+ months now running under PHYP with no issues.
I'd guess it's been broken for about 6 months now.
-- Stewart Smith OPAL Architect, IBM.
Any ETA on a fix? This is gating tagging op-build v2.2
I'd guess it's been broken for about 6 months now. … -- Stewart Smith OPAL Architect, IBM.
@stewart-ibm @shenki What test was done to uncover this problem? Could this problem be detected by autoIPL testing?
Michael Lim notifications@github.com writes:
I'd guess it's been broken for about 6 months now. … -- Stewart Smith OPAL Architect, IBM.
@stewart-ibm @shenki What test was done to uncover this problem? Could this problem be detected by autoIPL testing?
It's in the op-test host suite.
You just need to try and start opal-prd. On most OSs this is done on startup, so you just get to check the status of the opal-prd.service to see it failed.
-- Stewart Smith OPAL Architect, IBM.
@stewart-ibm , I just confirmed with Stephanie that we are running Witherspoon DD2.3 systems from op-build master and she is also running a set of OCC tests, including OCC resets, which required opal-prd. Apparently it is running fine. Checking to see what level of OS and skiboot that she is using for her testing.
She is running with skiboot 6.2.1 and on two different systems, she is running with the following Redhat versions:
w51 has:
Red Hat Enterprise Linux Server release 7.6 (Maipo)
Linux w51L.aus.stglabs.ibm.com 4.14.0-115.2.2.el7a.ppc64le #1 SMP Mon Nov 5 17:28:23 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
wsbmc019 has:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Linux ws019os 4.14.0-49.el7a.ppc64le #1 SMP Wed Mar 14 13:58:40 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
Could this be an Ubuntu issue?
She also mentioned that she is very intermittently seeing: SW452077 IPL: OCC Reset Count is not set to 1 with BC8A2AD3 / BC802AD3 - PGPE_FAILURE on wsbmc010 with 920.1831.20181127a but that is likely not related as that would occur after opal-prd is already up and running.
which required opal-prd
My experience on this is that from a fresh boot opal-prd will be running, if opal-prd dies/stops/killed, you cannot restart it due to the seg fault.
That is an interesting observation, and it might help explain the distro dependency. There is a race condition that was observed between the PNOR driver in Linux and opal-prd. If opal-prd starts before the PNOR logic is ready, we deliberately force a crash in order to get restarted slightly later. I think that this race may have been specific to the init sequence for one of the distros (systemd vs initd maybe?). So what might be happening is that Ubuntu always hits this crash and then is susceptible to the restart bug, whereas Redhat installs start clean and we're fine unless we crash for some other reason.
There shouldn't be a race with any Linux driver for the PNOR, that's going to be up and running by the time opal-prd starts.
Okay, I also just confirmed that on Ubuntu 18.04.2 we have the machine boot with opal-prd running, but then if we systemctl stop opal-prd.service and then systemctl start opal-prd.service, it'll never run again, always crashing with:
<snip>
HBRT: Initing module centaur_mba.prf...
HBRT: done.
HBRT: Modules initialized.
HBRT: >> RsvdTraceBufService::initRsvdTraceBufService
HBRT: >> RsvdTraceBufService::init
HBRT: TARG:>>hb_get_rt_rsvd_mem(0x5452414345425546, 0)
IMAGE: hservice_get_reserved_mem: ibm,hbrt-data, 0
IMAGE: hservice_get_reserved_mem: ibm,hbrt-data[0](0x0000201ffd550000) address 0x714aea980000
HBRT: TARG:>>hb_find_rsvd_mem_label(0x5452414345425546, 0x714aea980000)
HBRT: TARG:hb_find_rsvd_mem_label: Entry found at offset 0x000000000019E000, size 65536
HBRT: TARG:<<hb_find_rsvd_mem_label(0x5452414345425546, 65536) -> 0x0000714AEAB1E000
HBRT: TARG:<<hb_get_rt_rsvd_mem(0x5452414345425546, 0, 65536) -> 0x0000714AEAB1E000
HBRT: >> RsvdTraceBufService::retrieveDataFromLastCrash
HBRT: ERRL:>>ErrlManager::ErrlManager constructor.
Segmentation fault
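Two values in the log above can be checked by hand. The label 0x5452414345425546, read as eight ASCII bytes with the most significant byte first, spells TRACEBUF, and the returned pointer is the mapped ibm,hbrt-data address plus the entry offset. The sketch below does both checks. `labelToString` is a hypothetical helper, and the byte order used for decoding is an assumption made for illustration, not something taken from the HBRT source.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: decode a 64-bit label into its eight bytes,
// most significant byte first (assumed byte order, for illustration only).
inline std::string labelToString(uint64_t label)
{
    std::string out;
    for (int shift = 56; shift >= 0; shift -= 8)
    {
        out.push_back(static_cast<char>((label >> shift) & 0xff));
    }
    return out;
}

// Values copied from the log lines above.
constexpr uint64_t RSVD_LABEL    = 0x5452414345425546ULL;
constexpr uint64_t RSVD_MEM_BASE = 0x714aea980000ULL; // mapped ibm,hbrt-data[0]
constexpr uint64_t ENTRY_OFFSET  = 0x19E000ULL;       // "Entry found at offset"
constexpr uint64_t RETURNED_PTR  = 0x0000714AEAB1E000ULL;

// The pointer handed back to HBRT is simply base + offset for this mapping.
static_assert(RSVD_MEM_BASE + ENTRY_OFFSET == RETURNED_PTR,
              "returned pointer should be mapping base plus entry offset");
```

Because the result depends on the mapping base, the same entry offset resolves to a different pointer whenever opal-prd maps the region at a different address.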
There shouldn't be a race with any Linux driver for the PNOR, that's going to be up and running by the time opal-prd starts.
We had a specific defect that was exactly because of pnor not being available. See SW423599 and https://bugzilla.linux.ibm.com/show_bug.cgi?id=165929
And some words from a smart guy I know...
==== State: Open by: sesmith on 02 April 2018 17:50:47 ==== This is 100% a distro bug. We need to wait for the mtd kernel module to have initialized and found the device.
Dan notifications@github.com writes:
There shouldn't be a race with any Linux driver for the PNOR, that's going to be up and running by the time opal-prd starts.
We had a specific defect that was exactly because of pnor not being available. See SW423599 and https://bugzilla.linux.ibm.com/show_bug.cgi?id=165929
And some words from a smart guy I know...
==== State: Open by: sesmith on 02 April 2018 17:50:47 ==== This is 100% a distro bug. We need to wait for the mtd kernel module to have initialized and found the device.
That guy sounds suspicious :)
Looks like a different bug this time though, the distro dependency problem is solved, as on first boot opal-prd is running happily. It's just that we can't restart it.
-- Stewart Smith OPAL Architect, IBM.
In HBRT: >> RsvdTraceBufService::retrieveDataFromLastCrash we generate an error log. If I remove the code generating the error log HBRT init continues and we don't crash.
So @stewart-ibm , knowing that a stop of opal-prd followed by a restart is what's causing the issue. I'd obviously like to continue to debug this but does that lower the severity of this problem such that we could tag a v2.2 of master? Not sure if we have any customer scenarios where opal-prd is stopped and restarted?
I think we're still in bad shape @mzipse because it is broken on some levels of some distros, specifically a newer version (18.10 was the original report, last comment says 18.04.2 works).
@cvswen is closing in on the problem, if it takes another day we have a hack in mind that could be used to disable some functionality but get rid of the crash.
ok. Thanks @dcrowell77 and @cvswen !
Just to add a little more confusion, I just confirmed that we can reboot HBRT inside the PHYP adjunct without any problems, so something is environment specific in some way.
@dcrowell77 Could this problem occur on P8 like B&S system?
The function that is crashing is new in P9 (added last summer).
Maury Zipse notifications@github.com writes:
So @stewart-ibm , knowing that a stop of opal-prd followed by a restart is what's causing the issue. I'd obviously like to continue to debug this but does that lower the severity of this problem such that we could tag a v2.2 of master? Not sure if we have any customer scenarios where opal-prd is stopped and restarted?
You're probably right in that there isn't likely a customer scenario for it and I'd probably be okay with documenting it as a known issue (maybe we can fix in a 2.2.1).
it does mean that all our tests fail though... so it's a pretty big question mark on if any PRD functionality actually works.
-- Stewart Smith OPAL Architect, IBM.
FYI - An early HBRT crash (due to pnor issues again...) was just reported on B&S, to me that means this is not as rare of an occurrence as is being implied...
We figured out the bug: HBRT was mistakenly persisting some pointers across invocations. Since the reserved memory is remapped every time we start, those pointers are going to be invalid. The reason we haven't noticed this issue in PHYP is that their adjunct assigns the same addresses every time, as do our internal CXX testcases, where we test a restart as well.
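To illustrate this class of bug: a raw pointer stored inside the reserved buffer is only meaningful for the mapping it was created under, whereas an offset from the start of the buffer survives a remap. The sketch below is a simplified stand-in, not the HBRT code. The `OffsetEntry` layout and the `readEntry`, `writeEntry` and `prevSizeAfterRemap` names are hypothetical, and two local arrays stand in for the same reserved memory being mapped at two different addresses. It is not necessarily how the hostboot refactor will address the problem.

```cpp
#include <cstdint>
#include <cstring>

// Offset 0 is reserved to mean "no entry" in this sketch.
constexpr uint32_t NO_ENTRY = 0;

// Hypothetical entry header that links by offset from the buffer start
// instead of by raw pointer.
struct OffsetEntry
{
    uint32_t prevOffset; // offset of the previous entry, NO_ENTRY if none
    uint32_t size;       // payload size in bytes
};

// Copy an entry header out of the buffer (memcpy sidesteps alignment and
// aliasing concerns on a raw byte buffer).
inline OffsetEntry readEntry(const unsigned char* base, uint32_t offset)
{
    OffsetEntry e;
    std::memcpy(&e, base + offset, sizeof(e));
    return e;
}

// Copy an entry header into the buffer at the given offset.
inline void writeEntry(unsigned char* base, uint32_t offset,
                       const OffsetEntry& e)
{
    std::memcpy(base + offset, &e, sizeof(e));
}

// Simulate "the previous run filled the buffer, the next run sees the same
// bytes at a different address" and follow a link in the second mapping.
inline uint32_t prevSizeAfterRemap()
{
    unsigned char firstMapping[64]  = {};
    unsigned char secondMapping[64] = {};

    // First run: one entry at offset 8, a second at offset 24 linking back.
    writeEntry(firstMapping, 8,  OffsetEntry{NO_ENTRY, 5});
    writeEntry(firstMapping, 24, OffsetEntry{8, 7});

    // "Restart": identical contents, different base address.
    std::memcpy(secondMapping, firstMapping, sizeof(firstMapping));

    // The link is resolved against the current base, so it stays valid.
    // A raw pointer saved during the first run would still refer to
    // firstMapping here.
    OffsetEntry head = readEntry(secondMapping, 24);
    return readEntry(secondMapping, head.prevOffset).size;
}
```

Calling `prevSizeAfterRemap()` returns the size of the older entry as read through the second copy, which is the lookup that fails when the link is stored as an absolute address.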
@cvswen is going to push up a change to disable this function so we can get a working version out there while we refactor this code.
Change to remove the failing call is here - https://github.com/open-power/hostboot/commit/80cea86add7ba742181cd272b16e10185b5e9a4d
Power9 running op-build master and Ubuntu 18.10 (opal-prd v6.1):
Crash is here:
Full log