opal-prd segfaults on startup

shenki commented 5 years ago

Power9 running op-build master and Ubutnu 18.10 (opal-prd v6.1):

open-power-romulus-v2.2-rc1-4-g05559c5
    buildroot-2018.08.3-9-gfe059872d7
    skiboot-v6.2-rc2
    hostboot-3f1f218-pea3546c
    occ-12c8088
    linux-4.19.4-openpower1-p5bc2a4d
    petitboot-1.9.2
    machine-xml-d91ade7
    hostboot-binaries-hw111318a.930
    capp-ucode-p9-dd2-v4
    sbe-cf61dc3
    hcode-hw112118a.930

Crash is here:

(gdb) bt
#0  0x00007ffff78510b0 in ?? ()
#1  0x00007ffff7851254 in ?? ()
Backtrace stopped: Cannot access memory at address 0x40e5ffffff7f0010

Full log

(gdb) r
Starting program: /usr/sbin/opal-prd --debug --stdio
CTRL: Starting PRD daemon

[Detaching after fork from child process 42184]
I2C: Found Chip: 00000000 engine 3 port 0
I2C: Found Chip: 00000000 engine 1 port 2
I2C: Found Chip: 00000008 engine 3 port 1
I2C: Found Chip: 00000008 engine 1 port 2
I2C: Found Chip: 00000000 engine 3 port 1
I2C: Found Chip: 00000000 engine 2 port 0
I2C: Found Chip: 00000000 engine 1 port 0
I2C: Found Chip: 00000008 engine 3 port 0
I2C: Found Chip: 00000008 engine 1 port 0
CTRL: Listening on control socket /run/opal-prd-control
FW: 16 PRD ranges, instances assigned by firmware
FW:  0000200ffcb90000-0000200ffcc8ffff HCODE [0]
FW:  0000200ffcf40000-0000200ffd03ffff OCC [0]
FW:  0000200ffcb60000-0000200ffcb8ffff RINGOVD [0]
FW:  0000200ffcc90000-0000200ffcf3ffff WOFDATA [0]
FW:  0000200ffd700000-0000200ffd7fffff ibm,arch-reg-data [0]
FW:  00000000f8000000-0000000107ffffff ibm,hb-rsv-mem [0]
FW:  0000200ffd090000-0000200ffd54ffff ibm,hbrt-code-image [0]
FW:  0000200ffd550000-0000200ffd6fffff ibm,hbrt-data [0]
FW:  0000200ffd800000-0000200ffdbfffff ibm,homer-image [0]
FW:  0000200ffdc00000-0000200ffdffffff ibm,homer-image [8]
FW:  0000200fff800000-0000200fffffffff ibm,occ-common-area [0]
FW:  0000200ffd060000-0000200ffd06ffff ibm,sbe-comm [8]
FW:  0000200ffd080000-0000200ffd08ffff ibm,sbe-comm [0]
FW:  0000200ffd050000-0000200ffd05ffff ibm,sbe-ffdc [8]
FW:  0000200ffd070000-0000200ffd07ffff ibm,sbe-ffdc [0]
FW:  0000200ffd040000-0000200ffd04ffff ibm,secure-crypt-algo-code [0]
FW: Chip init
FW: Chip 0x8
FW: Chip 0x0
IMAGE: dumped HBRT binary to hbrt.bin
IMAGE: hbrt map at 0x7ffff7840000, size 0x4c0000
[Detaching after fork from child process 42185]
[Detaching after fork from child process 42186]
HBRT: calling hservices_init
IMAGE: code address: 0x7ffff7840000
IMAGE: calling ibm,hbrt_init()
HBRT: Starting Runtime Services....

HBRT: Initializing modules.

HBRT:   Initing module libtrace_rt.so...
HBRT: done.

HBRT:   Initing module liberrl_rt.so...
HBRT: done.

HBRT:   Initing module libtargeting_rt.so...
HBRT: TARG:AttrRP::startup
HBRT: TARG:>>hb_get_rt_rsvd_mem(0x4154545220202020, 0)
IMAGE: hservice_get_reserved_mem: ibm,hbrt-data, 0
IMAGE: Mapping 0x0000200ffd550000 0x001b0000 ibm,hbrt-data[0]
IMAGE: hservice_get_reserved_mem: ibm,hbrt-data[0](0x0000200ffd550000) address 0x7ffff7690000
HBRT: TARG:>>hb_find_rsvd_mem_label(0x4154545220202020, 0x7ffff7690000)
HBRT: TARG:hb_find_rsvd_mem_label: Entry found at offset 0x0000000000001000, size 581222
HBRT: TARG:<<hb_find_rsvd_mem_label(0x4154545220202020, 581222) -> 0x00007FFFF7691000
HBRT: TARG:<<hb_get_rt_rsvd_mem(0x4154545220202020, 0, 581222) -> 0x00007FFFF7691000
HBRT: TARG:AttrRP::nodeInfoInit 0
HBRT: TARG:Decoded Attribute Section: 0, 0x100000000, 0x7ffff7691000, 0x4b000
HBRT: TARG:Decoded Attribute Section: 2, 0x108000000, 0x7ffff76dc000, 0x5000
HBRT: TARG:Decoded Attribute Section: 3, 0x110000000, 0x7ffff76e1000, 0x2d000
HBRT: TARG:Decoded Attribute Section: 10, 0x118000000, 0x7ffff770e000, 0x1000
HBRT: TARG:Decoded Attribute Section: 1, 0x120000000, 0x7ffff770f000, 0x3000
HBRT: TARG:Push node struct for Node 0
HBRT: TARG:AttrRP::checkHbExistingImage
HBRT: TARG:Target 2 of 297, Physical:/Sys0, HB images 00
HBRT: TARG:[TARG] >> TARGETING::TargetService::init() 
HBRT: TARG:[TARG] I> Max Nodes to initialize is [1]
HBRT: TARG:[TARG] >> TARGETINGAttrRPgetBaseAddress() 
HBRT: TARG:[TARG] >> TARGETING::TargetService::_configureTargetPool(...) 
HBRT: TARG:[TARG] I> Max targets = 297
HBRT: TARG:[TARG] I> i_nodeContainer.targets = 0x10002f0ac
HBRT: TARG:[TARG] I> i_nodeContainer.targets after translation = 0x7ffff76c00ac
HBRT: TARG:[TARG] << TARGETING::TargetService::_configureTargetPool(...) 
HBRT: TARG:[TARG] I> TargetService::init: Pushing info for node 0 with first target huid 0x00020000 and 297 total targets
HBRT: TARG:[TARG] >> TARGETING::TargetService::_setTopLevelTarget() 
HBRT: TARG:[TARG] << TARGETING::TargetService::_setTopLevelTarget() 
HBRT: TARG:[TARG] << TARGETING::TargetService::init() 
HBRT: TARG:Initial PEER_TARGET address for HUID 0xe0001 found to be 0x7ffff76c4f08
HBRT: TARG:Updated PEER_TARGET address for HUID 0xe0001 found to be 0x7ffff76c4f08
HBRT: TARG:Initial PEER_TARGET address for HUID 0xe0002 found to be 0x0
HBRT: TARG:Updated PEER_TARGET address for HUID 0xe0002 found to be 0x0
HBRT: TARG:Initial PEER_TARGET address for HUID 0xe0004 found to be 0x7ffff76c27fc
HBRT: TARG:Updated PEER_TARGET address for HUID 0xe0004 found to be 0x7ffff76c27fc
HBRT: TARG:Initial PEER_TARGET address for HUID 0xe0005 found to be 0x0
HBRT: TARG:Updated PEER_TARGET address for HUID 0xe0005 found to be 0x0
HBRT: TARG:adjustTargetingForRuntime: 4 peer target addresses translated on 1 nodes
HBRT: TARG:adjustTargetingForRuntime: 30 mutex attributes reset on 1 nodes
HBRT: done.

HBRT:   Initing module libutil_rt.so...
HBRT: done.

HBRT:   Initing module libdevicefw_rt.so...
HBRT: done.

HBRT:   Initing module libxscom_rt.so...
HBRT: DEVFW:>>Associator::Associator
HBRT: DEVFW:Device route registered for (4294967295, 23, 5)
HBRT: done.

HBRT:   Initing module libfsiscom_rt.so...
HBRT: DEVFW:Device route registered for (4294967295, 25, 4)
HBRT: done.

HBRT:   Initing module libibscom_rt.so...
HBRT: DEVFW:Device route registered for (4294967295, 26, 4)
HBRT: done.

HBRT:   Initing module libscom_rt.so...
HBRT: DEVFW:Device route registered for (4294967295, 0, 5)
HBRT: DEVFW:Device route registered for (4294967295, 0, 4)
HBRT: DEVFW:Device route registered for (4294967295, 0, 75)
HBRT: DEVFW:Device route registered for (4294967295, 0, 68)
HBRT: DEVFW:Device route registered for (4294967295, 0, 6)
HBRT: DEVFW:Device route registered for (4294967295, 0, 13)
HBRT: DEVFW:Device route registered for (4294967295, 0, 11)
HBRT: DEVFW:Device route registered for (4294967295, 0, 14)
HBRT: DEVFW:Device route registered for (4294967295, 0, 15)
HBRT: DEVFW:Device route registered for (4294967295, 0, 36)
HBRT: DEVFW:Device route registered for (4294967295, 0, 7)
HBRT: DEVFW:Device route registered for (4294967295, 0, 44)
HBRT: DEVFW:Device route registered for (4294967295, 0, 35)
HBRT: DEVFW:Device route registered for (4294967295, 0, 37)
HBRT: DEVFW:Device route registered for (4294967295, 0, 38)
HBRT: DEVFW:Device route registered for (4294967295, 0, 39)
HBRT: DEVFW:Device route registered for (4294967295, 0, 40)
HBRT: DEVFW:Device route registered for (4294967295, 0, 66)
HBRT: DEVFW:Device route registered for (4294967295, 0, 42)
HBRT: DEVFW:Device route registered for (4294967295, 0, 43)
HBRT: DEVFW:Device route registered for (4294967295, 0, 45)
HBRT: DEVFW:Device route registered for (4294967295, 0, 46)
HBRT: DEVFW:Device route registered for (4294967295, 0, 19)
HBRT: DEVFW:Device route registered for (4294967295, 0, 33)
HBRT: DEVFW:Device route registered for (4294967295, 0, 73)
HBRT: DEVFW:Device route registered for (4294967295, 0, 74)
HBRT: DEVFW:Device route registered for (4294967295, 0, 72)
HBRT: done.

HBRT:   Initing module libvpd_rt.so...
HBRT: DEVFW:Device route registered for (0, 6, 5)
HBRT: DEVFW:Device route registered for (1, 6, 5)
HBRT: DEVFW:Device route registered for (0, 7, 4)
HBRT: DEVFW:Device route registered for (1, 7, 4)
HBRT: DEVFW:Device route registered for (0, 13, 2)
HBRT: DEVFW:Device route registered for (1, 13, 2)
HBRT: DEVFW:Device route registered for (0, 17, 11)
HBRT: DEVFW:Device route registered for (1, 17, 11)
HBRT: DEVFW:Device route registered for (0, 5, 3)
HBRT: DEVFW:Device route registered for (1, 5, 3)
HBRT: done.

HBRT:   Initing module libi2c_rt.so...
HBRT: DEVFW:Device route registered for (4294967295, 9, 5)
HBRT: DEVFW:Device route registered for (4294967295, 9, 3)
HBRT: DEVFW:Device route registered for (4294967295, 9, 4)
HBRT: DEVFW:Device route registered for (4294967295, 9, 2)
HBRT: DEVFW:Device route registered for (4294967295, 9, 11)
HBRT: DEVFW:Device route registered for (4294967295, 24, 5)
HBRT: DEVFW:Device route registered for (4294967295, 24, 4)
HBRT: done.

HBRT:   Initing module libhtmgt_rt.so...
HBRT: done.

HBRT:   Initing module libattn_rt.so...
HBRT: done.

HBRT:   Initing module libprdf_rt.so...
HBRT: done.

HBRT:   Initing module libipmi_rt.so...
HBRT: done.

HBRT:   Initing module libpm_rt.so...
IMAGE: hservice_get_reserved_mem: ibm,homer-image, 0
IMAGE: Mapping 0x0000200ffd800000 0x00400000 ibm,homer-image[0]
IMAGE: hservice_get_reserved_mem: ibm,homer-image[0](0x0000200ffd800000) address 0x7ffff7290000
IMAGE: hservice_get_reserved_mem: ibm,homer-image, 8
IMAGE: Mapping 0x0000200ffdc00000 0x00400000 ibm,homer-image[8]
IMAGE: hservice_get_reserved_mem: ibm,homer-image[8](0x0000200ffdc00000) address 0x7ffff6e90000
HBRT: done.

HBRT:   Initing module libpnor_rt.so...
HBRT: done.

HBRT:   Initing module libfapi2_rt.so...
HBRT: done.

HBRT:   Initing module libsecureboot_rt.so...
HBRT: done.

HBRT:   Initing module libp9_cpuWkup.so...
HBRT: done.

HBRT:   Initing module libfsi_rt.so...
HBRT: done.

HBRT:   Initing module libsbeio_rt.so...
IMAGE: hservice_get_reserved_mem: ibm,sbe-comm, 0
IMAGE: Mapping 0x0000200ffd080000 0x00010000 ibm,sbe-comm[0]
IMAGE: hservice_get_reserved_mem: ibm,sbe-comm[0](0x0000200ffd080000) address 0x7ffff7f70000
HBRT: SBEIO:I> COMM_ADDR=0x7FFFF7F70000 for 00050000
IMAGE: hservice_get_reserved_mem: ibm,sbe-comm, 8
IMAGE: Mapping 0x0000200ffd060000 0x00010000 ibm,sbe-comm[8]
IMAGE: hservice_get_reserved_mem: ibm,sbe-comm[8](0x0000200ffd060000) address 0x7ffff7f60000
HBRT: SBEIO:I> COMM_ADDR=0x7FFFF7F60000 for 00050001
HBRT: done.

HBRT:   Initing module libtod_rt.so...
HBRT: done.

HBRT:   Initing module libnestmemutils.so...
HBRT: done.

HBRT:   Initing module libimageprocs_rt.so...
HBRT: done.

HBRT:   Initing module nimbus_proc.prf...
HBRT: done.

HBRT:   Initing module nimbus_eq.prf...
HBRT: done.

HBRT:   Initing module nimbus_ex.prf...
HBRT: done.

HBRT:   Initing module nimbus_ec.prf...
HBRT: done.

HBRT:   Initing module nimbus_capp.prf...
HBRT: done.

HBRT:   Initing module nimbus_pec.prf...
HBRT: done.

HBRT:   Initing module nimbus_phb.prf...
HBRT: done.

HBRT:   Initing module nimbus_obus.prf...
HBRT: done.

HBRT:   Initing module nimbus_xbus.prf...
HBRT: done.

HBRT:   Initing module nimbus_mcbist.prf...
HBRT: done.

HBRT:   Initing module nimbus_mcs.prf...
HBRT: done.

HBRT:   Initing module nimbus_mca.prf...
HBRT: done.

HBRT:   Initing module cumulus_proc.prf...
HBRT: done.

HBRT:   Initing module cumulus_eq.prf...
HBRT: done.

HBRT:   Initing module cumulus_ex.prf...
HBRT: done.

HBRT:   Initing module cumulus_ec.prf...
HBRT: done.

HBRT:   Initing module cumulus_capp.prf...
HBRT: done.

HBRT:   Initing module cumulus_pec.prf...
HBRT: done.

HBRT:   Initing module cumulus_phb.prf...
HBRT: done.

HBRT:   Initing module cumulus_xbus.prf...
HBRT: done.

HBRT:   Initing module cumulus_obus.prf...
HBRT: done.

HBRT:   Initing module cumulus_mc.prf...
HBRT: done.

HBRT:   Initing module cumulus_mi.prf...
HBRT: done.

HBRT:   Initing module cumulus_dmi.prf...
HBRT: done.

HBRT:   Initing module centaur_membuf.prf...
HBRT: done.

HBRT:   Initing module centaur_mba.prf...
HBRT: done.

HBRT: Modules initialized.

HBRT:  >> RsvdTraceBufService::initRsvdTraceBufService

HBRT:  >> RsvdTraceBufService::init

HBRT: TARG:>>hb_get_rt_rsvd_mem(0x5452414345425546, 0)
IMAGE: hservice_get_reserved_mem: ibm,hbrt-data, 0
IMAGE: hservice_get_reserved_mem: ibm,hbrt-data[0](0x0000200ffd550000) address 0x7ffff7690000
HBRT: TARG:>>hb_find_rsvd_mem_label(0x5452414345425546, 0x7ffff7690000)
HBRT: TARG:hb_find_rsvd_mem_label: Entry found at offset 0x000000000019A000, size 65536
HBRT: TARG:<<hb_find_rsvd_mem_label(0x5452414345425546, 65536) -> 0x00007FFFF782A000
HBRT: TARG:<<hb_get_rt_rsvd_mem(0x5452414345425546, 0, 65536) -> 0x00007FFFF782A000
HBRT:  >> RsvdTraceBufService::retrieveDataFromLastCrash

HBRT: ERRL:>>ErrlManager::ErrlManager constructor.

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff78510b0 in ?? ()
(gdb) bt
#0  0x00007ffff78510b0 in ?? ()
#1  0x00007ffff7851254 in ?? ()
Backtrace stopped: Cannot access memory at address 0x40e5ffffff7f0010

dcrowell77 commented 5 years ago

uint32_t RsvdTraceBuffer::getAvailableSpace(uint32_t i_spaceNeeded, 30b0 000110b0: e9 0a 00 10 ld r8,16(r10) getAlignedSizeOfEntry(i_entry->size) -1); };

bool RsvdTraceBuffer::removeOldestEntry() 3250 00011250: 4b ff fe 41 bl 11090 <TRACE::RsvdTraceBuffer::getAvailableSpace(unsigned int, char*&)> 3254 00011254: 60 00 00 00 nop while (l_spaceAvailable < i_spaceNeeded)

I can't see anything glaringly obvious in the code so we'll probably need to recreate it with some more logging to see where things go wrong. It is odd to see this crash since it is working under PHYP and also has been running fine for awhile in other environments. I have no idea why it is crashing in this specific environment.

Is opal-prd v6.1 newish?

shenki commented 5 years ago

I can't see anything glaringly obvious in the code so we'll probably need to recreate it with some more logging to see where things go wrong.

Do you have instructions for how to turn on more logging?

Is opal-prd v6.1 newish?

It's from July, but the prd codebase has not seen much action so I would consider it newish. I was able to recreate the crash with master too.

dcrowell77 commented 5 years ago

Do you have instructions for how to turn on more logging?

It would involve modifying code, there isn't anything more useful to enable. Just your basic debug-via-printf strategy.

I was able to recreate the crash with master too.

If you back-level opal-prd, does it still fail? It seems like that might be the trigger more than the firmware level.

shenki commented 5 years ago

I went back to one of the earliest releases, 5.1.0 from August 2015, and it still happens.

hegdevasant commented 5 years ago

I hit this issue on witherspoon today with upstream master.

I'm using latest opal-prd code. This is failing deep inside HBRT code. I think its HBRT issue.

-Vasant

hegdevasant commented 5 years ago

@dcrowell77 any update?

-Vasant

debmc commented 5 years ago

Opened bugzilla -> https://bugzilla.linux.ibm.com/show_bug.cgi?id=175251

This blocks op-test-framework from running default suite for CI.

dcrowell77 commented 5 years ago

From Deb's data, I see this:

Feb 4 14:03:38 bstn007p1 opal-prd: IMAGE: hbrt map at 0x7321fc540000, size 0x4c0000

Feb 4 14:03:38 bstn007p1 kernel: [ 637.267558] opal-prd[5792]: unhandled signal 11 at 00007cc782c3a018 nip 00007321fc5510b0 lr 00007321fc551254 code 1

// If the list is empty, then the full buffer is available
if (isListEmpty())
30a8    000110a8:   2f aa 00 00     cmpdi   cr7,r10,0
30ac    000110ac:   41 9e 00 84     beq     cr7,11130 <TRACE::RsvdTraceBuffer::getAvailableSpace(unsigned int, char*&)+0xa0>
    // Cache some useful data for easy calculations later on
    uintptr_t l_bufferBeginningBoundary = getAddressOfPtr(iv_bufferBeginningBoundary);
    uintptr_t l_bufferEndingBoundary = getAddressOfPtr(iv_bufferEndingBoundary);
    Entry* l_head = getListHead();
    uintptr_t l_headAddr = getAddressOfPtr(l_head);
    uintptr_t l_tailAddrEnd = getEndingAddressOfEntry(l_head->prev);
30b0    000110b0:   e9 0a 00 10     ld      r8,16(r10)
                        getAlignedSizeOfEntry(i_entry->size) -1); };

This code is trying to grab a chunk of memory that we left around on the previous execution of hbrt so that we can see traces and possibly figure out why we crashed. This code path is fairly new so that would explain why it started showing up recently.

Feb 4 14:03:38 bstn007p1 opal-prd: HBRT: TARG:<<hb_get_rt_rsvd_mem(0x5452414345425546, 0, 65536) -> 0x00007321FC52A000 Feb 4 14:03:38 bstn007p1 opal-prd: HBRT: >> RsvdTraceBufService::retrieveDataFromLastCrash Feb 4 14:03:38 bstn007p1 opal-prd: HBRT: ERRL:>>ErrlManager::ErrlManager constructor.

Above shows us getting a pointer to the section so that seems valid. We have checks to avoid looking at null data, but clearly something is going wrong. This code is not in the op920/op910 branches so it hasn't been tested by that work, but it has been in master for 6+ months now running under PHYP with no issues.

debmc commented 5 years ago

I came across this commit, which may be associated:

commit ff576aa8187b47f61f902b6a097693d00c937d4c Author: Vasant Hegde hegdevasant@linux.vnet.ibm.com Date: Mon Jul 30 15:28:46 2018 +0530

opal-prd: Fix opal-prd crash

Presently callback function from HBRT uses r11 to point to target function
pointer. r12 is garbage. This works fine when we compile with "-no-pie" option
(as we don't use r12 to calculate TOC).

As per ABIv2 : "r12 : Function entry address at global entry point"

With "-pie" compilation option, we have to set r12 to point to global function
entry point. So that we can calculate TOC properly.

Crash log without this patch:
  opal-prd[2864]: unhandled signal 11 at 0000000000029320 nip 00000 00102012830 lr 0000000102016890 code 1

Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
CC: Jeremy Kerr <jk@ozlabs.org>
CC: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Jeremy Kerr <jk@ozlabs.org>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>

hegdevasant commented 5 years ago

@debmc above opal-prd commit is already in master and I believe most of the recent distro also has this fix. so we are good from opal-prd side.

-Vasant

hegdevasant commented 5 years ago

@dcrowell77 I'm not sure why you are not hitting this issue on PHYP. But we have been hitting this issue with hostboot master for sometime. Its easy to reproduce.. Just install recent firmware and start opal-prd. It crashes.

-Vasant

ghost commented 5 years ago

Dan notifications@github.com writes:

From Deb's data, I see this:

Feb 4 14:03:38 bstn007p1 opal-prd: IMAGE: hbrt map at 0x7321fc540000, size 0x4c0000

Feb 4 14:03:38 bstn007p1 kernel: [ 637.267558] opal-prd[5792]: unhandled signal 11 at 00007cc782c3a018 nip 00007321fc5510b0 lr 00007321fc551254 code 1
// If the list is empty, then the full buffer is available
if (isListEmpty())
30a8  000110a8:   2f aa 00 00     cmpdi   cr7,r10,0
30ac  000110ac:   41 9e 00 84     beq     cr7,11130 <TRACE::RsvdTraceBuffer::getAvailableSpace(unsigned int, char*&)+0xa0>
    // Cache some useful data for easy calculations later on
    uintptr_t l_bufferBeginningBoundary = getAddressOfPtr(iv_bufferBeginningBoundary);
    uintptr_t l_bufferEndingBoundary = getAddressOfPtr(iv_bufferEndingBoundary);
    Entry* l_head = getListHead();
    uintptr_t l_headAddr = getAddressOfPtr(l_head);
    uintptr_t l_tailAddrEnd = getEndingAddressOfEntry(l_head->prev);
30b0  000110b0:   e9 0a 00 10     ld      r8,16(r10)
                        getAlignedSizeOfEntry(i_entry->size) -1); };
This code is trying to grab a chunk of memory that we left around on the previous execution of hbrt so that we can see traces and possibly figure out why we crashed. This code path is fairly new so that would explain why it started showing up recently.

Feb 4 14:03:38 bstn007p1 opal-prd: HBRT: TARG:<<hb_get_rt_rsvd_mem(0x5452414345425546, 0, 65536) -> 0x00007321FC52A000 Feb 4 14:03:38 bstn007p1 opal-prd: HBRT: >> RsvdTraceBufService::retrieveDataFromLastCrash Feb 4 14:03:38 bstn007p1 opal-prd: HBRT: ERRL:>>ErrlManager::ErrlManager constructor.

Above shows us getting a pointer to the section so that seems valid. We have checks to avoid looking at null data, but clearly something is going wrong. This code is not in the op920/op910 branches so it hasn't been tested by that work, but it has been in master for 6+ months now running under PHYP with no issues.

I'd guess it's been broken for about 6 months now.

-- Stewart Smith OPAL Architect, IBM.

ghost commented 5 years ago

Any ETA on a fix? This is gating tagging op-build v2.2

youhour commented 5 years ago

I'd guess it's been broken for about 6 months now. … -- Stewart Smith OPAL Architect, IBM.

@stewart-ibm @shenki What test was done to uncover this problem? Could this problem be detected by autoIPL testing?

ghost commented 5 years ago

Michael Lim notifications@github.com writes:

I'd guess it's been broken for about 6 months now. … -- Stewart Smith OPAL Architect, IBM.

@stewart-ibm @shenki What test was done to uncover this problem? Could this problem be detected by autoIPL testing?

It's in the op-test host suite.

You just need to try and start opal-prd. On most OSs this is done on startup, so you just get to check the status of the opal-prd.service to see it failed.

-- Stewart Smith OPAL Architect, IBM.

mzipse commented 5 years ago

@stewart-ibm , I just confirmed with Stephanie that we are running Witherspoon DD2.3 systems from op-build master and she is also running a set of OCC tests, including OCC resets, which required opal-prd. Apparently it is running fine. Checking to see what level of OS and skiboot that she is using for her testing.

mzipse commented 5 years ago

She is running with skiboot 6.2.1 and on two different systems, she is running with the following Redhat versions: w51 has: Red Hat Enterprise Linux Server release 7.6 (Maipo) Linux w51L.aus.stglabs.ibm.com 4.14.0-115.2.2.el7a.ppc64le #1 SMP Mon Nov 5 17:28:23 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

wsbmc019 has: Red Hat Enterprise Linux Server release 7.5 (Maipo) Linux ws019os 4.14.0-49.el7a.ppc64le #1 SMP Wed Mar 14 13:58:40 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

Could this be an Ubuntu issue?

mzipse commented 5 years ago

She also mentioned that she is very intermittmently seeing: SW452077 IPL: OCC Reset Count is not set to 1 with BC8A2AD3 / BC802AD3 - PGPE_FAILURE on wsbmc010 with 920.1831.20181127a but that is likely not related as that would occur after opal-prd is already up and running.

debmc commented 5 years ago

which required opal-prd

My experience on this is that from a fresh boot opal-prd will be running, if opal-prd dies/stops/killed, you cannot restart it due to the seg fault.

dcrowell77 commented 5 years ago

That is an interesting observation, and it might help explain the distro dependency. There is a race condition that was observed between the PNOR driver in Linux and opal-prd. If opal-prd starts before the PNOR logic is ready, we deliberately force a crash in order to get restarted slightly later. I think that this race may have been specific to the init sequence for one of the distros (systemd vs initd maybe?). So what might be happening is that Ubuntu always hits this crash and then is susceptible to the restart bug, whereas Redhat installs start clean and we're fine unless we crash for some other reason.

ghost commented 5 years ago

There shouldn't be a race with any Linux driver for the PNOR, that's going to be up and running by the time opal-prd starts.

Okay, I also just confirmed that on ubuntu 18.04.2 we have the machine boot with opal-prd running but then if we systemctl stop opal-prd.service and then systemctl start opal-prd.service, it'll never run again, always crashing with:

<snip>
HBRT:   Initing module centaur_mba.prf...
HBRT: done.

HBRT: Modules initialized.

HBRT:  >> RsvdTraceBufService::initRsvdTraceBufService

HBRT:  >> RsvdTraceBufService::init

HBRT: TARG:>>hb_get_rt_rsvd_mem(0x5452414345425546, 0)
IMAGE: hservice_get_reserved_mem: ibm,hbrt-data, 0
IMAGE: hservice_get_reserved_mem: ibm,hbrt-data[0](0x0000201ffd550000) address 0x714aea980000
HBRT: TARG:>>hb_find_rsvd_mem_label(0x5452414345425546, 0x714aea980000)
HBRT: TARG:hb_find_rsvd_mem_label: Entry found at offset 0x000000000019E000, size 65536
HBRT: TARG:<<hb_find_rsvd_mem_label(0x5452414345425546, 65536) -> 0x0000714AEAB1E000
HBRT: TARG:<<hb_get_rt_rsvd_mem(0x5452414345425546, 0, 65536) -> 0x0000714AEAB1E000
HBRT:  >> RsvdTraceBufService::retrieveDataFromLastCrash

HBRT: ERRL:>>ErrlManager::ErrlManager constructor.
Segmentation fault

dcrowell77 commented 5 years ago

There shouldn't be a race with any Linux driver for the PNOR, that's going to be up and running by the time opal-prd starts.

We had a specific defect that was exactly because of pnor not being available. See SW423599 and https://bugzilla.linux.ibm.com/show_bug.cgi?id=165929

And some words from a smart guy I know...

==== State: Open by: sesmith on 02 April 2018 17:50:47 ==== This is 100% a distro bug. We need to wait for the mtd kernel module to have initialized and found the device.

ghost commented 5 years ago

Dan notifications@github.com writes:

There shouldn't be a race with any Linux driver for the PNOR, that's going to be up and running by the time opal-prd starts.

We had a specific defect that was exactly because of pnor not being available. See SW423599 and https://bugzilla.linux.ibm.com/show_bug.cgi?id=165929

And some words from a smart guy I know...

==== State: Open by: sesmith on 02 April 2018 17:50:47 ==== This is 100% a distro bug. We need to wait for the mtd kernel module to have initialized and found the device.

That guy sounds suspicious :)

Looks like a different bug this time though, the distro dependency problem is solved, as on first boot opal-prd is running happily. It's just that we can't restart it.

-- Stewart Smith OPAL Architect, IBM.

cvswen commented 5 years ago

In HBRT: >> RsvdTraceBufService::retrieveDataFromLastCrash we generate an error log. If I remove the code generating the error log HBRT init continues and we don't crash.

mzipse commented 5 years ago

So @stewart-ibm , knowing that a stop of opal-prd followed by a restart is what's causing the issue. I'd obviously like to continue to debug this but does that lower the severity of this problem such that we could tag a v2.2 of master? Not sure if we have any customer scenarios where opal-prd is stopped and restarted?

dcrowell77 commented 5 years ago

I think we're still in bad shape @mzipse because it is broken on some levels of some distros, specifically a newer version (18.10 was the original report, last comment says 18.04.2 works).

@cvswen is closing in on the problem, if it takes another day we have a hack in mind that could be used to disable some functionality but get rid of the crash.

mzipse commented 5 years ago

ok. Thanks @dcrowell77 and @cvswen !

dcrowell77 commented 5 years ago

Just to add a little more confusion, I just confirmed that we can reboot HBRT inside the PHYP adjunct without any problems, so something is environment specific in some way.

youhour commented 5 years ago

@dcrowell77 Could this problem occur on P8 like B&S system?

dcrowell77 commented 5 years ago

The function that is crashing is new in P9 (added last summer).

ghost commented 5 years ago

Maury Zipse notifications@github.com writes:

So @stewart-ibm , knowing that a stop of opal-prd followed by a restart is what's causing the issue. I'd obviously like to continue to debug this but does that lower the severity of this problem such that we could tag a v2.2 of master? Not sure if we have any customer scenarios where opal-prd is stopped and restarted?

You're probably right in that there isn't likely a customer scenario for it and I'd probably be okay with documenting it as a known issue (maybe we can fix in a 2.2.1).

it does mean that all our tests fail though... so it's a pretty big question mark on if any PRD functionality actually works.

-- Stewart Smith OPAL Architect, IBM.

dcrowell77 commented 5 years ago

FYI - An early HBRT crash (due to pnor issues again...) was just reported on B&S, to me that means this is not as rare of an occurrance as is being implied...

We figured out the bug, HBRT was mistakenly persisting some pointers across invocations. Since the reserved memory is remapped every time we start, those pointers are going to be invalid. The reason we haven't noticed this issue in PHYP is because their adjunct assigns the same addresses every time, as does our internal CXX testcases where we test a restart as well.

@cvswen is going to push up a change to disable this function so we can get a working version out there while we refactor this code.

dcrowell77 commented 5 years ago

Change to remove the failing call is here - https://github.com/open-power/hostboot/commit/80cea86add7ba742181cd272b16e10185b5e9a4d

open-power / hostboot

opal-prd segfaults on startup #155