Open madscientist159 opened 5 years ago
0505 is a task crash, the error log should contain the printk output if you can't get it directly from memory (not sure what debug environment you have available). Just to confirm, you are using HBB/HBI/HBD from the same build, right? There are issues if you try to mix any of these.
This regression appears to date from late 2018.
What exactly does this comment mean?
This regression appears to date from late 2018.
What exactly does this comment mean?
It means that the fault originated in late 2018, and persisted to the current master version. After bisect, https://github.com/open-power/hostboot/commit/63a9aa53dcf3c7ddcff756e694ddcff246b5c9f2 is the culprit. Working on a fix.
@dcrowell77 4288e39 was a temporary breakage that interfered with bisect, the actual culprit was 63a9aa5.
As far as I can tell there is still no open errl utility (https://github.com/open-power/hostboot/issues/174), so the stack trace in HBEL is relatively useless to us.
The stack trace is manually parseable from the hbicore.list.bz2. Since you are crashing the backtrace is in the printk (as ascii text in the error log) so you should be able to get all the numbers. Even with a esel parser, you won't get more than that for this type of error.
I'm very confused by how 63a9aa5 could be causing the issues you are seeing unless you are trying to apply different levels of HB code on top of an older HBD partition. There would definitely be a dependency caused by this change between the contents of HBD and the HBB/HBI code. Be sure to do a machine-xml-rebuild while you are narrowing things down.
@dcrowell77 Yes, I'm doing a full build inside op-build each time. I've also spent a bit of time earlier today piecing together a hack-around not having errl, which I've started some initial documentation on at https://wiki.raptorcs.com/wiki/Hostboot_Debug_Howto.
The failure in master is in:
#0: TARGETING::Target::_getAttrPtr(TARGETING::ATTRIBUTE_ID, void*&) const
#1: TARGETING::Target::_tryGetAttr(TARGETING::ATTRIBUTE_ID, unsigned int, void*) const
#2: TARGETING::PredicateAttrVal<(TARGETING::ATTRIBUTE_ID)259701628>::operator()(TARGETING::Target const*) const
#3: RangeFilter<TARGETING::_TargetIterator<TARGETING::Target*> >::advance()
#4: TARGETING::RangeFilter<TARGETING::_TargetIterator<TARGETING::Target*> >::advanceIfNoMatch()
#5: TARGETING::TargetService::toTarget(TARGETING::EntityPath const&) const
#6: TARGETING::TargetService::_setTopLevelTarget()
I'm trying to get more detailed information, but it does seem to be the same crash that 63a9aa5 introduces. It could very well be a misbuilt table, e.g. if something is messing up in the Python xmltohb code, but I'm not sure yet.
While the patch in https://github.com/open-power/hostboot/pull/175 allows the system to not immediately crash during early IPL, we still run into a crash later on that looks to be attribute related again:
51.28417|ISTEP 8. 7 - host_p9_eff_config_links
51.31473|ISTEPS_TRACE|>>call_host_p9_fbc_eff_config_links entry
51.32436|ISTEPS_TRACE|>>fapiHWPCallWrapper (p9_fbc_eff_config_links) entry
51.32488|ISTEPS_TRACE|Running p9_fbc_eff_config_links HWP on target HUID 00050000
51.32829|FAPI|p9_fbc_smp_utils.C: (PROC): Start
51.34203|TARG|[TARG] >> TARGETING::getPeerTargets
51.34213|ERRL|E>Error created : PLID=90000043, RC=0506, Mod=01, Userdata=00000000000000D2 0000000000000001, Sev=CRITICAL_SYS_TERM
51.34215|ERRL|>>addProcedureCallout(0x55, 0x6)
51.34219|INITSVC|ERROR : task_wait_tid(0xd2). 'host_p9_eff_config_links', l_tidretrc=0xd2, l_childsts=0x1
51.34730|INITSVC|E>doIstep: Istep failed, plid 0x90000043
51.34732|ERRL|I>Got an error log Msg - Type: 0x00000039
51.34733|ERRL|I>Flush message received
51.34735|INITSVC|<<doIstep: step 8, substep 7
Has hostboot been fully built and tested on a Romulus in the past ~5 months or so?
Has hostboot been fully built and tested on a Romulus in the past ~5 months or so?
Yes, we build and run on a Romulus more or less every day.
51.34203|TARG|[TARG] >> TARGETING::getPeerTargets
Having that trace right before the crash makes me suspect we've got a null target situation maybe.
Re: your debug wiki - a couple comments:
@dcrowell77 Thanks for the info on the BMC and the hbicore.list.bz2 file! Before we duplicate effort, is there an existing tool that matches address to line number without errl, or is this something that we should go ahead and write a tool to do?
WIth regard to this particular crash, here's the final analysis (we now have a Talos II booting on the latest hostboot):
We'll need to test upgrading a variety of machines from the older SBE version on this side, as there is significant concern of machines in the wild not being upgradeable without some way of forcibly writing a new SBE to the processor modules.
is there an existing tool that matches address to line number
https://github.com/open-power/hostboot/blob/master/src/build/debug/Hostboot/Ps.pm is what we use to decode the stack traces for all of our tasks if we can get at all of memory. That does lookups in the listing file and decodes it to source lines. We've never had time to adapt the logic to run against the error log contents from printk. https://github.com/open-power/hostboot/blob/master/src/usr/errl/backtrace.C has similar logic that is used when decoding a backtrace from an error log. All the pieces should be there for you between those.
This particular crash appears to have been triggered by trying to run the latest PNOR images using the old SBE.
This is something we try to avoid, but we've definitely had issues before if something slips through. We do realize that if you can't get into Hostboot to do the update that you are quite stuck...
I'm not quite seeing how the PEER_TARGET piece fits in yet though. Do you know what level the SBE was in the fail case?
@dcrowell77 Previous SBE would have been https://github.com/open-power/sbe/commit/3c633a6c2cd626dafef63f40bfd611016b05ecc7 . I honestly don't know why the hostboot changes would have caused the SBE / HBBL to halt, but as long as we don't see it when the new production firmware files are ready for release (and can therefore update the SBE to a more modern version in situ) it's probably not worth tracking down. The old SBE in question has very limited debugging capability compared to the new one.
@rajadas2 - Please help take a look at the SBE differences.
@dcrowell77 I was just informed today that at least some of the CPUs we have in inventory have an SBE version preloaded from the factory that is failing to IPL with our latest firmware. This is a serious problem; we have no way to feasibly mass-update the SBE version on the CPUs in inventory.
Was there any progress on finding the fault?
EDIT: If it helps I can give date codes for the working & non-working CPUs; I assume there is some way to associate that with the exact factory SBE revision?
EDIT2: We just found out that the date code on the box has no relation to the SBE version inside the box. Two boxes with the same datecode had wildly varying SBE versions inside, so it looks like this would have to be traced via processor serial number.
Can you post the console output from a failing boot? That should give us the SBE and Hostboot code levels that aren't working well together.
@dcrowell77 The only output is the hostboot initialization w/ version, then "Error created : PLID=90000002". Each time the system reboots the PLID increments by two, but my understanding is this is related to the error log count being incremented. The SBE versions that cause the problem are not configured from the factory with LPC debug enabled; I suspect they predate SBE console support entirely.
I'd like to see all the trace output up to the fail. There might be something hiding in there that triggers a thought. Can you save it off to a file and attach it?
On Talos II systems (similar to Romulus), the targeting process aborts during IPL:
This regression appears to date from late 2018.