Targeting process aborts on Talos II systems

madscientist159 commented 5 years ago

On Talos II systems (similar to Romulus), the targeting process aborts during IPL:

  4.24373|TARG|[TARG] >> TARGETING::TargetService::init()
  4.24373|TARG|[TARG] I> Max Nodes to initialize is [1]
  4.24375|TARG|[TARG] >> TARGETING::TargetService::_configureTargetPool(...)
  4.27050|TARG|[TARG] I> Max targets = 297
  4.27051|TARG|[TARG] I> i_nodeContainer.targets = 0x10002eebc
  4.27052|TARG|[TARG] << TARGETING::TargetService::_configureTargetPool(...)
  4.27310|TARG|[TARG] I> TargetService::init: Pushing info for node 0 with first target huid 0x00020000 and 297 total targets
  4.28558|TARG|[TARG] >> TARGETING::TargetService::_setTopLevelTarget()
  4.29866|ERRL|E>Error created : PLID=90000002, RC=0505, Mod=01, Userdata=0000000000000026 0000000000000001, Sev=CRITICAL_SYS_TERM
  4.29868|ERRL|>>addProcedureCallout(0x55, 0x6)
  4.29869|INITSVC|ERROR: task libtargeting.so returned errlog=0x361e98
  4.29870|INITSVC|ExtInitSvc: ERROR: return to initsvc with errlog 0x361e98
  4.29873|INITSVC|ERROR: dispatching task, errorlog=0x0x361e98
  4.29874|INITSVC|InitService: Committing errorlog 0x361e98

This regression appears to date from late 2018.

dcrowell77 commented 5 years ago

0505 is a task crash, the error log should contain the printk output if you can't get it directly from memory (not sure what debug environment you have available). Just to confirm, you are using HBB/HBI/HBD from the same build, right? There are issues if you try to mix any of these.

dcrowell77 commented 5 years ago

This regression appears to date from late 2018.

What exactly does this comment mean?

madscientist159 commented 5 years ago

This regression appears to date from late 2018.

What exactly does this comment mean?

It means that the fault originated in late 2018, and persisted to the current master version. After bisect, https://github.com/open-power/hostboot/commit/63a9aa53dcf3c7ddcff756e694ddcff246b5c9f2 is the culprit. Working on a fix.

madscientist159 commented 5 years ago

@dcrowell77 4288e39 was a temporary breakage that interfered with bisect, the actual culprit was 63a9aa5.

As far as I can tell there is still no open errl utility (https://github.com/open-power/hostboot/issues/174), so the stack trace in HBEL is relatively useless to us.

dcrowell77 commented 5 years ago

The stack trace is manually parseable from the hbicore.list.bz2. Since you are crashing the backtrace is in the printk (as ascii text in the error log) so you should be able to get all the numbers. Even with a esel parser, you won't get more than that for this type of error.

I'm very confused by how 63a9aa5 could be causing the issues you are seeing unless you are trying to apply different levels of HB code on top of an older HBD partition. There would definitely be a dependency caused by this change between the contents of HBD and the HBB/HBI code. Be sure to do a machine-xml-rebuild while you are narrowing things down.

madscientist159 commented 5 years ago

@dcrowell77 Yes, I'm doing a full build inside op-build each time. I've also spent a bit of time earlier today piecing together a hack-around not having errl, which I've started some initial documentation on at https://wiki.raptorcs.com/wiki/Hostboot_Debug_Howto.

The failure in master is in:

#0: TARGETING::Target::_getAttrPtr(TARGETING::ATTRIBUTE_ID, void*&) const
#1: TARGETING::Target::_tryGetAttr(TARGETING::ATTRIBUTE_ID, unsigned int, void*) const
#2: TARGETING::PredicateAttrVal<(TARGETING::ATTRIBUTE_ID)259701628>::operator()(TARGETING::Target const*) const
#3: RangeFilter<TARGETING::_TargetIterator<TARGETING::Target*> >::advance()
#4: TARGETING::RangeFilter<TARGETING::_TargetIterator<TARGETING::Target*> >::advanceIfNoMatch()
#5: TARGETING::TargetService::toTarget(TARGETING::EntityPath const&) const
#6: TARGETING::TargetService::_setTopLevelTarget()

I'm trying to get more detailed information, but it does seem to be the same crash that 63a9aa5 introduces. It could very well be a misbuilt table, e.g. if something is messing up in the Python xmltohb code, but I'm not sure yet.

madscientist159 commented 5 years ago

While the patch in https://github.com/open-power/hostboot/pull/175 allows the system to not immediately crash during early IPL, we still run into a crash later on that looks to be attribute related again:

 51.28417|ISTEP  8. 7 - host_p9_eff_config_links
 51.31473|ISTEPS_TRACE|>>call_host_p9_fbc_eff_config_links entry
 51.32436|ISTEPS_TRACE|>>fapiHWPCallWrapper (p9_fbc_eff_config_links) entry
 51.32488|ISTEPS_TRACE|Running p9_fbc_eff_config_links HWP on target HUID 00050000
 51.32829|FAPI|p9_fbc_smp_utils.C: (PROC): Start
 51.34203|TARG|[TARG] >> TARGETING::getPeerTargets
 51.34213|ERRL|E>Error created : PLID=90000043, RC=0506, Mod=01, Userdata=00000000000000D2 0000000000000001, Sev=CRITICAL_SYS_TERM
 51.34215|ERRL|>>addProcedureCallout(0x55, 0x6)
 51.34219|INITSVC|ERROR : task_wait_tid(0xd2). 'host_p9_eff_config_links', l_tidretrc=0xd2, l_childsts=0x1
 51.34730|INITSVC|E>doIstep: Istep failed, plid 0x90000043
 51.34732|ERRL|I>Got an error log Msg - Type: 0x00000039
 51.34733|ERRL|I>Flush message received
 51.34735|INITSVC|<<doIstep: step 8, substep 7

Has hostboot been fully built and tested on a Romulus in the past ~5 months or so?

dcrowell77 commented 5 years ago

Has hostboot been fully built and tested on a Romulus in the past ~5 months or so?

Yes, we build and run on a Romulus more or less every day.

51.34203|TARG|[TARG] >> TARGETING::getPeerTargets

Having that trace right before the crash makes me suspect we've got a null target situation maybe.

Re: your debug wiki - a couple comments:

The hbicore.list.bz2 is more useful than just the symbol table, it will get you to the exact line of code.
You shouldn't need to get the error logs from HBEL. We send them to the BMC. If you are using openbmc then you can fetch them with: curl -c cjar -b cjar -k -H "Content-Type: application/json" -X GET https://$BMC_IP/xyz/openbmc_project/logging/enumerate

madscientist159 commented 5 years ago

@dcrowell77 Thanks for the info on the BMC and the hbicore.list.bz2 file! Before we duplicate effort, is there an existing tool that matches address to line number without errl, or is this something that we should go ahead and write a tool to do?

WIth regard to this particular crash, here's the final analysis (we now have a Talos II booting on the latest hostboot):

This particular crash appears to have been triggered by trying to run the latest PNOR images using the old SBE.
The (incorrect) patch referenced earlier corrupted the property data enough to avoid the early crash and allow an SBE update. The corrupted property data then caused a crash later on, but the updated SBE was then able to IPL the original, uncorrupted PNOR without a crash.
I suspect this just moved the problem around, but we did note a buried warning during XML parsing. Removing the offending properties in https://git.raptorcs.com/git/talos-xml/commit/?id=217af3fa9f3c541bd995170ddeb26b31a49bc493 seems to allow the new PNOR to IPL without crashing even with the old SBE.

We'll need to test upgrading a variety of machines from the older SBE version on this side, as there is significant concern of machines in the wild not being upgradeable without some way of forcibly writing a new SBE to the processor modules.

dcrowell77 commented 5 years ago

is there an existing tool that matches address to line number

https://github.com/open-power/hostboot/blob/master/src/build/debug/Hostboot/Ps.pm is what we use to decode the stack traces for all of our tasks if we can get at all of memory. That does lookups in the listing file and decodes it to source lines. We've never had time to adapt the logic to run against the error log contents from printk. https://github.com/open-power/hostboot/blob/master/src/usr/errl/backtrace.C has similar logic that is used when decoding a backtrace from an error log. All the pieces should be there for you between those.

dcrowell77 commented 5 years ago

This particular crash appears to have been triggered by trying to run the latest PNOR images using the old SBE.

This is something we try to avoid, but we've definitely had issues before if something slips through. We do realize that if you can't get into Hostboot to do the update that you are quite stuck...

I'm not quite seeing how the PEER_TARGET piece fits in yet though. Do you know what level the SBE was in the fail case?

madscientist159 commented 5 years ago

@dcrowell77 Previous SBE would have been https://github.com/open-power/sbe/commit/3c633a6c2cd626dafef63f40bfd611016b05ecc7 . I honestly don't know why the hostboot changes would have caused the SBE / HBBL to halt, but as long as we don't see it when the new production firmware files are ready for release (and can therefore update the SBE to a more modern version in situ) it's probably not worth tracking down. The old SBE in question has very limited debugging capability compared to the new one.

dcrowell77 commented 5 years ago

@rajadas2 - Please help take a look at the SBE differences.

madscientist159 commented 5 years ago

@dcrowell77 I was just informed today that at least some of the CPUs we have in inventory have an SBE version preloaded from the factory that is failing to IPL with our latest firmware. This is a serious problem; we have no way to feasibly mass-update the SBE version on the CPUs in inventory.

Was there any progress on finding the fault?

EDIT: If it helps I can give date codes for the working & non-working CPUs; I assume there is some way to associate that with the exact factory SBE revision?

EDIT2: We just found out that the date code on the box has no relation to the SBE version inside the box. Two boxes with the same datecode had wildly varying SBE versions inside, so it looks like this would have to be traced via processor serial number.

dcrowell77 commented 5 years ago

Can you post the console output from a failing boot? That should give us the SBE and Hostboot code levels that aren't working well together.

madscientist159 commented 5 years ago

@dcrowell77 The only output is the hostboot initialization w/ version, then "Error created : PLID=90000002". Each time the system reboots the PLID increments by two, but my understanding is this is related to the error log count being incremented. The SBE versions that cause the problem are not configured from the factory with LPC debug enabled; I suspect they predate SBE console support entirely.

dcrowell77 commented 5 years ago

I'd like to see all the trace output up to the fail. There might be something hiding in there that triggers a thought. Can you save it off to a file and attach it?

open-power / hostboot

Targeting process aborts on Talos II systems #173