open-power / hostboot

System initialization firmware for Power systems
Apache License 2.0

All PRD syslog entries aren't understandable by humans #64

Closed ghost closed 7 years ago

ghost commented 8 years ago

I use "humans", plural, in the title because I believe there's only one who could make sense of this kind of log message.

I think the following decodes to "Processor Runtime Diagnostics detected hardware failure", but it should be a lot more obvious to me (and to an end user) what on earth this means.

Sep  3 00:18:25 YC01UNOS opal-prd: SCOM: read: chip 0x80000001, addr 0x201140c, val 0xc8c0001000000000
Sep  3 00:18:25 YC01UNOS opal-prd: SCOM: read: chip 0x0, addr 0x20118c0, val 0xc00000000000
Sep  3 00:18:25 YC01UNOS opal-prd: SCOM: read: chip 0x0, addr 0x20118c3, val 0x79163401a47d3c00
Sep  3 00:18:25 YC01UNOS opal-prd: SCOM: read: chip 0x0, addr 0x20118c6, val 0x9c00000000000
Sep  3 00:18:25 YC01UNOS opal-prd: SCOM: read: chip 0x0, addr 0x20118c7, val 0x8ee00a9018800000
Sep  3 00:18:25 YC01UNOS opal-prd: SCOM: read: chip 0x0, addr 0x20118c8, val 0xc00000000000
Sep  3 00:18:25 YC01UNOS opal-prd: SCOM: read: chip 0x0, addr 0x201189e, val 0x0
Sep  3 00:18:25 YC01UNOS opal-prd: SCOM: read: chip 0x0, addr 0x20118ce, val 0x0
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: PRDF:I>PRD Signature 00040001 FBF70000
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: PRDF:I>PRD Signature 00040001 FBF70000
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: PRDF:I>PRD Signature 00040001 FBF70001
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: PRDF:I>PRD Signature 00040001 FBF70004
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: PRDF:I>PRD Signature 00040001 FBF70008
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: PRDF:I>PRD Signature 00040001 FBF70009
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: PRDF:I>PRD Signature 00040001 FBF7001B
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: PRDF:I>PRD Signature 00040001 FBF7DD81
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: PRDF:I>PRD Signature 00040001 FBF70000
Sep  3 00:18:25 YC01UNOS opal-prd: SCOM: write: chip 0x0, addr 0x20118c1, val 0xfff63fffffffffff
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: ERRL:>>addHwCallout(0x00040001 0x1 0x0 0x0)
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: TARG:[TARG] E> Number of Parent chip is not 1, but 0
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: ERRL:E>ErrlEntry::collectTrace(): getBuffer(prdf) rets zero.
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: ERRL:commitErrLog() called by E500 for plid=0x8900082D,Reasoncode=E504
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: ERRL:I>Send an error log to hypervisor to commit. plid=0x8900082D
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: ERRL:>>saveErrLogToPnor eid=8900082d
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: ERRL:I>saveErrLogToPnor: INFORMATIONAL/RECOVERED log, skipping
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: ERRL:<<saveErrLogToPnor returning true
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: ERRL:I>Send msg to BMC for errlogId [0x8900082d]
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: ERRL:>>sendErrLogToBmc errlogId 0x8900082d, i_sendSels 1
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: ERRL:I>sendErrLogToBmc: 8900082D is INFORMATIONAL/RECOVERED; skipping
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: ERRL:<<sendErrLogToBmc
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: ERRL:<<sendToHypervisor()
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: PRDF:I>[ErrDataService::GenerateSrcPfa] PRD called to analyze an error: 0x00040001 0xfbf70000
Sep  3 00:18:25 YC01UNOS opal-prd: HBRT: PRDF:<<PRDF::main()
Sep  3 00:18:25 YC01UNOS opal-prd: SCOM: read: chip 0x0, addr 0x2000001, val 0x198000000000000
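As a rough illustration of how an excerpt like the one above could be condensed into something easier to scan, here is a minimal Python sketch. The regular expressions are inferred only from the lines shown in this issue, so other opal-prd versions may format their output differently, and the `summarize` helper is a hypothetical name, not part of opal-prd or hostboot.

```python
import re
from collections import OrderedDict

# Patterns inferred only from the log excerpt in this issue (assumption):
# other opal-prd / HBRT versions may print these lines differently.
SCOM_RE = re.compile(
    r"opal-prd: SCOM: (?P<op>read|write): chip (?P<chip>0x[0-9a-fA-F]+), "
    r"addr (?P<addr>0x[0-9a-fA-F]+), val (?P<val>0x[0-9a-fA-F]+)"
)
SIG_RE = re.compile(
    r"PRD Signature (?P<chip>[0-9A-Fa-f]{8}) (?P<sig>[0-9A-Fa-f]{8})"
)
PLID_RE = re.compile(r"plid=(?P<plid>0x[0-9A-Fa-f]+)")


def summarize(lines):
    """Collect SCOM accesses, unique PRD signatures, and PLIDs from log lines.

    Hypothetical helper: returns (scoms, signatures, plids), where duplicates
    of signatures and PLIDs are dropped but first-seen order is kept.
    """
    scoms = []
    sigs = OrderedDict()
    plids = OrderedDict()
    for line in lines:
        m = SCOM_RE.search(line)
        if m:
            scoms.append(
                (m["op"], int(m["chip"], 16), int(m["addr"], 16), int(m["val"], 16))
            )
            continue
        m = SIG_RE.search(line)
        if m:
            sigs[(m["chip"].upper(), m["sig"].upper())] = None
            continue
        m = PLID_RE.search(line)
        if m:
            plids[m["plid"].lower()] = None
    return scoms, list(sigs), list(plids)
```

Feeding the excerpt through `summarize` would collapse the repeated `PRD Signature` lines into a short list of unique signatures and show that all of the ERRL lines refer to a single PLID. It does not explain what any signature means; that mapping is not available from the log itself.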

dcrowell77 commented 8 years ago

The average end user really has no need to decode the specific FIR bits from a PRD log. Those messages will always produce a SEL pointing to bad hardware (or whatever) to address the issue. I don't think the structure of the PRD code lends itself easily to including the FIR bit decodes in human-readable form. Remember also that many FIR bit descriptions won't mean anything to the average user.

However, I can see the benefit, so assigning to @zane131 to look into a potential improvement.
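To make concrete what "FIR bit decodes" would involve, here is a small Python sketch that lists the set bits of a 64-bit register value. It assumes the Power convention of numbering bits from the most significant end (bit 0 is the MSB). The `names` table is purely hypothetical: the real meaning of each bit is defined per register and is not reproduced in this thread, and nothing in the log says which of the SCOM addresses are FIRs.

```python
def set_bits(value, width=64):
    """Return positions of set bits, numbered IBM-style (bit 0 = MSB)."""
    return [i for i in range(width) if (value >> (width - 1 - i)) & 1]


def describe(value, names, width=64):
    """Pair each set bit with a description from `names`.

    `names` is a hypothetical {bit_position: text} table supplied by the
    caller; bits without an entry are reported as 'unknown'.
    """
    return [f"bit {b}: {names.get(b, 'unknown')}" for b in set_bits(value, width)]
```

For the first value in the log above, `set_bits(0xc8c0001000000000)` gives `[0, 1, 4, 8, 9, 27]`. Turning those positions into text is the hard part, since it needs a description table for every register that PRD reads.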

zane131 commented 7 years ago

@stewart-ibm Dan's statement is correct. It is just not feasible or practical at this point. All of the human-readable information will be available in the eSELs.