open-power / hostboot

System initialization firmware for Power systems
Apache License 2.0

"esel" tool not available #174

Open madscientist159 opened 5 years ago

madscientist159 commented 5 years ago

For the past couple of years, debugging hostboot faults has been made unnecessarily hard for OEMs because the errl tool is not available. This omission gives OEMs two choices:

1. Revert to "shotgun debugging" (guess, modify code, insert debug printf()s, rebuild, test, repeat), which is very slow and expensive.
2. Put IBM engineers in the critical path for debugging crashes, which is again relatively slow.

We need some way of analysing HBEL dumps to get origin source line numbers.

dcrowell77 commented 5 years ago

The most important parts of the error log (failing module, return code) are output to the console. Note that these are deliberately not line numbers, which makes them mostly build-independent. So for the vast majority of failures, that gets you to the exact point of the failure without any extra tooling, just the SOL console.

The big exception here is crashes (i.e. segfaults), and those are problematic even with the full esels. The printk output is part of the log, and since it is plain ASCII it should be easy to pick out even in a raw (unparsed) log from the BMC. With the printk and the build artifacts you can usually walk the backtrace of the failure. Even internally we don't have any data beyond that for Hostboot crashes.
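As an illustration of pulling the plain-ASCII printk text out of a raw log, here is a minimal Python sketch. It assumes nothing about the eSEL record layout: it simply scans a binary blob for runs of printable ASCII, much like the Unix `strings` utility. The function name `extract_ascii_runs` and the sample bytes are hypothetical and are not part of hostboot or errl.

```python
import re
from typing import List


def extract_ascii_runs(data: bytes, min_len: int = 8) -> List[str]:
    """Return every run of printable ASCII (plus tabs) at least min_len bytes long.

    This is a generic 'strings'-style scan, not a parser for the real
    error log format.
    """
    pattern = re.compile(
        rb"[\x20-\x7e\t]{" + str(min_len).encode("ascii") + rb",}"
    )
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Hypothetical sample: binary padding around one readable line.
sample = b"\x00\x01Backtrace line one\xff\x00ab\x00"
for line in extract_ascii_runs(sample):
    print(line)
```

To use it on a real dump, read the file in binary mode (`open(path, "rb").read()`) and pass the bytes in. Lowering `min_len` catches shorter fragments at the cost of more noise.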

However, your point is valid that having the error log parser would be helpful. There is a project out there to externalize that: https://github.com/open-power/errl . Unfortunately the person behind this work left us a while back, so I think the momentum may have slowed a bit... I'll try to figure out who has the ball now to get this fully integrated into op-build.

madscientist159 commented 5 years ago

Understood. As you mentioned, this is mainly useful in the context of crashes, which I agree with -- we only really needed this tool when part of hostboot was crashing. I've started some initial documentation on how to parse the records without errl here: https://wiki.raptorcs.com/wiki/Hostboot_Debug_Howto but as you can see it's a labor-intensive process, and we're throwing away a lot of data that may or may not be incidentally helpful in the process.

dcrowell77 commented 5 years ago

The esel/errl parser doesn't provide a huge amount of value for crashes. You'll get things in a slightly more readable format, but the only useful content is pretty much the printk with the backtrace that you have to manually decode. That is what we do internally as well.
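The manual decode step amounts to mapping each backtrace address to the nearest preceding symbol from the build artifacts. A minimal Python sketch of that lookup follows. It assumes you have already produced a list of `(start_address, name)` pairs sorted by address. The symbol names, addresses, and the `resolve` helper are hypothetical, and the actual format of hostboot's symbol files is not assumed here.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

Symbol = Tuple[int, str]  # (start address, symbol name)


def resolve(addr: int, symbols: List[Symbol]) -> Optional[str]:
    """Map addr to 'name+0xoffset' using the nearest symbol at or below it.

    symbols must be sorted by start address. Returns None if addr lies
    below the first symbol. Symbol sizes are not known here, so an
    address past the end of the last function still resolves to it.
    """
    starts = [start for start, _ in symbols]
    i = bisect_right(starts, addr) - 1
    if i < 0:
        return None
    base, name = symbols[i]
    return f"{name}+0x{addr - base:x}"


# Hypothetical symbol table and backtrace addresses.
symbols = [(0x1000, "func_a"), (0x1200, "func_b"), (0x2000, "func_c")]
for addr in (0x1234, 0x2010):
    print(hex(addr), resolve(addr, symbols))
```

Binary search keeps each lookup cheap even for a large symbol table, but the result is only as good as the match between the symbol table and the exact build that crashed.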

dcrowell77 commented 5 years ago

@sampmisr is now driving the errl work.

madscientist159 commented 5 years ago

> You'll get things in a slightly more readable format, but the only useful content is pretty much the printk with the backtrace that you have to manually decode. That is what we do internally as well.

Good to know. We might work on tooling to make this process easier.

artemsen commented 4 years ago

We had the same problem, so we wrote our own errl-like utility to decode HBEL: https://github.com/YADRO-KNS/openpower-esel-parser