Open geissonator opened 5 years ago
Curious. Not sure who's maintaining btbridge these days, I can have a look.
And according to the maintainers file, it's me!
This should help a little: https://gerrit.openbmc-project.xyz/c/openbmc/btbridge/+/21283
We've been working on a very intermittent issue internally for a while now where hostboot appears to get a timeout on an IPMI command to the BMC. One fix on the host side was to ensure that, as the host queues up IPMI messages to send to the BMC, it adjusts the timeout accordingly per message - https://github.com/open-power/hostboot/commit/ecf2201cee8cdd3e6eca7d56897fbdf108e59bf5
In most cases this works, because most messages to the BMC are synchronous, so having the BMC use a constant 5-second timeout for each message works fine. But we seem to have hit a corner case where an asynchronous message from the host is causing a timeout, which then affects another command in flight. Here's the trace of the failure, with extra debug enabled:
The trace seems to indicate that the 0xbf message timed out, which then caused the next command (seq 0xc0) to time out immediately. I don't know enough about the btbridge code, but I'd like to understand whether my assessment above is correct and, if so, whether we can ensure the timeout value for each command is not affected by the previous command.
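To make the question concrete, here is a minimal sketch of what "a timeout per command" could look like. This is not btbridge's actual code: the structure, the function names, the queue depth, and the idea of passing the current time in as a parameter are all assumptions for illustration. The only value taken from the discussion above is the constant 5-second timeout. Each in-flight message stores its own absolute deadline, so expiring 0xbf cannot shorten the time left for 0xc0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INFLIGHT 8       /* hypothetical queue depth */
#define MSG_TIMEOUT_MS 5000  /* the constant 5 s timeout discussed above */

/* One in-flight message; all names here are hypothetical. */
struct pending_msg {
    uint8_t seq;         /* IPMI sequence number, e.g. 0xbf */
    int64_t deadline_ms; /* absolute deadline for this message only */
    int in_use;
};

struct pending_queue {
    struct pending_msg slots[MAX_INFLIGHT];
};

/* Record a message with a deadline based on its own arrival time.
 * Returns 0 on success, -1 if the queue is full. */
int pending_add(struct pending_queue *q, uint8_t seq, int64_t now_ms)
{
    assert(q != NULL);
    for (size_t i = 0; i < MAX_INFLIGHT; i++) {
        if (!q->slots[i].in_use) {
            q->slots[i].seq = seq;
            q->slots[i].deadline_ms = now_ms + MSG_TIMEOUT_MS;
            q->slots[i].in_use = 1;
            return 0;
        }
    }
    return -1;
}

/* Expire only the messages whose own deadline has passed.
 * Returns the number of messages expired. */
int pending_expire(struct pending_queue *q, int64_t now_ms)
{
    assert(q != NULL);
    int expired = 0;
    for (size_t i = 0; i < MAX_INFLIGHT; i++) {
        if (q->slots[i].in_use && q->slots[i].deadline_ms <= now_ms) {
            q->slots[i].in_use = 0;
            expired++;
        }
    }
    return expired;
}

/* Milliseconds until the earliest remaining deadline (0 if already due),
 * or -1 if nothing is pending. Suitable for re-arming a single timer. */
int64_t pending_next_timeout(const struct pending_queue *q, int64_t now_ms)
{
    assert(q != NULL);
    int64_t best = -1;
    for (size_t i = 0; i < MAX_INFLIGHT; i++) {
        if (!q->slots[i].in_use)
            continue;
        int64_t left = q->slots[i].deadline_ms - now_ms;
        if (left < 0)
            left = 0;
        if (best < 0 || left < best)
            best = left;
    }
    return best;
}

/* Returns 1 if a message with this sequence number is still pending. */
int pending_is_waiting(const struct pending_queue *q, uint8_t seq)
{
    assert(q != NULL);
    for (size_t i = 0; i < MAX_INFLIGHT; i++) {
        if (q->slots[i].in_use && q->slots[i].seq == seq)
            return 1;
    }
    return 0;
}
```

With this layout, if 0xbf arrives at t=0 and 0xc0 at t=4000 ms, a call to `pending_expire` at t=5000 ms drops only 0xbf, and `pending_next_timeout` then reports that 0xc0 still has 4000 ms left, rather than inheriting an already-expired timer.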
We're still trying to get to the root cause of why 0xbf timed out (it was a watchdog reset command). But its timeout causing 0xc0 (hiomap) to fail is what caused the host to fail to boot.
The host has no retries built in; we're wondering if that is something we should get added to that code as well.