openbmc / btbridge

Apache License 2.0
6 stars 9 forks source link

timeout does not take into account queue size #14

Open geissonator opened 5 years ago

geissonator commented 5 years ago

We've been working a very intermittent issues internally for a while now where hostboot appears to get a timeout from an IPMI command to the BMC. One fix from the host was to ensure as they queue up IPMI messages to send to the BMC, they adjust the timeout accordingly per message - https://github.com/open-power/hostboot/commit/ecf2201cee8cdd3e6eca7d56897fbdf108e59bf5

In most cases, this works because most messages are synchronous to the BMC so having the BMC use a constant 5 second timeout for each message works fine. But, we seem to have hit a corner case where an asynchronous message from the host is causing a timeout, which then affects another command in flight. Here's the trace with extra debug enabled of the failure:

Apr 24 12:58:23 witherspoon btbridged[30288]: [BTBRIDGED 1556110700.790817149] Completed request with seq 0xbe, netfn 0x07, lun 0x00, cmd 0x06, cc 0x00
Apr 24 12:58:23 witherspoon btbridged[30288]: [BTBRIDGED 1556110700.793924606] Sending dbus signal with seq 0xbf, netfn 0x06, lun 0x00, cmd 0x24
Apr 24 12:58:24 witherspoon btbridged[30288]: [BTBRIDGED 1556110704.899868679] Sending dbus signal with seq 0xc0, netfn 0x3a, lun 0x00, cmd 0x5a
Apr 24 12:58:25 witherspoon btbridged[30288]: [BTBRIDGED 1556110705.927016825] Timeout on msg with seq: 0xbf
Apr 24 12:58:25 witherspoon btbridged[30288]: [BTBRIDGED 1556110705.927279813] Adjusting timer for next element
Apr 24 12:58:25 witherspoon btbridged[30288]: [BTBRIDGED 1556110705.927460055] Successfully wrote 5 of 5 bytes to /dev/ipmi-bt-host
Apr 24 12:58:25 witherspoon btbridged[30288]: [BTBRIDGED 1556110705.927600822] Turning off POLLOUT for the BT in poll()
Apr 24 12:58:25 witherspoon btbridged[30288]: [BTBRIDGED 1556110705.927708377] Completed request with seq 0xbf, netfn 0x07, lun 0x00, cmd 0x24, cc 0xce
Apr 24 12:58:25 witherspoon btbridged[30288]: [BTBRIDGED 1556110705.927831285] Timeout on msg with seq: 0xc0
Apr 24 12:58:25 witherspoon btbridged[30288]: [BTBRIDGED 1556110705.928012375] Disabling timer, no elements remain in queue
Apr 24 12:58:25 witherspoon btbridged[30288]: [BTBRIDGED 1556110705.928195404] Successfully wrote 5 of 5 bytes to /dev/ipmi-bt-host
Apr 24 12:58:25 witherspoon btbridged[30288]: [BTBRIDGED 1556110705.928335565] Turning off POLLOUT for the BT in poll()
Apr 24 12:58:25 witherspoon btbridged[30288]: [BTBRIDGED 1556110705.928442110] Completed request with seq 0xc0, netfn 0x3b, lun 0x00, cmd 0x5a, cc 0xce
Apr 24 12:58:26 witherspoon btbridged[30288]: [BTBRIDGED 1556110705.950386593] Processed 1 dbus events
Apr 24 12:58:26 witherspoon btbridged[30288]: [BTBRIDGED 1556110706.013493351] Processed 1 dbus events

The trace seems to indicate that the 0xbf msg timed out, which then caused the next command to immediately time out (seq 0xc0). I don't know enough about btbridge code but would like to understand if my assessment above is correct and if so, can we ensure the timeout value for each command is not affected by the previous command?

We're still trying to get to root cause on why 0xbf timed out (it was a watchdog reset command). But it causing the 0xc0 to fail (hiomap), caused the host to fail to boot.

The host has no retries built in, we're wondering if that is something we should get added in that code as well.

amboar commented 5 years ago

Curious. Not sure who's maintaining btbridge these days, I can have a look.

amboar commented 5 years ago

And according to the maintainers file, it's me!

amboar commented 5 years ago

This should help a little: https://gerrit.openbmc-project.xyz/c/openbmc/btbridge/+/21283