Open geissonator opened 2 years ago
Unless someone can pinpoint the bug causing the intermittent file size 0 issue, I think our best bet at this point is to at least gracefully recover from the error.
So either we should add an "else if" at https://github.com/openbmc/phosphor-host-ipmid/blob/master/user_channel/channel_mgmt.cpp#L1111 that confirms the returned "data" is non-zero in size (and deletes the file and returns -EIO if it is invalid), or we should add code in the exception clauses to delete the invalid file. It may be best to do both.
In summary, if the file is 0 in size or parsing it throws an exception, delete the file and throw the exception.
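For illustration, here is a rough sketch of the "do both" recovery described above. It is not the actual channel_mgmt.cpp code: `loadOrDiscard` is a hypothetical helper, and the `parse` callback is an assumption standing in for the real JSON parser (anything that throws on bad input). To keep it short, the sketch returns -EIO in both failure cases instead of rethrowing.

```cpp
#include <cassert>
#include <cerrno>
#include <exception>
#include <filesystem>
#include <fstream>
#include <functional>
#include <iterator>
#include <stdexcept>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper, not part of phosphor-host-ipmid.
// `parse` stands in for the real JSON parser and is expected to throw on
// malformed input. Returns 0 on success, -ENOENT if the file is missing,
// and -EIO if the file was empty or unparsable (the file is then deleted).
int loadOrDiscard(const fs::path& file,
                  const std::function<void(const std::string&)>& parse)
{
    std::error_code ec;
    if (!fs::exists(file, ec))
    {
        return -ENOENT;
    }

    // Read the whole file into memory.
    std::ifstream in(file, std::ios::binary);
    std::string data((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());

    // Case 1: zero-size file. Case 2: parser throws.
    bool valid = !data.empty();
    if (valid)
    {
        try
        {
            parse(data);
        }
        catch (const std::exception&)
        {
            valid = false;
        }
    }

    if (!valid)
    {
        in.close();
        // Best-effort removal so the next restart can re-create the file.
        fs::remove(file, ec);
        return -EIO;
    }
    return 0;
}
```

A caller would pass the path to the NV file and a lambda wrapping whatever parser is in use. The point of deleting on both paths is that the next restart finds no file at all and falls back to re-initializing it, rather than failing on the same bad file again.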
Testing is simple: load your code change, create an empty file, and restart ipmid to ensure it recovers.
rm /var/lib/ipmi/channel_access_nv.json
touch /var/lib/ipmi/channel_access_nv.json
systemctl restart phosphor-ipmi-host.service
@geissonator May I know what physical storage you are using for filesystem? flash part or eMMC? TIA
We've seen this on both AST2500 (NOR chip) and AST2600 (eMMC). It recently resurfaced in our latest release on an AST2600.
We at IBM have seen this intermittently over the years. We've seen it on our older witherspoon and mowgli systems (AST2500) but also on our new p10bmc machines (AST2600). It's very intermittent though.
The first symptom you see is this in the journal:
When you look at the file in question, /var/lib/ipmi/channel_access_nv.json, it's 0 in size:
I'm not sure how this file could end up being 0 size, but it does seem like a simple workaround is to just remove the file in the error path, https://github.com/openbmc/phosphor-host-ipmid/blob/master/user_channel/channel_mgmt.cpp#L1146. That way, when ipmi restarts, it will just re-init the files. Thoughts? I can throw up a quick patch if it makes sense.