ibm-openbmc / dev

IPMI: IPL: phosphor-ipmi-host.service SEGV core dumps during BMC reboots in FW1050 and FW1060 #3633

Closed jayeshmpatel closed 9 months ago

jayeshmpatel commented 11 months ago

Problem Description

IPL: phosphor-ipmi-host.service dumps core with SEGV during BMC reboots in FW1050 and FW1060. We have seen the ipmi-host service core dump with any 1050 release firmware driver, and we think it started after March.

Internal EWM defect # 556751

Steps to Recreate

Driver details:

ID=openbmc-openpower
NAME="IBM eBMC (OpenBMC for IBM Enterprise Systems)"
VERSION="fw1050.00-4.11"
VERSION_ID=fw1050.00-4.11-1050.2329.20230711a (NL1050_023)
VERSION_CODENAME="mickledore"
PRETTY_NAME="IBM eBMC (OpenBMC for IBM Enterprise Systems) fw1050.00-4.11"
BUILD_ID="20231117"
OPENBMC_TARGET_MACHINE="p10bmc"
EXTENDED_VERSION=NL1050_023
BMC_SIGNATURE_TYPE=Development
HOST_SIGNATURE_TYPE=Development

Journal log with error traces:

Jul 12 09:16:54 rain27bmc systemd[1]: phosphor-ipmi-host.service: Main process exited, code=dumped, status=11/SEGV
Jul 12 09:16:54 rain27bmc systemd[1]: phosphor-ipmi-host.service: Failed with result 'core-dump'.

lxwinspur commented 11 months ago

@jayeshmpatel I tested it with qemu and it works fine. Please provide more journal logs (systemd, ipmid, etc.).

jayeshmpatel commented 11 months ago

@anoo1 Do you know what other information would help to recreate or debug this? I thought it was recreated on every BMC reboot. Does the system need to be powered on, or powered off, before the reboot?

anoo1 commented 11 months ago

This can be recreated by stopping the IPMI service with the command systemctl stop phosphor-ipmi-host. It appears the IPMI app is not handling signals gracefully:

root@p10bmc:~# systemctl stop phosphor-ipmi-host
Jul 21 14:46:33 p10bmc systemd[1]: Stopping Phosphor MBOX Daemon...
Jul 21 14:46:34 p10bmc ipmid[564]: Command in process, no attention
Jul 21 14:46:34 p10bmc ipmid[564]: Command in process, no attention
Jul 21 14:46:34 p10bmc systemd[1]: mboxd.service: Deactivated successfully.
Jul 21 14:46:34 p10bmc systemd[1]: Stopped Phosphor MBOX Daemon.
Jul 21 14:46:34 p10bmc ipmid[564]: Received signal; quitting
Jul 21 14:46:34 p10bmc systemd[1]: Stopping Phosphor Inband IPMI...
Jul 21 14:46:34 p10bmc systemd[1]: Created slice Slice /system/systemd-coredump.
Jul 21 14:46:34 p10bmc systemd[1]: Started Process Core Dump (PID 2022/UID 0).
Jul 21 14:46:34 p10bmc systemd-coredump[2023]: elfutils disabled, parsing ELF objects not supported
Jul 21 14:46:34 p10bmc systemd-coredump[2023]: Process 564 (ipmid) of user 0 dumped core.
Jul 21 14:46:34 p10bmc systemd[1]: phosphor-ipmi-host.service: Main process exited, code=dumped, status=11/SEGV
Jul 21 14:46:34 p10bmc systemd[1]: phosphor-ipmi-host.service: Failed with result 'core-dump'.
Jul 21 14:46:34 p10bmc systemd[1]: Stopped Phosphor Inband IPMI.
Jul 21 14:46:34 p10bmc systemd[1]: systemd-coredump@0-2022-0.service: Deactivated successfully.
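
For comparison, here is a minimal sketch of what "handling signals gracefully" usually means: the handler only records the request, and the main loop notices the flag and returns normally so cleanup runs on the ordinary path. This is not ipmid's code. The names g_stopRequested, onTerminate and runUntilStopped are hypothetical, and the loop raises SIGTERM on itself as a stand-in for systemctl stop, which sends SIGTERM by default.

```cpp
#include <cassert>
#include <csignal>

// Set by the signal handler. sig_atomic_t is a type that may be written
// safely from inside a handler.
volatile std::sig_atomic_t g_stopRequested = 0;

extern "C" void onTerminate(int /*signalNumber*/)
{
    // Only record the request; no logging, allocation, or teardown here.
    g_stopRequested = 1;
}

// Hypothetical event loop: runs until a stop is requested (or the iteration
// limit is hit) and then returns, so destructors run on the normal path.
// raiseAt simulates an external SIGTERM arriving on that iteration.
int runUntilStopped(int maxIterations, int raiseAt)
{
    assert(maxIterations >= 0);
    std::signal(SIGTERM, onTerminate);
    std::signal(SIGINT, onTerminate);

    int iterations = 0;
    while (!g_stopRequested && iterations < maxIterations)
    {
        ++iterations;
        if (iterations == raiseAt)
        {
            // Stand-in for the SIGTERM that systemd sends on stop.
            std::raise(SIGTERM);
        }
    }
    return iterations;
}
```

Calling runUntilStopped(10, 3) leaves the loop on the iteration where the signal arrives instead of running all ten iterations. Whether ipmid's crash comes from its handler or from teardown after the loop exits is not established in this thread.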

lxwinspur commented 11 months ago

After discussing this with Patrick, I still did not get the desired result: https://discord.com/channels/775381525260664832/1107848576995954688/1133692956042330163

I double-checked this logic and I really don't know what the point of adding the signal function is. In my testing, if we remove the signal function, or comment out io->stop() [1], ipmid works fine.

[1] https://github.com/openbmc/phosphor-host-ipmid/blob/master/ipmid-new.cpp#L868

@anoo1 @jayeshmpatel @mzipse What do you think?

anoo1 commented 11 months ago

Patrick mentioned that it would probably be better to debug the core dump first and see if it gives more information, before we remove functionality whose purpose we don't understand. He also noted that the signal function was added in 2019. At IBM we appear to have started seeing this issue earlier this year, around March (from our test logs), so maybe something else changed.

Here is a tool for running gdb on an OpenBMC core dump: https://github.com/openbmc/openbmc-tools/blob/master/bbdbg/bbdbg

We could also open an issue in the ipmi repository to have one of the maintainers look at this since they haven't responded on Discord.

lxwinspur commented 11 months ago

@anoo1 I opened a new issue: https://github.com/openbmc/phosphor-host-ipmid/issues/191, but have not received a reply.

"It appears at IBM we've started seeing this issue earlier this year around March (from our test logs) so maybe something else changed."

I don't think so. I rolled back to 114669c40ceff8f065850ffa700c877a8823412d (Thu Dec 15 18:20:33 2022) and the issue still exists.

I personally recommend removing the signal function as I don't think it does what we want.

anoo1 commented 9 months ago

Thanks George. I don't have the expertise to say whether we should remove the call that stops the io. I tried surrounding that stop call and the asyncWait call with "if (!io->stopped())", which Patrick mentioned in Discord could help, but it didn't fix the problem. Since this is seen during BMC reboot and IPMI is being replaced by Redfish, this is low priority for IBM, and there is no reason to spend more time on it beyond the time you've already taken to analyze the core and the code and to discuss it in Discord. I'm going to close this issue and let the ipmi maintainers address it in the issue you opened.
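
For readers following along, the "if (!io->stopped())" guard mentioned above has roughly the shape sketched below. This is an illustration only, not the ipmid source. FakeIoContext is a hypothetical stand-in that models just the two calls named in this thread, stop() and stopped(), and stopIfRunning is a made-up helper name. As noted above, this guard did not resolve the crash.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the io context used by ipmid. Only stop() and
// stopped() are modelled; stopCalls() exists purely for illustration.
class FakeIoContext
{
  public:
    void stop()
    {
        ++stopCalls_;
        stopped_ = true;
    }
    bool stopped() const { return stopped_; }
    int stopCalls() const { return stopCalls_; }

  private:
    bool stopped_ = false;
    int stopCalls_ = 0;
};

// Guarded stop: request a stop only if the context is still running.
// Returns true when this call actually issued the stop.
bool stopIfRunning(const std::shared_ptr<FakeIoContext>& io)
{
    assert(io != nullptr);
    if (!io->stopped())
    {
        io->stop();
        return true;
    }
    return false;
}
```

The guard only makes a repeated stop request a no-op. It cannot help if the fault happens during teardown after the loop has already been stopped once, which would be consistent with the guard making no difference here.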