openbmc / phosphor-state-manager

Apache License 2.0
11 stars 21 forks source link

Witherspoon : phosphor-chassis-state-manager: terminate called after throwing an instance of 'sdbusplus::exception::SdBusError' #4

Closed gkeishin closed 2 years ago

gkeishin commented 5 years ago
Mar 26 14:46:47 witherspoon phosphor-chassis-state-manager[1230]: terminate called after throwing an instance of 'sdbusplus::exception::SdBusError'
Mar 26 14:46:47 witherspoon phosphor-chassis-state-manager[1230]:   what():  sd_bus_call noreply: org.freedesktop.DBus.Error.Timeout: Connection timed out
Mar 26 14:46:47 witherspoon systemd[1]: xyz.openbmc_project.State.Chassis.service: Main process exited, code=killed, status=6/ABRT
Mar 26 14:46:47 witherspoon systemd[1]: xyz.openbmc_project.State.Chassis.service: Failed with result 'signal'.
{
    "__CURSOR" : "s=fc4280edf7bc43ff86244ffb7681f46c;i=d9e;b=26106d7f20d84ae7ac618df37251863d;m=4e08824;t=5850063fde8d5;x=2b43dc96df4f3b69",
    "__REALTIME_TIMESTAMP" : "1553611607632085",
    "__MONOTONIC_TIMESTAMP" : "81823780",
    "_BOOT_ID" : "26106d7f20d84ae7ac618df37251863d",
    "PRIORITY" : "6",
    "_SYSTEMD_SLICE" : "system.slice",
    "_MACHINE_ID" : "5a1711ca1bfd4ff799e7aebfbc6fee9d",
    "_HOSTNAME" : "witherspoon",
    "_UID" : "0",
    "_GID" : "0",
    "_CAP_EFFECTIVE" : "3fffffffff",
    "SYSLOG_FACILITY" : "3",
    "_TRANSPORT" : "stdout",
    "_STREAM_ID" : "3e9b0728d4dd43d8943d396b5d6f0c16",
    "SYSLOG_IDENTIFIER" : "phosphor-chassis-state-manager",
    "MESSAGE" : "terminate called after throwing an instance of 'sdbusplus::exception::SdBusError'",
    "_PID" : "1230",
    "_COMM" : "phosphor-chassi",
    "_EXE" : "/usr/sbin/phosphor-chassis-state-manager",
    "_CMDLINE" : "phosphor-chassis-state-manager",
    "_SYSTEMD_CGROUP" : "/system.slice/xyz.openbmc_project.State.Chassis.service",
    "_SYSTEMD_UNIT" : "xyz.openbmc_project.State.Chassis.service",
    "_SYSTEMD_INVOCATION_ID" : "67615acd29f44f56a83cc9d625425fed"
}
{
    "__CURSOR" : "s=fc4280edf7bc43ff86244ffb7681f46c;i=d9f;b=26106d7f20d84ae7ac618df37251863d;m=4e0b844;t=5850063fe18f4;x=70d54893496af971",
    "__REALTIME_TIMESTAMP" : "1553611607644404",
    "__MONOTONIC_TIMESTAMP" : "81836100",
    "_BOOT_ID" : "26106d7f20d84ae7ac618df37251863d",
    "PRIORITY" : "6",
    "_SYSTEMD_SLICE" : "system.slice",
    "_MACHINE_ID" : "5a1711ca1bfd4ff799e7aebfbc6fee9d",
    "_HOSTNAME" : "witherspoon",
    "_UID" : "0",
    "_GID" : "0",
    "_CAP_EFFECTIVE" : "3fffffffff",
    "SYSLOG_FACILITY" : "3",
    "_TRANSPORT" : "stdout",
    "_STREAM_ID" : "3e9b0728d4dd43d8943d396b5d6f0c16",
    "SYSLOG_IDENTIFIER" : "phosphor-chassis-state-manager",
    "_PID" : "1230",
    "_COMM" : "phosphor-chassi",
    "_EXE" : "/usr/sbin/phosphor-chassis-state-manager",
    "_CMDLINE" : "phosphor-chassis-state-manager",
    "_SYSTEMD_CGROUP" : "/system.slice/xyz.openbmc_project.State.Chassis.service",
    "_SYSTEMD_UNIT" : "xyz.openbmc_project.State.Chassis.service",
    "_SYSTEMD_INVOCATION_ID" : "67615acd29f44f56a83cc9d625425fed",
    "MESSAGE" : "  what():  sd_bus_call noreply: org.freedesktop.DBus.Error.Timeout: Connection timed out"
}
root@witherspoon:~# cat /etc/os-release
ID="openbmc-phosphor"
NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)"
VERSION="2.7.0-dev"
VERSION_ID="2.7.0-dev-162-g052de7c"
PRETTY_NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) 2.7.0-dev"
BUILD_ID="2.7.0-dev"
OPENBMC_TARGET_MACHINE="witherspoon"
root@witherspoon:~#
geissonator commented 5 years ago

@wak-google we're hitting this intermittently in CI. Since we're not seeing the log from our catch clause ("Error performing call to get pgood"), I assume it's the systemd signal subscription in the constructor causing it? Haven't changed any code in this area in a while so interested in anyone's thoughts or brainstorms on what to look at.

gkeishin commented 5 years ago
Apr 12 19:58:40 witherspoon phosphor-chassis-state-manager[1312]: terminate called after throwing an instance of 'sdbusplus::exception::SdBusError'
Apr 12 19:58:40 witherspoon phosphor-chassis-state-manager[1312]:   what():  sd_bus_call noreply: org.freedesktop.DBus.Error.Timeout: Connection timed out
Apr 12 19:58:40 witherspoon systemd[1]: xyz.openbmc_project.State.Chassis.service: Main process exited, code=killed, status=6/ABRT
Apr 12 19:58:40 witherspoon phosphor-network-manager[1301]: Failed to reset failed unit
Apr 12 19:58:40 witherspoon phosphor-network-manager[1301]: sdeventplus: timeCallback: xyz.openbmc_project.Common.Error.InternalFailure: The operation failed internally.
Apr 12 19:58:40 witherspoon phosphor-network-manager[1301]: The operation failed internally.
Apr 12 19:58:40 witherspoon phosphor-network-manager[1301]: Refreshing the objects.
Apr 12 19:58:40 witherspoon systemd[1]: xyz.openbmc_project.State.Chassis.service: Failed with result 'signal'.
Apr 12 19:58:40 witherspoon systemd[1]: Failed to start Phosphor Chassis State Manager.
Apr 12 19:58:42 witherspoon systemd[1]: xyz.openbmc_project.State.Chassis.service: Service RestartSec=1s expired, scheduling restart.
Apr 12 19:58:42 witherspoon systemd[1]: xyz.openbmc_project.State.Chassis.service: Scheduled restart job, restart counter is at 1

seeing in master too on reboot to standby now and then..

geissonator commented 5 years ago

Some interesting data from these recent recreates (we see this in HW CI ~5% of the time).

Both phosphor-chassis-state-manager and phosphor-network-manager hit the same failure at the same time.

The error is "ERR" : "sd_bus_call noreply: org.freedesktop.DBus.Error.Timeout: Connection timed out", from the details of the "Failed to reset failed unit" message which is the same as phosphor-chassis-state-manager. Both applications are making a call to systemd. At the same time as all of this "obmc-flash-bmc-ubiremount.service" is running which generates a lot of kernel ubi messages.

So one theory is that the UBI process is hogging enough CPU cycles that the two services calling into systemd over dbus timeout. Could systemd be getting tied up on certain events from the UBI mounting?

Load average, taken after the system reaches BMC Ready is def high: 17:57:02 up 3 min, load average: 3.22, 2.01, 0.81

Here's a fuller trace of the fail:

May 13 17:55:28 witherspoon kernel: block ubiblock0_1: created from ubi0:1(rofs-d73fbf54)
May 13 17:55:28 witherspoon kernel[1176]: [   94.532055] block ubiblock0_1: created from ubi0:1(rofs-d73fbf54)
May 13 17:55:29 witherspoon kernel: block ubiblock8_0: created from ubi8:0(pnor-ro-ace821ef)
May 13 17:55:29 witherspoon kernel[1176]: [   95.347417] block ubiblock8_0: created from ubi8:0(pnor-ro-ace821ef)
May 13 17:55:30 witherspoon kernel: UBIFS (ubi8:1): Mounting in unauthenticated mode
May 13 17:55:30 witherspoon kernel[1176]: [   95.923727] UBIFS (ubi8:1): Mounting in unauthenticated mode
May 13 17:55:30 witherspoon kernel: UBIFS (ubi8:1): background thread "ubifs_bgt8_1" started, PID 1381
May 13 17:55:30 witherspoon kernel[1176]: [   95.983651] UBIFS (ubi8:1): background thread "ubifs_bgt8_1" started, PID 1381
May 13 17:55:30 witherspoon kernel: UBIFS (ubi8:1): UBIFS: mounted UBI device 8, volume 1, name "pnor-rw-ace821ef"
May 13 17:55:30 witherspoon kernel: UBIFS (ubi8:1): LEB size: 65408 bytes (63 KiB), min./max. I/O unit sizes: 8 bytes/256 bytes
May 13 17:55:30 witherspoon kernel: UBIFS (ubi8:1): FS size: 16155776 bytes (15 MiB, 247 LEBs), journal size 784896 bytes (0 MiB, 12 LEBs)
May 13 17:55:30 witherspoon kernel: UBIFS (ubi8:1): reserved for root: 763076 bytes (745 KiB)
May 13 17:55:30 witherspoon kernel: UBIFS (ubi8:1): media format: w5/r0 (latest is w5/r0), UUID 6F01001B-FB8C-4AE9-AFFE-676C4639A20B, small LPT model
May 13 17:55:30 witherspoon kernel[1176]: [   96.422835] UBIFS (ubi8:1): UBIFS: mounted UBI device 8, volume 1, name "pnor-rw-ace821ef"
May 13 17:55:30 witherspoon kernel[1176]: [   96.431207] UBIFS (ubi8:1): LEB size: 65408 bytes (63 KiB), min./max. I/O unit sizes: 8 bytes/256 bytes
May 13 17:55:30 witherspoon kernel[1176]: [   96.440870] UBIFS (ubi8:1): FS size: 16155776 bytes (15 MiB, 247 LEBs), journal size 784896 bytes (0 MiB, 12 LEBs)
May 13 17:55:30 witherspoon kernel[1176]: [   96.451368] UBIFS (ubi8:1): reserved for root: 763076 bytes (745 KiB)
May 13 17:55:30 witherspoon kernel[1176]: [   96.457910] UBIFS (ubi8:1): media format: w5/r0 (latest is w5/r0), UUID 6F01001B-FB8C-4AE9-AFFE-676C4639A20B, small LPT model
May 13 17:55:31 witherspoon obmc-flash-bmc[1309]: wdt2bite=mw.l 0x1e785024 0xa 1; mw.b 0x1e78502c 0xb3 1
May 13 17:55:31 witherspoon kernel: UBIFS (ubi8:2): Mounting in unauthenticated mode
May 13 17:55:31 witherspoon kernel[1176]: [   97.034190] UBIFS (ubi8:2): Mounting in unauthenticated mode
May 13 17:55:31 witherspoon kernel: UBIFS (ubi8:2): background thread "ubifs_bgt8_2" started, PID 1404
May 13 17:55:31 witherspoon kernel[1176]: [   97.105748] UBIFS (ubi8:2): background thread "ubifs_bgt8_2" started, PID 1404
May 13 17:55:31 witherspoon kernel: UBIFS (ubi4:2): Mounting in unauthenticated mode
May 13 17:55:31 witherspoon kernel[1176]: [   97.403089] UBIFS (ubi4:2): Mounting in unauthenticated mode
May 13 17:55:31 witherspoon kernel: UBIFS (ubi4:2): background thread "ubifs_bgt4_2" started, PID 1408
May 13 17:55:31 witherspoon kernel[1176]: [   97.458899] UBIFS (ubi4:2): background thread "ubifs_bgt4_2" started, PID 1408
May 13 17:55:31 witherspoon kernel: UBIFS (ubi8:2): UBIFS: mounted UBI device 8, volume 2, name "pnor-prsv"
May 13 17:55:31 witherspoon kernel: UBIFS (ubi8:2): LEB size: 65408 bytes (63 KiB), min./max. I/O unit sizes: 8 bytes/256 bytes
May 13 17:55:31 witherspoon kernel: UBIFS (ubi8:2): FS size: 1504384 bytes (1 MiB, 23 LEBs), journal size 523265 bytes (0 MiB, 6 LEBs)
May 13 17:55:31 witherspoon kernel: UBIFS (ubi8:2): reserved for root: 71055 bytes (69 KiB)
May 13 17:55:31 witherspoon kernel: UBIFS (ubi8:2): media format: w5/r0 (latest is w5/r0), UUID 2D66F83D-4E82-447A-89ED-639BAE5B758C, small LPT model
May 13 17:55:31 witherspoon kernel[1176]: [   97.477532] UBIFS (ubi8:2): UBIFS: mounted UBI device 8, volume 2, name "pnor-prsv"
May 13 17:55:31 witherspoon kernel[1176]: [   97.485362] UBIFS (ubi8:2): LEB size: 65408 bytes (63 KiB), min./max. I/O unit sizes: 8 bytes/256 bytes
May 13 17:55:31 witherspoon kernel[1176]: [   97.494883] UBIFS (ubi8:2): FS size: 1504384 bytes (1 MiB, 23 LEBs), journal size 523265 bytes (0 MiB, 6 LEBs)
May 13 17:55:31 witherspoon kernel[1176]: [   97.504976] UBIFS (ubi8:2): reserved for root: 71055 bytes (69 KiB)
May 13 17:55:31 witherspoon kernel[1176]: [   97.511259] UBIFS (ubi8:2): media format: w5/r0 (latest is w5/r0), UUID 2D66F83D-4E82-447A-89ED-639BAE5B758C, small LPT model
May 13 17:55:32 witherspoon kernel: UBIFS (ubi4:2): UBIFS: mounted UBI device 4, volume 2, name "rwfs"
May 13 17:55:32 witherspoon kernel: UBIFS (ubi4:2): LEB size: 65408 bytes (63 KiB), min./max. I/O unit sizes: 8 bytes/256 bytes
May 13 17:55:32 witherspoon kernel: UBIFS (ubi4:2): FS size: 3597440 bytes (3 MiB, 55 LEBs), journal size 719488 bytes (0 MiB, 11 LEBs)
May 13 17:55:32 witherspoon kernel: UBIFS (ubi4:2): reserved for root: 0 bytes (0 KiB)
May 13 17:55:32 witherspoon kernel: UBIFS (ubi4:2): media format: w4/r0 (latest is w5/r0), UUID DF80CC8C-601F-4EAB-B116-2169D8ED609C, small LPT model
May 13 17:55:32 witherspoon kernel[1176]: [   97.611532] UBIFS (ubi4:2): UBIFS: mounted UBI device 4, volume 2, name "rwfs"
May 13 17:55:32 witherspoon kernel[1176]: [   97.618904] UBIFS (ubi4:2): LEB size: 65408 bytes (63 KiB), min./max. I/O unit sizes: 8 bytes/256 bytes
May 13 17:55:32 witherspoon kernel[1176]: [   97.628432] UBIFS (ubi4:2): FS size: 3597440 bytes (3 MiB, 55 LEBs), journal size 719488 bytes (0 MiB, 11 LEBs)
May 13 17:55:32 witherspoon kernel[1176]: [   97.638624] UBIFS (ubi4:2): reserved for root: 0 bytes (0 KiB)
May 13 17:55:32 witherspoon kernel[1176]: [   97.644563] UBIFS (ubi4:2): media format: w4/r0 (latest is w5/r0), UUID DF80CC8C-601F-4EAB-B116-2169D8ED609C, small LPT model
May 13 17:55:32 witherspoon systemd[1]: Started Remount the squashfs and ubi volumes after a reboot.
May 13 17:55:32 witherspoon systemd[1]: obmc-flash-bmc-ubiremount.service: Succeeded.
May 13 17:55:32 witherspoon systemd[1]: Started Remount the ubi volume after a reboot.
May 13 17:55:33 witherspoon phosphor-chassis-state-manager[1279]: terminate called after throwing an instance of 'sdbusplus::exception::SdBusError'
May 13 17:55:33 witherspoon phosphor-chassis-state-manager[1279]:   what():  sd_bus_call noreply: org.freedesktop.DBus.Error.Timeout: Connection timed out
May 13 17:55:33 witherspoon systemd[1]: xyz.openbmc_project.State.Chassis.service: Main process exited, code=killed, status=6/ABRT
May 13 17:55:33 witherspoon systemd[1]: xyz.openbmc_project.State.Chassis.service: Failed with result 'signal'.
May 13 17:55:33 witherspoon systemd[1]: Failed to start Phosphor Chassis State Manager.
May 13 17:55:34 witherspoon systemd[1]: xyz.openbmc_project.State.Chassis.service: Service RestartSec=1s expired, scheduling restart.
May 13 17:55:34 witherspoon systemd[1]: xyz.openbmc_project.State.Chassis.service: Scheduled restart job, restart counter is at 1.
May 13 17:55:36 witherspoon phosphor-network-manager[1278]: Failed to reset failed unit
May 13 17:55:36 witherspoon phosphor-network-manager[1278]: The operation failed internally.
May 13 17:55:36 witherspoon phosphor-network-manager[1278]: sdeventplus: timeCallback: xyz.openbmc_project.Common.Error.InternalFailure: The operation failed internally.
May 13 17:55:36 witherspoon phosphor-network-manager[1278]: Refreshing the objects.
May 13 17:55:39 witherspoon systemd[1]: Started Phosphor Network Manager.
wak-google commented 5 years ago

I believe we are calling the synchronous restart operation, so it could be that the load causes that restart process to take a long time?

geissonator commented 5 years ago

Hmm, well we have two services failing here. The call in phosphor-network-manager which is calling systemd with a "ResetFailedUnit" command. phosphor-chassis-state-manager I think is just loading it's constructor in phosphor-state-manager/chassis_state_manager.hpp which subscribes to a systemd signal via sdbusplus and sets up a sdeventplus::Event timer. Do you think trying to add a bigger timeout to these calls would help? I'm not sure if I can even configure a timeout with the constructor calls.

A hack would be to just put these services After the obmc-flash-bmc-ubiremount.service but then we could just end up with other services hitting this.

geissonator commented 5 years ago

Actually, it's probably this code that's failing in chassis-state manager - https://github.com/openbmc/phosphor-state-manager/blob/master/chassis_state_manager.cpp#L50

@wak-google do you know, when no timeout is passed to call_noreply, what the default is?

geissonator commented 5 years ago

I put up a commit here to at least help verify it's where we think it is - https://gerrit.openbmc-project.xyz/c/openbmc/phosphor-state-manager/+/21520

geissonator commented 5 years ago

Got a recreate with the additional FFDC:

May 21 11:10:18 witherspoon phosphor-chassis-state-manager[1285]: Failed to subscribe to systemd signals
May 21 11:10:18 witherspoon phosphor-chassis-state-manager[1285]: The operation failed internally.
May 21 11:10:19 witherspoon phosphor-chassis-state-manager[1285]: terminate called after throwing an instance of 'sdbusplus::xyz::openbmc_project::Common::Error::InternalFailure'
May 21 11:10:19 witherspoon phosphor-chassis-state-manager[1285]:   what():  xyz.openbmc_project.Common.Error.InternalFailure: The operation failed internally.

{
    "MESSAGE" : "Failed to subscribe to systemd signals",
    "ERR" : "sd_bus_call noreply: org.freedesktop.DBus.Error.Timeout: Connection timed out"
}
geissonator commented 5 years ago

@wak-google one thing that's interesting here is that both the phosphor-bmc-state-manager and the phosphor-chassis-state-manager are being started at the same time and doing the same thing (subscribing to systemd signals).

May 21 11:09:52 witherspoon systemd[1]: Starting Phosphor BMC State Manager...
May 21 11:09:53 witherspoon systemd[1]: Starting Phosphor Chassis State Manager...
...
May 21 11:10:18 witherspoon phosphor-chassis-state-manager[1285]: Failed to subscribe to systemd signals
...
May 21 11:10:29 witherspoon phosphor-bmc-state-manager[1282]: Setting the BMCState field

You can see the chassis times out but the bmc one works. The only difference I see is that the chassis app is calling "bus.call_noreply()" and the bmc is calling "bus.call()". It seems somewhat counter-intuitive that the call_noreply() would be the one to timeout but in lieu of other ideas, I may just move the chassis call to bus.call().

The network_manager code that fails in this path is also using call_noreply(). Wondering if call_noreply() just has a shorter timeout or something?

wak-google commented 5 years ago

bus.call and bus.call_noreply use the exact same underlying call and timeouts.

geissonator commented 5 years ago

bus.call and bus.call_noreply use the exact same underlying call and timeouts.

Hmm, but one passes a valid reply object to sd_bus_call and the other does not. Do you think that changes the behavior of the call?

wak-google commented 5 years ago

Not in any significant way. It only references that variable right before return.

geissonator commented 5 years ago

This continues to hit us intermittently in HW CI. I think I'm going to just try the bus.call(). It wont hurt anything and will at least help rule it out if it doesn't work.

geissonator commented 5 years ago

Even with the change above, we are still seeing this intermittently in testing. It has the following 3 symptoms:

phosphor-chassis-state-manager failure

Aug 16 07:19:16 witherspoon-YL30UF74T02X kernel: block ubiblock4_0: created from ubi4:0(rofs-91691bd9)
Aug 16 07:19:16 witherspoon-YL30UF74T02X obmc-flash-bmc[267]: wdt2bite=mw.l 0x1e785024 0xa 1; mw.b 0x1e78502c 0xb3 1
Aug 16 07:19:16 witherspoon-YL30UF74T02X kernel: UBIFS (ubi4:2): Mounting in unauthenticated mode
Aug 16 07:19:16 witherspoon-YL30UF74T02X kernel: UBIFS (ubi4:2): background thread "ubifs_bgt4_2" started, PID 339
Aug 16 07:19:17 witherspoon-YL30UF74T02X kernel: UBIFS (ubi4:2): UBIFS: mounted UBI device 4, volume 2, name "rwfs"
Aug 16 07:19:17 witherspoon-YL30UF74T02X kernel: UBIFS (ubi4:2): LEB size: 65408 bytes (63 KiB), min./max. I/O unit sizes: 8 bytes/256 bytes
Aug 16 07:19:17 witherspoon-YL30UF74T02X kernel: UBIFS (ubi4:2): FS size: 3597440 bytes (3 MiB, 55 LEBs), journal size 719488 bytes (0 MiB, 11 LEBs)
Aug 16 07:19:17 witherspoon-YL30UF74T02X kernel: UBIFS (ubi4:2): reserved for root: 0 bytes (0 KiB)
Aug 16 07:19:17 witherspoon-YL30UF74T02X kernel: UBIFS (ubi4:2): media format: w4/r0 (latest is w5/r0), UUID DF80CC8C-601F-4EAB-B116-2169D8ED609C, small LPT model
Aug 16 07:19:17 witherspoon-YL30UF74T02X systemd[1]: obmc-flash-bmc-ubiremount.service: Succeeded.
Aug 16 07:19:17 witherspoon-YL30UF74T02X systemd[1]: Started Remount the ubi volume after a reboot.
Aug 16 07:19:18 witherspoon-YL30UF74T02X kernel: block ubiblock8_0: created from ubi8:0(pnor-ro-ace821ef)
Aug 16 07:19:18 witherspoon-YL30UF74T02X kernel: UBIFS (ubi8:1): Mounting in unauthenticated mode
Aug 16 07:19:18 witherspoon-YL30UF74T02X kernel: UBIFS (ubi8:1): background thread "ubifs_bgt8_1" started, PID 348
Aug 16 07:19:18 witherspoon-YL30UF74T02X kernel: UBIFS (ubi8:1): UBIFS: mounted UBI device 8, volume 1, name "pnor-rw-ace821ef"
Aug 16 07:19:18 witherspoon-YL30UF74T02X kernel: UBIFS (ubi8:1): LEB size: 65408 bytes (63 KiB), min./max. I/O unit sizes: 8 bytes/256 bytes
Aug 16 07:19:18 witherspoon-YL30UF74T02X kernel: UBIFS (ubi8:1): FS size: 16155776 bytes (15 MiB, 247 LEBs), journal size 784896 bytes (0 MiB, 12 LEBs)
Aug 16 07:19:19 witherspoon-YL30UF74T02X kernel: UBIFS (ubi8:1): reserved for root: 763076 bytes (745 KiB)
Aug 16 07:19:19 witherspoon-YL30UF74T02X kernel: UBIFS (ubi8:1): media format: w5/r0 (latest is w5/r0), UUID 6F01001B-FB8C-4AE9-AFFE-676C4639A20B, small LPT model
Aug 16 07:19:19 witherspoon-YL30UF74T02X kernel: UBIFS (ubi8:2): Mounting in unauthenticated mode
Aug 16 07:19:19 witherspoon-YL30UF74T02X kernel: UBIFS (ubi8:2): background thread "ubifs_bgt8_2" started, PID 355
Aug 16 07:19:19 witherspoon-YL30UF74T02X kernel: UBIFS (ubi8:2): UBIFS: mounted UBI device 8, volume 2, name "pnor-prsv"
Aug 16 07:19:19 witherspoon-YL30UF74T02X kernel: UBIFS (ubi8:2): LEB size: 65408 bytes (63 KiB), min./max. I/O unit sizes: 8 bytes/256 bytes
Aug 16 07:19:19 witherspoon-YL30UF74T02X kernel: UBIFS (ubi8:2): FS size: 1504384 bytes (1 MiB, 23 LEBs), journal size 523265 bytes (0 MiB, 6 LEBs)
Aug 16 07:19:19 witherspoon-YL30UF74T02X kernel: UBIFS (ubi8:2): reserved for root: 71055 bytes (69 KiB)
Aug 16 07:19:19 witherspoon-YL30UF74T02X kernel: UBIFS (ubi8:2): media format: w5/r0 (latest is w5/r0), UUID 2D66F83D-4E82-447A-89ED-639BAE5B758C, small LPT model
Aug 16 07:19:19 witherspoon-YL30UF74T02X systemd[1]: Started Remount the squashfs and ubi volumes after a reboot.
Aug 16 07:19:21 witherspoon-YL30UF74T02X phosphor-chassis-state-manager[234]: Failed to subscribe to systemd signals
Aug 16 07:19:21 witherspoon-YL30UF74T02X phosphor-chassis-state-manager[234]: The operation failed internally.
Aug 16 07:19:21 witherspoon-YL30UF74T02X phosphor-chassis-state-manager[234]: terminate called after throwing an instance of 'sdbusplus::xyz::openbmc_project::Common::Error::InternalFailure'
Aug 16 07:19:21 witherspoon-YL30UF74T02X phosphor-chassis-state-manager[234]:   what():  xyz.openbmc_project.Common.Error.InternalFailure: The operation failed internally.
Aug 16 07:19:21 witherspoon-YL30UF74T02X systemd[1]: xyz.openbmc_project.State.Chassis.service: Main process exited, code=killed, status=6/ABRT
Aug 16 07:19:21 witherspoon-YL30UF74T02X systemd[1]: xyz.openbmc_project.State.Chassis.service: Failed with result 'signal'.
Aug 16 07:19:21 witherspoon-YL30UF74T02X systemd[1]: Failed to start Phosphor Chassis State Manager.
Aug 16 07:19:22 witherspoon-YL30UF74T02X systemd[1]: xyz.openbmc_project.State.Chassis.service: Service RestartSec=1s expired, scheduling restart.
Aug 16 07:19:22 witherspoon-YL30UF74T02X systemd[1]: xyz.openbmc_project.State.Chassis.service: Scheduled restart job, restart counter is at 1.

phosphor-network-manager failure

Jul 31 04:15:19 witherspoon phosphor-network-manager[1279]: Failed to reset failed unit
Jul 31 04:15:19 witherspoon phosphor-network-manager[1279]: The operation failed internally.

phosphor-bmc-state-manager failure

Jul 31 04:15:17 witherspoon phosphor-bmc-state-manager[1278]: Error in Subscribe

All three fail on calls to the systemd service (org.freedesktop.systemd1) during a BMC boot.

geissonator commented 5 years ago

I tried a test today where in one ssh session I ran the following:

while true; do busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager GetUnitFileState s "obmc-chassis-poweron@0.target"; don

And in another I ran:

while true; do systemctl restart xyz.openbmc_project.State.Host xyz.openbmc_project.State.Chassis xyz.openbmc_project.State.BMC; done

The BMC uptime was running around 3 (the rngd bug was help with that) and the restart of the services took forever...but it did complete, and the services were able to communicate with systemd over D-bus. I was never able to recreate a timeout to the sysetmd D-bus service. This leads me to believe it's not specifically related to CPU utilization causing the problem.

geissonator commented 4 years ago

I've created a patch where instead of using the default 25 seconds for a timeout on dbus, we will wait indefinitely. George is testing this out. If it works, then we know it is related to timeouts on dbus. If it doesn't work then it'll be another clue for us going forward.

geissonator commented 4 years ago

The timeout increase appeared to work. Since this seems UBI filesystem specific, and we are moving away from UBI for future products, I'm going to just do a downstream fix that increases our dbus timeout. http://www.linux-mtd.infradead.org/doc/ubi.html#L_scalability has some good reading on why UBI takes such a big chunk of CPU when it first mounts a filesystem.

geissonator commented 2 years ago

No longer seeing this issue in CI so closing out.