openbmc / phosphor-state-manager

Apache License 2.0
11 stars 21 forks source link

mapper wait returns when dbus object is not on bus #2

Closed geissonator closed 5 years ago

geissonator commented 5 years ago

@geissonator commented on Fri Jan 11 2019

root@romulus:/lib/systemd/system# cat /etc/os-release 
ID="openbmc-phosphor"
NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)"
VERSION="v2.4-212"
VERSION_ID="v2.4-212-g16d6208d7"
PRETTY_NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) v2.4-212"
BUILD_ID="v2.4"

I'm chasing an issue where a service intermittently fails to get a dbus object back from mapper. The service has the appropriate mapper wait in it's service file:

Wants=mapper-wait@-xyz-openbmc_project-control-host%i.service
After=mapper-wait@-xyz-openbmc_project-control-host%i.service

But still hits this intermittently:

Jan 08 23:29:35 romulus phosphor-host-check[1168]: Check if host is running
Jan 08 23:29:36 romulus phosphor-host-check[1168]: Error in mapper call for control host
Jan 08 23:29:36 romulus phosphor-host-check[1168]: terminate called after throwing an instance of 'sdbusplus::exception::SdBusError'
Jan 08 23:29:36 romulus phosphor-host-check[1168]:   what():  sd_bus_call: org.freedesktop.DBus.Error.FileNotFound: path or object not found

I can see in the journal that the ipmid process (which hosts this object) had been started but it wasn't clear whether it had put the appropriate dbus object on the bus.

So I started running some tests.

systemctl stop phosphor-ipmi-host.service

busctl call xyz.openbmc_project.ObjectMapper /xyz/openbmc_project/object_mapper xyz.openbmc_project.ObjectMapper GetObject sas "/xyz/openbmc_project/control/host0" 1 "xyz.openbmc_project.Control.Host"
path or object not found

mapper wait /xyz/openbmc_project/control/host0
# (this returned immediately with no error code when it should not have)

I did this same test on a code image that had the old mapper (v2.4-0-gc71bcdc) and verified this test passed as expected (the mapper command hung until I started the service back up).


@geissonator commented on Fri Jan 11 2019

I'll keep looking into this until someone with more knowledge on it wants to help :)


@geissonator commented on Mon Jan 14 2019

A bit of debug points to the issue being that xyz.openbmc_project.Settings also implements /xyz/openbmc_project/control/host0

root@romulus:/lib/systemd/system# busctl tree xyz.openbmc_project.Settings | cat
/xyz
/xyz/openbmc_project
/xyz/openbmc_project/control
/xyz/openbmc_project/control/host0
/xyz/openbmc_project/control/host0/TPMEnable
/xyz/openbmc_project/control/host0/auto_reboot
/xyz/openbmc_project/control/host0/boot
/xyz/openbmc_project/control/host0/boot/one_time
/xyz/openbmc_project/control/host0/power_cap
/xyz/openbmc_project/control/host0/power_restore_policy
/xyz/openbmc_project/control/host0/restriction_mode
/xyz/openbmc_project/control/host0/turbo_allowed
/xyz/openbmc_project/control/minimum_ship_level_required
/xyz/openbmc_project/control/power_supply_attributes
/xyz/openbmc_project/control/power_supply_redundancy
...
mapper get-service /xyz/openbmc_project/control/host0
xyz.openbmc_project.Control.Host

systemctl stop phosphor-ipmi-host.service

mapper get-service /xyz/openbmc_project/control/host0
xyz.openbmc_project.Settings

Need to see if there's a way with the mapper syntax to specify more details on what we're looking for. Otherwise may just need to encode a service dependency on phosphor-ipmi-host.service.


@geissonator commented on Mon Jan 14 2019

No obvious way to fix this within the mapper code. We discussed an optional command line parameter where you could pass the Interface but that would make the mapper service syntax very messy. For now, move this bug over to phosphor-state-manager and have the service specify explicitly which service it requires.

AlexanderAmelkin commented 5 years ago

@geissonator, the committed fix doesn't look correct. The phosphor-ipmi-host.service unit has ExecStart, but doesn't have Type. That means that Type defaults to simple, which means that systemd will consider phosphor-ipmi-host.service successfully started right after fork'ing off, even before exec'ing the ipmid executable. So there is still a race.

Besides, we've hit a very similar intermittent problem (on a substantially older build though) and we have a very interesting observation. Pay attention to the timestamps:

This is phosphor-reset-host-check.service starting:

root@suvorov:/lib/systemd/system# systemctl status phosphor-reset-host-check@0.service --no-pager
● phosphor-reset-host-check@0.service - Check Host0 status on BMC reset
   Loaded: loaded (/lib/systemd/system/obmc-host-reset@0.target.requires/../phosphor-reset-host-check@.service; disabled; vendor preset: enabled)
   Active: failed (Result: core-dump) since Fri 2019-01-18 08:55:01 UTC; 24min ago
  Process: 1051 ExecStart=/usr/bin/env phosphor-host-check (code=dumped, signal=ABRT)
 Main PID: 1051 (code=dumped, signal=ABRT)

Jan 18 08:54:33 suvorov systemd[1]: Starting Check Host0 status on BMC reset...
Jan 18 08:54:33 suvorov phosphor-host-check[1051]: Check if host is running
Jan 18 08:54:46 suvorov phosphor-host-check[1051]: Error in mapper call for control host
Jan 18 08:54:46 suvorov phosphor-host-check[1051]: terminate called after throwing an instance of 'std::runtime_error'
Jan 18 08:54:46 suvorov phosphor-host-check[1051]:   what():  Error in mapper call for control host
Jan 18 08:55:01 suvorov systemd[1]: phosphor-reset-host-check@0.service: Main process exited, code=dumped, status=6/ABRT
Jan 18 08:55:01 suvorov systemd[1]: Failed to start Check Host0 status on BMC reset.
Jan 18 08:55:01 suvorov systemd[1]: phosphor-reset-host-check@0.service: Unit entered failed state.
Jan 18 08:55:01 suvorov systemd[1]: phosphor-reset-host-check@0.service: Failed with result 'core-dump'.

And this is the mapper-wait that the prior service depends on via Wants and After:

root@suvorov:/lib/systemd/system# systemctl status mapper-wait\@-xyz-openbmc_project-control-host0.service --no-pager
● mapper-wait@-xyz-openbmc_project-control-host0.service - Wait for /xyz/openbmc_project/control/host0
   Loaded: loaded (/lib/systemd/system/mapper-wait@.service; static; vendor preset: enabled)
   Active: active (exited) since Fri 2019-01-18 08:55:26 UTC; 32min ago
  Process: 1059 ExecStart=/usr/bin/env mapper wait /xyz/openbmc_project/control/host0 (code=exited, status=0/SUCCESS)
 Main PID: 1059 (code=exited, status=0/SUCCESS)

Jan 18 08:54:35 suvorov systemd[1]: Starting Wait for /xyz/openbmc_project/control/host0...
Jan 18 08:55:26 suvorov systemd[1]: Started Wait for /xyz/openbmc_project/control/host0.

As you can see, the dependant service is somehow started earlier (08:54:33) than the service it depends on (08:54:35).

This looks like a bug in systemd and doesn't look like it has anything to do with multiple services providing the same d-bus object path.

What do you think?

geissonator commented 5 years ago

Yes, good points @AlexanderAmelkin , my change did not fix our issue, reopening to look at your thoughts above.

geissonator commented 5 years ago

That def does look like a systemd bug. I wonder if you've see the same up on our latest Thud release?

I put a fix up for the ipmid dbus issue you noted in https://gerrit.openbmc-project.xyz/#/c/openbmc/meta-phosphor/+/17536/

AlexanderAmelkin commented 5 years ago

@geissonator, we've checked with more or less recent https://github.com/openbmc/openbmc/commit/50ea7cc89521929e29d0836367fa8b06776ce4c2 with your patch from gerrit applied, and it still fails with very similar out-of-order timestamps:

root@Kutuzov:~# systemctl status phosphor-reset-host-check@0.service --no-pager
● phosphor-reset-host-check@0.service - Check Host0 status on BMC reset
   Loaded: loaded ( ]8;;file://Kutuzov/lib/systemd/system/obmc-host-reset@0.target.requires/../phosphor-reset-host-check@.service/lib/systemd/system/obmc-host-reset@0.target.requires/../phosphor-reset-host-check@.service ]8;;; static; vendor preset: enabled)
   Active: failed (Result: signal) since Wed 2019-01-23 11:52:18 UTC; 2min 10s ago
  Process: 1121 ExecStart=/usr/bin/env phosphor-host-check (code=killed, signal=ABRT)
 Main PID: 1121 (code=killed, signal=ABRT)

Jan 23 11:52:14 Kutuzov systemd[1]: Starting Check Host0 status on BMC reset...
Jan 23 11:52:16 Kutuzov phosphor-host-check[1121]: Check if host is running
Jan 23 11:52:17 Kutuzov phosphor-host-check[1121]: Error in mapper call for control host
Jan 23 11:52:17 Kutuzov phosphor-host-check[1121]: terminate called after throwing an instance of 'sdbusplus::exception::SdBusError'
Jan 23 11:52:17 Kutuzov phosphor-host-check[1121]:   what():  sd_bus_call: org.freedesktop.DBus.Error.FileNotFound: path or object not found
Jan 23 11:52:18 Kutuzov systemd[1]: phosphor-reset-host-check@0.service: Main process exited, code=killed, status=6/ABRT
Jan 23 11:52:18 Kutuzov systemd[1]: phosphor-reset-host-check@0.service: Failed with result 'signal'.
Jan 23 11:52:18 Kutuzov systemd[1]: Failed to start Check Host0 status on BMC reset.

root@Kutuzov:~# systemctl status phosphor-ipmi-host.service --no-pager
● phosphor-ipmi-host.service - Phosphor Inband IPMI
   Loaded: loaded ( ]8;;file://Kutuzov/lib/systemd/system/phosphor-ipmi-host.service/lib/systemd/system/phosphor-ipmi-host.service ]8;;; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2019-01-23 11:52:33 UTC; 2min 14s ago
 Main PID: 1139 (ipmid)
   CGroup: /system.slice/phosphor-ipmi-host.service
           └─1139 ipmid

Jan 23 11:52:21 Kutuzov systemd[1]: Starting Phosphor Inband IPMI...
Jan 23 11:52:30 Kutuzov ipmid[1139]: JSON file not found
Jan 23 11:52:33 Kutuzov systemd[1]: Started Phosphor Inband IPMI.

root@Kutuzov:~# cat /lib/systemd/system/phosphor-reset-host-check@.service
[Unit]
Description=Check Host%i status on BMC reset
Wants=phosphor-ipmi-host.service
After=phosphor-ipmi-host.service
Wants=obmc-host-reset-running@%i.target
Before=obmc-host-reset-running@%i.target
Wants=op-reset-chassis-on@%i.service
After=op-reset-chassis-on@%i.service
Conflicts=obmc-host-stop@%i.target
ConditionPathExists=/run/openbmc/chassis@%i-on

[Service]
RemainAfterExit=yes
Type=oneshot
ExecStart=/usr/bin/env phosphor-host-check
SyslogIdentifier=phosphor-host-check

[Install]
WantedBy=obmc-host-reset@%i.target

So it looks like a systemd bug indeed.

geissonator commented 5 years ago

Yeah, that's not good.

geissonator commented 5 years ago

I'm not seeing this issue on my system:

root@witherspoon:~# systemctl status phosphor-reset-host-check@0.service --no-pager
\u25cf phosphor-reset-host-check@0.service - Check Host0 status on BMC reset
   Loaded: loaded (]8;;file://witherspoon/lib/systemd/system/obmc-host-reset@0.target.requires/../phosphor-reset-host-check@.service/lib/systemd/system/obmc-host-reset@0.target.requires/../phosphor-reset-host-check@.service]8;;; static; vendor preset: enabled)
   Active: active (exited) since Fri 2019-01-25 19:37:49 UTC; 1min 49s ago
  Process: 1474 ExecStart=/usr/bin/env phosphor-host-check (code=exited, status=0/SUCCESS)
 Main PID: 1474 (code=exited, status=0/SUCCESS)

Jan 25 19:37:44 witherspoon systemd[1]: Starting Check Host0 status on BMC reset...
Jan 25 19:37:44 witherspoon phosphor-host-check[1474]: Check if host is running
Jan 25 19:37:45 witherspoon phosphor-host-check[1474]: Host is running!
Jan 25 19:37:49 witherspoon systemd[1]: Started Check Host0 status on BMC reset.
root@witherspoon:~# 
root@witherspoon:~# systemctl status phosphor-ipmi-host.service --no-pager
\u25cf phosphor-ipmi-host.service - Phosphor Inband IPMI
   Loaded: loaded (]8;;file://witherspoon/lib/systemd/system/phosphor-ipmi-host.service/lib/systemd/system/phosphor-ipmi-host.service]8;;; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2019-01-25 19:37:41 UTC; 2min 15s ago
 Main PID: 1433 (ipmid)
   CGroup: /system.slice/phosphor-ipmi-host.service
           \u2514\u25001433 ipmid

Jan 25 19:37:32 witherspoon systemd[1]: Starting Phosphor Inband IPMI...
Jan 25 19:37:35 witherspoon ipmid[1433]: JSON file not found
Jan 25 19:37:41 witherspoon systemd[1]: Started Phosphor Inband IPMI.
AlexanderAmelkin commented 5 years ago

@geissonator, we have a working scenario to easily reproduce this problem:

  1. Log in to BMC as root
  2. Run obmcutil poweron
  3. Run obmc-console-client and wait till the host is booted to the OS
  4. Reboot BMC (reboot)
  5. As soon as connection to BMC is terminated, wait about 30 seconds
  6. Start attempting to connect to BMC until it lets you in as root
  7. As soon as you're in to a not fully booted BMC, start running obmcutil state every 1-2 seconds until it reports full system state
  8. Observe the (Chassis.PowerState.On , Host.HostState.Off) state while the host is actually running
geissonator commented 5 years ago

This is the test I ran. There will be a period of time where the chassis is on and the host is off (that's the sequence in the service files). After a bit of time though (5 seconds or so) you see the chassis on and the host running though right? The host check needs to wait for the chassis check and when it runs, it needs to wait for the ipmid service to be up and running and for the host to respond to it's query to know it's running.

geissonator commented 5 years ago

Of course if you're hitting the bug you noted earlier ("Jan 23 11:52:17 Kutuzov phosphor-host-check[1121]: what(): sd_bus_call: org.freedesktop.DBus.Error.FileNotFound: path or object not found") then you'll never see the host state change. But our test team has run this test (reboot BMC with host up) a multitude of times without issue with this patch on Romulus and Witherspoon.