openbmc / phosphor-state-manager

state discover, ncsi and other services crashed on Host reboot path #14

Closed gkeishin closed 4 years ago

gkeishin commented 4 years ago
root@xx.xx.xx.xx:~# systemctl status phosphor-discover-system-state@0.service | cat
* phosphor-discover-system-state@0.service - Reboot If Enabled
     Loaded: loaded (/lib/systemd/system/phosphor-discover-system-state@.service; static; vendor preset: enabled)
     Active: failed (Result: start-limit-hit) since Fri 2017-02-17 14:46:52 UTC; 17min ago
    Process: 1220 ExecStart=/usr/bin/phosphor-discover-system-state --host 0 (code=exited, status=0/SUCCESS)
   Main PID: 1220 (code=exited, status=0/SUCCESS)

Feb 17 14:46:51 xx.xx.xx.xx systemd[1]: Starting Reboot If Enabled...
Feb 17 14:46:51 xx.xx.xx.xx phosphor-discover-system-state[1220]: Host power is off, checking power policy
Feb 17 14:46:52 xx.xx.xx.xx systemd[1]: phosphor-discover-system-state@0.service: Succeeded.
Feb 17 14:46:52 xx.xx.xx.xx systemd[1]: Finished Reboot If Enabled.
Feb 17 14:47:00 xx.xx.xx.xx systemd[1]: phosphor-discover-system-state@0.service: Start request repeated too quickly.
Feb 17 14:47:00 xx.xx.xx.xx systemd[1]: phosphor-discover-system-state@0.service: Failed with result 'start-limit-hit'.
Feb 17 14:47:00 xx.xx.xx.xx systemd[1]: Failed to start Reboot If Enabled.
root@xx.xx.xx.xx:~#

root@xx.xx.xx.xx:~# systemctl status ncsi-netlink.service | cat
* ncsi-netlink.service - Stop the ethernet link failover
     Loaded: loaded (/lib/systemd/system/ncsi-netlink.service; enabled; vendor preset: enabled)
     Active: failed (Result: start-limit-hit) since Fri 2017-02-17 14:46:49 UTC; 15min ago
    Process: 1190 ExecStart=/usr/bin/env ncsi-netlink --set -x 2 -p 0 -c 0 (code=exited, status=0/SUCCESS)
   Main PID: 1190 (code=exited, status=0/SUCCESS)

Feb 17 14:46:48 xx.xx.xx.xx systemd[1]: Starting Stop the ethernet link failover...
Feb 17 14:46:49 xx.xx.xx.xx systemd[1]: ncsi-netlink.service: Succeeded.
Feb 17 14:46:49 xx.xx.xx.xx systemd[1]: Finished Stop the ethernet link failover.
Feb 17 14:46:57 xx.xx.xx.xx systemd[1]: ncsi-netlink.service: Start request repeated too quickly.
Feb 17 14:46:57 xx.xx.xx.xx systemd[1]: ncsi-netlink.service: Failed with result 'start-limit-hit'.
Feb 17 14:46:57 xx.xx.xx.xx systemd[1]: Failed to start Stop the ethernet link failover.
root@xx.xx.xx.xx:~#

root@xx.xx.xx.xx:~# journalctl --no-pager -b | grep obmc-flash-bmc-setenv.service
root@xx.xx.xx.xx:~# systemctl status phosphor-reboot-host@0.service | cat
* phosphor-reboot-host@0.service - Reboot host0
     Loaded: loaded (/lib/systemd/system/phosphor-reboot-host@.service; static; vendor preset: enabled)
     Active: failed (Result: exit-code) since Fri 2017-02-17 14:46:59 UTC; 16min ago
    Process: 1187 ExecStart=/bin/sh -c sleep 5 && systemctl start obmc-host-startmin@0.target (code=exited, status=1/FAILURE)
   Main PID: 1187 (code=exited, status=1/FAILURE)

Feb 17 14:46:45 xx.xx.xx.xx systemd[1]: Started Reboot host0.
Feb 17 14:46:55 xx.xx.xx.xx sh[1187]: A dependency job for obmc-host-startmin@0.target failed. See 'journalctl -xe' for details.
Feb 17 14:46:59 xx.xx.xx.xx systemd[1]: phosphor-reboot-host@0.service: Main process exited, code=exited, status=1/FAILURE
Feb 17 14:46:59 xx.xx.xx.xx systemd[1]: phosphor-reboot-host@0.service: Failed with result 'exit-code'.
root@xx.xx.xx.xx:~#

geissonator commented 4 years ago

I can't find what changed upstream in systemd, but it now appears that if a target has the following in it:

Wants=multi-user.target
After=multi-user.target

systemd will try to start multi-user.target again whenever that target is started. This is fine for services that are either not oneshot (i.e. they keep running) or are oneshot and have RemainAfterExit=yes.

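However, if a service is oneshot, does not have RemainAfterExit=yes set, and is required by multi-user.target, it appears it will now get started again whenever a target with Wants=multi-user.target is started. Such a service is inactive once its process exits, so systemd queues a fresh start job for it, and several of those in quick succession would explain the start-limit-hit failures in the logs above.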

Thinking about this, Wants=multi-user.target really doesn't make sense. That target always runs anyway, so only After=multi-user.target is needed for targets that want to ensure it has completed before they run.
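
As a rough sketch of that point, a target that only needs ordering against multi-user.target might look like the following. The unit name and description are hypothetical and are not taken from this repository or from the fix for this issue.

```ini
# example-host-start@.target (hypothetical name, for illustration only)
[Unit]
Description=Example target ordered after multi-user.target
# Ordering only: wait for multi-user.target if it is part of the same
# transaction, but do not pull it in.
After=multi-user.target
# Deliberately no Wants=multi-user.target here, so starting this target
# does not queue new start jobs for multi-user.target's dependencies.
```

After= on its own does not cause multi-user.target to be started; it only orders the two units when both are being started together.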

This also highlights that a oneshot service that should run only once per boot of the BMC really should set RemainAfterExit=yes.
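
A minimal sketch of such a unit, assuming a hypothetical service name and a placeholder command (this is an illustration, not the actual change made for this issue):

```ini
# example-once-per-boot.service (hypothetical name, for illustration only)
[Unit]
Description=Example task that should run once per BMC boot

[Service]
Type=oneshot
# Keep the unit in the "active" state after the command exits, so a later
# start request that reaches this unit is a no-op instead of a re-run.
RemainAfterExit=yes
# Placeholder command; a real unit would run its own binary here.
ExecStart=/bin/true

[Install]
WantedBy=multi-user.target
```

Without RemainAfterExit=yes the unit goes back to inactive as soon as the command finishes, which is what allows a second start job to run it again.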

geissonator commented 4 years ago

A fix is up for review here: https://gerrit.openbmc-project.xyz/c/openbmc/phosphor-state-manager/+/32004