
Intermittent start up issues #3369

Closed geissonator closed 5 years ago

geissonator commented 6 years ago

We've seen this a few times in CI. Something causes the BMC to hang during startup, and a BMC reset fixes the issue. Here's some data from one failure:

Aug 23 20:32:49 w54 systemd-journald[968]: Journal started
Aug 23 20:32:49 w54 systemd-journald[968]: Runtime journal (/run/log/journal/c15ca447419348cd916bee35e4d4fb3e) is 8.0M, max 64.0M, 56.0M free.
Aug 23 20:32:49 w54 systemd-sysctl[596]: Couldn't write '16' to 'kernel/sysrq', ignoring: No such file or directory
Aug 23 20:32:49 w54 systemd-sysctl[596]: Couldn't write 'fq_codel' to 'net/core/default_qdisc', ignoring: No such file or directory
Aug 23 20:32:49 w54 systemd[1]: Starting Flush Journal to Persistent Storage...
Aug 23 20:32:49 w54 systemd-journald[968]: Time spent on flushing to /var is 28.460ms for 5 entries.
Aug 23 20:32:49 w54 systemd-journald[968]: System journal (/var/log/journal/c15ca447419348cd916bee35e4d4fb3e) is 2.5M, max 2.5M, 0B free.
Aug 23 20:32:49 w54 systemd[1]: Started Flush Journal to Persistent Storage.
Aug 23 20:32:50 w54 systemd-networkd[893]: Enumeration completed
Aug 23 20:32:50 w54 systemd[1]: Starting Create Volatile Files and Directories...
Aug 23 20:32:50 w54 systemd[1]: Started Network Service.
Aug 23 20:32:50 w54 systemd-udevd[983]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Aug 23 20:32:50 w54 systemd-networkd[893]: eth0: IPv6 successfully enabled
Aug 23 20:32:51 w54 systemd-networkd[893]: eth0: Gained carrier
Aug 23 20:32:52 w54 systemd[1]: Started Create Volatile Files and Directories.
Aug 23 20:32:52 w54 systemd-networkd[893]: eth0: Gained IPv6LL
Aug 23 20:32:52 w54 systemd[1]: Found device /dev/ttyS4.
Aug 23 20:32:52 w54 systemd[1]: Found device /dev/ttyVUART0.
Aug 23 20:32:53 w54 systemd[1]: Found device /dev/aspeed-mbox.
Aug 23 20:32:53 w54 systemd[1]: Found device /dev/aspeed-lpc-ctrl.
Aug 23 20:32:55 w54 systemd[1]: Created slice system-xyz.openbmc_project.led.controller.slice.
Aug 23 20:32:55 w54 systemd[1]: Created slice system-max31785\x2dhwmon\x2dhelper.slice.
Aug 23 20:32:55 w54 systemd[1]: Created slice system-xyz.openbmc_project.Hwmon.slice.
Aug 23 20:32:55 w54 systemd[1]: Starting Network Time Synchronization...
Aug 23 20:32:56 w54 systemd[1]: Starting Network Name Resolution...
Aug 23 20:32:57 w54 systemd[1]: Started Network Time Synchronization.
Aug 23 20:32:57 w54 systemd-resolved[1012]: Positive Trust Anchors:
Aug 23 20:32:57 w54 systemd-resolved[1012]: . IN DS 19036 8 2 49aac11d7b6f6446702e54a1607371607a1a41855200fd2ce1cdde32f24e8fb5
Aug 23 20:32:57 w54 systemd-resolved[1012]: . IN DS 20326 8 2 e06d44b80b8f1d39a95c0b0d7c65d08458e880409bbc683457104237c7f8ec8d
Aug 23 20:32:57 w54 systemd-resolved[1012]: Negative trust anchors: 10.in-addr.arpa 16.172.in-addr.arpa 17.172.in-addr.arpa 18.172.in-addr.arpa 19.172.in-addr.arpa 20.172.in-addr.arpa 21.172.in-addr.arpa 22.172.in-addr.arpa 23.172.in-addr.arpa 24.172.in-addr.arpa 25.172.in-addr.arpa 26.172.in-addr.arpa 27.172.in-addr.arpa 28.172.in-addr.arpa 29.172.in-addr.arpa 30.172.in-addr.arpa 31.172.in-addr.arpa 168.192.in-addr.arpa d.f.ip6.arpa corp home internal intranet lan local private test
Aug 23 20:32:57 w54 systemd-resolved[1012]: Using system hostname 'w54'.
Aug 23 20:32:57 w54 systemd[1]: Started Network Name Resolution.
Aug 23 20:32:57 w54 systemd[1]: Reached target Host and Network Name Lookups.
Aug 23 20:32:57 w54 systemd[1]: Reached target Network.
Aug 23 20:32:57 w54 systemd[1]: Reached target System Initialization.
Aug 23 20:32:57 w54 systemd[1]: Listening on Phosphor DBUS Browser socket.
Aug 23 20:32:56 w54 systemd-timesyncd[1011]: Synchronized to time server 216.239.35.4:123 (time2.google.com).
Aug 23 20:32:56 w54 systemd[1]: Listening on Phosphor Host Console SSH Per-Connection socket.
Aug 23 20:32:57 w54 systemd[1]: Listening on Phosphor Webserver socket.
Aug 23 20:32:57 w54 systemd[1]: Started Daily Cleanup of Temporary Directories.
Aug 23 20:32:57 w54 systemd[1]: Reached target Timers.
Aug 23 20:32:57 w54 systemd[1]: Listening on phosphor-ipmi-net.socket.
Aug 23 20:32:57 w54 systemd[1]: Listening on Phosphor redfish socket (bmcweb).
Aug 23 20:32:57 w54 systemd-networkd[893]: eth0: Configured
Aug 23 20:32:57 w54 systemd[1]: Listening on D-Bus System Message Bus Socket.
Aug 23 20:32:57 w54 systemd-timesyncd[1011]: Network configuration changed, trying to establish connection.
Aug 23 20:32:57 w54 systemd[1]: Listening on dropbear.socket.
Aug 23 20:32:57 w54 systemd-timesyncd[1011]: Synchronized to time server 216.239.35.4:123 (time2.google.com).
Aug 23 20:32:57 w54 systemd[1]: Listening on Avahi mDNS/DNS-SD Stack Activation Socket.
Aug 23 20:32:57 w54 systemd[1]: Reached target Sockets.
Aug 23 20:32:57 w54 systemd[1]: Reached target Basic System.
Aug 23 20:32:57 w54 systemd[1]: Starting Hostboot Boot Settings...
Aug 23 20:32:57 w54 systemd[1]: Starting Check PNOR UBI format...
Aug 23 20:32:57 w54 systemd[1]: Starting Phosphor IPMI BT DBus Bridge...
Aug 23 20:32:57 w54 hb_settings[1015]: Hostboot serial output on VUART
Aug 23 20:32:57 w54 systemd[1]: Started Hardware RNG Entropy Gatherer Daemon.
Aug 23 20:32:57 w54 systemd[1]: Started System Logging Service.
Aug 23 20:32:58 w54 systemd[1]: Started Phosphor Dump core monitor..
Aug 23 20:32:58 w54 systemd[1]: Starting Phosphor Log Manager...
Aug 23 20:32:58 w54 systemd[1]: Starting Phosphor Settings Daemon...
Aug 23 20:32:58 w54 systemd[1]: Starting SSH Key Generation...
Aug 23 20:32:58 w54 systemd[1]: Starting Set U-Boot environment variable...
Aug 23 20:32:58 w54 systemd[1]: Starting Permit User Sessions...
Aug 23 20:32:58 w54 systemd[1]: Started Phosphor sysfs LED controller.
Aug 23 20:32:59 w54 systemd[1]: Starting Phosphor Dump Manager...
Aug 23 20:32:59 w54 systemd[1]: Started Kernel Logging Service.
Aug 23 20:32:59 w54 systemd[1]: Starting Phosphor User Manager...
Aug 23 20:32:59 w54 systemd[1]: Starting Phosphor Chassis0 Control...
Aug 23 20:32:59 w54 systemd[1]: Starting Remount the squashfs and ubi volumes after a reboot...
Aug 23 20:32:59 w54 systemd[1]: Started Lightweight SLP Server.
Aug 23 20:32:59 w54 systemd[1]: Started Phosphor sysfs LED controller.
Aug 23 20:33:00 w54 systemd[1]: Starting Clear one time boot overrides...
Aug 23 20:33:00 w54 systemd[1]: Starting Phosphor Download Manager...
Aug 23 20:33:00 w54 systemd[1]: Started Phosphor Sync Manager that syncs files to the alternate BMC chip.
Aug 23 20:33:00 w54 systemd[1]: Starting Phosphor LED Group Management Daemon...
Aug 23 20:33:00 w54 systemd[1]: Starting Phosphor Inventory Manager...
Aug 23 20:33:00 w54 systemd[1]: Started D-Bus System Message Bus.
Aug 23 20:33:01 w54 obmc-flash-bios[1036]: UBI device number 8, total 2048 LEBs (133955584 bytes, 127.7 MiB), available 520 LEBs (34012160 bytes, 32.4 MiB), LEB size 65408 bytes (63.8 KiB)
Aug 23 20:33:04 w54 systemd[1]: Started Phosphor IPMI BT DBus Bridge.
Aug 23 20:33:04 w54 systemd[1]: Started Phosphor Inventory Manager.
Aug 23 20:33:04 w54 systemd[1]: Started Phosphor Log Manager.
Aug 23 20:33:04 w54 systemd[1]: Started Phosphor Download Manager.
Aug 23 20:33:04 w54 systemd[1]: Started Phosphor Settings Daemon.
Aug 23 20:33:04 w54 systemd[1]: Starting IBM Log Manager...
Aug 23 20:33:05 w54 systemd[1]: Starting Phosphor DBus Service Discovery Manager...
Aug 23 20:33:05 w54 systemd[1]: Starting Avahi mDNS/DNS-SD Stack...
Aug 23 20:33:05 w54 systemd[1]: Starting The NGINX HTTP and reverse proxy server...
Aug 23 20:33:05 w54 systemd[1]: Started Phosphor Console Muxer listening on device /dev/ttyVUART0.
Aug 23 20:33:06 w54 systemd[1]: Reached target System Time Synchronized.
Aug 23 20:33:15 w54 phosphor-mapper[1069]: ObjectMapper startup complete.  Discovery in progress...
Aug 23 20:33:18 w54 systemd[1]: Started Hostboot Boot Settings.
Aug 23 20:33:19 w54 systemd[1]: Started Check PNOR UBI format.
Aug 23 20:33:19 w54 systemd[1]: Started SSH Key Generation.
Aug 23 20:33:19 w54 systemd[1]: Started Set U-Boot environment variable.
Aug 23 20:33:20 w54 systemd[1]: Started Permit User Sessions.
Aug 23 20:33:21 w54 systemd[1]: Started Clear one time boot overrides.
Aug 23 20:33:26 w54 nginx[1086]: nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
Aug 23 20:33:26 w54 chassis_control.py[1035]: Running ChassisControlService
Aug 23 20:33:27 w54 nginx[1086]: nginx: configuration file /etc/nginx/nginx.conf test is successful
Aug 23 20:33:27 w54 systemd[1]: Started Remount the squashfs and ubi volumes after a reboot.
Aug 23 20:33:28 w54 systemd[1]: Started The NGINX HTTP and reverse proxy server.
Aug 23 20:33:28 w54 systemd[1]: Created slice system-dropbear.slice.
Aug 23 20:33:28 w54 systemd[1]: Started SSH Per-Connection Server (9.41.165.103:39922).
Aug 23 20:33:28 w54 systemd[1]: Started SSH Per-Connection Server (9.41.165.103:39924).
Aug 23 20:33:29 w54 phosphor-mapper[1069]: org.freedesktop.systemd1 discovery failure on / - retry
Aug 23 20:33:29 w54 systemd[1]: Started SSH Per-Connection Server (9.3.62.215:54388).
Aug 23 20:33:29 w54 systemd[1]: Started Serial Getty on ttyS4.
Aug 23 20:33:29 w54 systemd[1]: Reached target Login Prompts.
Aug 23 20:33:29 w54 systemd[1]: Starting Remount the ubi volume after a reboot...
Aug 23 20:33:35 w54 obmc-flash-bmc[1125]: wdt2bite=mw.l 0x1e785024 0xa 1; mw.b 0x1e78502c 0xb3 1
Aug 23 20:33:35 w54 systemd[1]: Started Remount the ubi volume after a reboot.
Aug 23 20:33:51 w54 systemd-udevd[717]: seq 735 '/devices/platform/ahb/ahb:apb/ahb:apb:i2c@1e78a000/1e78a100.i2c-bus/i2c-3/3-0052/hwmon/hwmon9' is taking a long time
Aug 23 20:33:51 w54 systemd-udevd[717]: seq 761 '/devices/platform/ahb/ahb:apb/ahb:apb:i2c@1e78a000/1e78a180.i2c-bus/i2c-5/5-004c/hwmon/hwmon2' is taking a long time
Aug 23 20:33:51 w54 systemd-udevd[717]: seq 765 '/devices/platform/ahb/ahb:apb/ahb:apb:i2c@1e78a000/1e78a180.i2c-bus/i2c-5/5-0071/hwmon/hwmon8' is taking a long time
Aug 23 20:33:51 w54 systemd-udevd[717]: seq 763 '/devices/platform/ahb/ahb:apb/ahb:apb:i2c@1e78a000/1e78a180.i2c-bus/i2c-5/5-0070/hwmon/hwmon7' is taking a long time
Aug 23 20:33:51 w54 systemd-udevd[717]: seq 743 '/devices/platform/ahb/ahb:apb/ahb:apb:i2c@1e78a000/1e78a100.i2c-bus/i2c-3/3-0069/hwmon/hwmon4' is taking a long time
Aug 23 20:33:51 w54 systemd-udevd[717]: seq 756 '/devices/platform/ahb/ahb:apb/ahb:apb:i2c@1e78a000/1e78a140.i2c-bus/i2c-4/4-0071/hwmon/hwmon6' is taking a long time
Aug 23 20:33:51 w54 systemd-udevd[717]: seq 754 '/devices/platform/ahb/ahb:apb/ahb:apb:i2c@1e78a000/1e78a140.i2c-bus/i2c-4/4-0070/hwmon/hwmon5' is taking a long time
Aug 23 20:33:51 w54 systemd-udevd[717]: seq 752 '/devices/platform/ahb/ahb:apb/ahb:apb:i2c@1e78a000/1e78a140.i2c-bus/i2c-4/4-004c/hwmon/hwmon1' is taking a long time
Aug 23 20:33:51 w54 systemd-udevd[717]: seq 740 '/devices/platform/ahb/ahb:apb/ahb:apb:i2c@1e78a000/1e78a100.i2c-bus/i2c-3/3-0068/hwmon/hwmon3' is taking a long time
Aug 23 20:33:51 w54 systemd-udevd[717]: seq 747 '/devices/platform/ahb/ahb:apb/ahb:apb:i2c@1e78a000/1e78a100.i2c-bus/i2c-3/3-0077/iio:device0' is taking a long time
Aug 23 20:33:54 w54 phosphor-mapper[1069]: ObjectMapper discovery complete
Aug 23 20:34:29 w54 systemd[1]: xyz.openbmc_project.User.Manager.service: Start operation timed out. Terminating.
Aug 23 20:34:29 w54 systemd[1]: xyz.openbmc_project.User.Manager.service: Failed with result 'timeout'.
Aug 23 20:34:29 w54 systemd[1]: Failed to start Phosphor User Manager.
Aug 23 20:34:30 w54 systemd[1]: xyz.openbmc_project.LED.GroupManager.service: Start operation timed out. Terminating.
Aug 23 20:34:30 w54 systemd[1]: xyz.openbmc_project.LED.GroupManager.service: Failed with result 'timeout'.
Aug 23 20:34:30 w54 systemd[1]: Failed to start Phosphor LED Group Management Daemon.
Aug 23 20:34:31 w54 systemd[1]: xyz.openbmc_project.User.Manager.service: Service hold-off time over, scheduling restart.
Aug 23 20:34:31 w54 systemd[1]: xyz.openbmc_project.User.Manager.service: Scheduled restart job, restart counter is at 1.
Aug 23 20:34:31 w54 systemd[1]: Stopped Phosphor User Manager.
Aug 23 20:34:31 w54 systemd[1]: Starting Phosphor User Manager...
Aug 23 20:34:32 w54 systemd[1]: xyz.openbmc_project.LED.GroupManager.service: Service hold-off time over, scheduling restart.
Aug 23 20:34:32 w54 systemd[1]: xyz.openbmc_project.LED.GroupManager.service: Scheduled restart job, restart counter is at 1.
Aug 23 20:34:32 w54 systemd[1]: Stopped Phosphor LED Group Management Daemon.
Aug 23 20:34:32 w54 systemd[1]: Starting Phosphor LED Group Management Daemon...
Aug 23 20:34:35 w54 systemd[1]: com.ibm.Logging.service: Start operation timed out. Terminating.
Aug 23 20:34:35 w54 systemd[1]: com.ibm.Logging.service: Failed with result 'timeout'.
Aug 23 20:34:35 w54 systemd[1]: Failed to start IBM Log Manager.
Aug 23 20:34:35 w54 systemd[1]: avahi-daemon.service: Start operation timed out. Terminating.
Aug 23 20:34:35 w54 systemd[1]: avahi-daemon.service: Failed with result 'timeout'.
Aug 23 20:34:35 w54 systemd[1]: Failed to start Avahi mDNS/DNS-SD Stack.
Aug 23 20:34:35 w54 systemd[1]: Reached target Multi-User System.
Aug 23 20:34:36 w54 systemd[1]: com.ibm.Logging.service: Service hold-off time over, scheduling restart.
Aug 23 20:34:36 w54 systemd[1]: com.ibm.Logging.service: Scheduled restart job, restart counter is at 1.
Aug 23 20:34:36 w54 systemd[1]: Stopped IBM Log Manager.
Aug 23 20:34:36 w54 systemd[1]: Starting IBM Log Manager...
[w54]# cat /etc/os-release 
ID="openbmc-phosphor"
NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)"
VERSION="v2.3-192"
VERSION_ID="v2.3-192-gcb7487b-dirty"
PRETTY_NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) v2.3-192"
BUILD_ID="v2.3"

[w54]# uname -a
Linux w54 4.17.14-c71662c749dc66a542e717dbd51fefab995c9455 #1 Thu Aug 23 19:05:04 UTC 2018 armv6l GNU/Linux
geissonator commented 6 years ago

This hit again today in a test which was verifying master:

[w54]# cat /etc/os-release 
ID="openbmc-phosphor"
NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)"
VERSION="v2.3-204"
VERSION_ID="v2.3-204-g627b36608"
PRETTY_NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) v2.3-204"
BUILD_ID="v2.3"

I still don't see what's happening, just lots of timeouts. Failures look like this:

Aug 28 16:13:28 w54 systemd[1]: Starting OpenBMC Software Update Manager...
Aug 28 16:13:29 w54 phosphor-image-updater[7992]: Error in mapper GetSubTreePath
Aug 28 16:13:31 w54 phosphor-image-updater[7992]: terminate called after throwing an instance of 'sdbusplus::exception::SdBusError'
Aug 28 16:13:31 w54 phosphor-image-updater[7992]:   what():  sd_bus_call noreply: org.freedesktop.DBus.Error.TimedOut: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)

I manually stopped the services, then re-enabled them, and everything was fine. So it seems something goes wrong early in the boot that causes all the services to get backed up.
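For reference, the recovery is just restarting whatever systemd flagged as failed once the initial storm passes; a sketch (unit names taken from the journal above; the exact set varies per failure):

# see which units gave up with a start timeout
systemctl list-units --state=failed

# restart the stragglers once the box has settled down
systemctl restart xyz.openbmc_project.User.Manager.service \
                  xyz.openbmc_project.LED.GroupManager.service \
                  com.ibm.Logging.service avahi-daemon.service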

geissonator commented 6 years ago

Hit this again; something is up with the mapper:

Sep 04 14:25:02 w54 phosphor-mapper[11923]: org.openbmc.managers.System discovery failure on /org - retry
Sep 04 14:25:02 w54 phosphor-mapper[11923]: org.openbmc.managers.System discovery failure on /
Sep 04 14:25:02 w54 phosphor-mapper[11923]: Traceback (most recent call last):
Sep 04 14:25:02 w54 phosphor-mapper[11923]:   File "/usr/lib/python2.7/site-packages/obmc/mapper/server.py", line 65, in __call__
Sep 04 14:25:02 w54 phosphor-mapper[11923]:     self._find_interfaces(path)
Sep 04 14:25:02 w54 phosphor-mapper[11923]:   File "/usr/lib/python2.7/site-packages/obmc/mapper/server.py", line 162, in _find_interfaces
Sep 04 14:25:02 w54 phosphor-mapper[11923]:     obj = conn.get_object(service, path, introspect=False)
Sep 04 14:25:02 w54 phosphor-mapper[11923]:   File "/usr/lib/python2.7/site-packages/dbus/bus.py", line 241, in get_object
Sep 04 14:25:02 w54 phosphor-mapper[11923]:     follow_name_owner_changes=follow_name_owner_changes)
Sep 04 14:25:02 w54 phosphor-mapper[11923]:   File "/usr/lib/python2.7/site-packages/dbus/proxies.py", line 248, in __init__
Sep 04 14:25:02 w54 phosphor-mapper[11923]:     self._named_service = conn.activate_name_owner(bus_name)
Sep 04 14:25:02 w54 phosphor-mapper[11923]:   File "/usr/lib/python2.7/site-packages/dbus/bus.py", line 180, in activate_name_owner
Sep 04 14:25:02 w54 phosphor-mapper[11923]:     self.start_service_by_name(bus_name)
Sep 04 14:25:02 w54 phosphor-mapper[11923]:   File "/usr/lib/python2.7/site-packages/dbus/bus.py", line 278, in start_service_by_name
Sep 04 14:25:02 w54 phosphor-mapper[11923]:     'su', (bus_name, flags)))
Sep 04 14:25:02 w54 phosphor-mapper[11923]:   File "/usr/lib/python2.7/site-packages/dbus/connection.py", line 651, in call_blocking
Sep 04 14:25:02 w54 phosphor-mapper[11923]:     message, timeout)
Sep 04 14:25:02 w54 phosphor-mapper[11923]: DBusException: org.freedesktop.DBus.Error.ServiceUnknown: The name org.openbmc.managers.System was not provided by any .service files
geissonator commented 6 years ago

Hit this a few more times over the last few days. One interesting thing when it hits:

# uptime
 19:20:13 up 7 min,  load average: 19.17, 7.81, 3.02

Something seems to be hammering the CPU, which could be what's leading to all the service timeouts. It's tough to tell, though; nothing obvious in "top" showed what was taking all the cycles.
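If anyone else catches a box in this state, a quick snapshot along these lines would help narrow it down (sketch only; the second command assumes a full procps ps is installed, since the BusyBox applet may not accept these options):

# one batch-mode snapshot of the busiest tasks
top -b -n 1 | head -n 25

# CPU share per process since it started, highest first (procps ps)
ps -eo pid,pcpu,etime,args --sort=-pcpu | head -n 15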

gkeishin commented 6 years ago

Now we see this on master too for the recent build https://openpower.xyz/job/openbmc-build/1649/

"v2.3-284-g95949a1"

geissonator commented 6 years ago

When a system gets into the failed state, running "systemctl disable systemd-coredump@.service" seems to make the CPU thrashing stop.

One theory is that some early (and critical) service hits an error or timeout, which triggers a cascade of core dumps from the other services; systemd-coredump then starts hammering the filesystem and CPU, which keeps the CPU thrashing and service crashes going.

The recent addition of REST server logging and the BPF logging issue may be contributing to the initial failure by adding load and CPU cycles to the systemd-journald process. Both have fixes in flight.
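As a stopgap while those fixes land, the core-dump side of the loop could also be throttled at the source rather than disabling the unit; a sketch of the knobs (not proposing this as the fix, just noting them):

# /etc/systemd/coredump.conf -- keep systemd-coredump from writing and
# compressing dumps to flash while we chase the root cause
[Coredump]
Storage=none
ProcessSizeMax=0

Alternatively, kernel.core_pattern could be pointed back at a plain file so systemd-coredump never runs, at the cost of losing the journal metadata for each crash.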

geissonator commented 6 years ago

Some mapper data from this failure:

Sep 25 19:14:42 witherspoon phosphor-mapper[1063]: ObjectMapper startup complete.  Discovery in progress...
Sep 25 19:15:06 witherspoon phosphor-mapper[1063]: org.freedesktop.systemd1 discovery failure on / - retry
Sep 25 19:15:31 witherspoon phosphor-mapper[1063]: ObjectMapper discovery complete
Sep 25 19:19:50 witherspoon phosphor-mapper[1254]: ObjectMapper startup complete.  Discovery in progress...
Sep 25 19:20:23 witherspoon phosphor-mapper[1254]: xyz.openbmc_project.Network discovery failure on / - retry
Sep 25 19:20:27 witherspoon phosphor-mapper[1254]: xyz.openbmc_project.Control.Host discovery failure on / - retry
Sep 25 19:20:58 witherspoon phosphor-mapper[1254]: xyz.openbmc_project.Network discovery failure on /
Sep 25 19:20:58 witherspoon phosphor-mapper[1254]: DBusException: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Sep 25 19:20:58 witherspoon phosphor-mapper[1254]: Traceback (most recent call last):
Sep 25 19:20:58 witherspoon phosphor-mapper[1254]:   File "/usr/lib/python2.7/site-packages/dbus/connection.py", line 607, in msg_reply_handler
Sep 25 19:20:58 witherspoon phosphor-mapper[1254]:     *message.get_args_list()))
geissonator commented 6 years ago

We don't see this on hardware anymore, but we do still see it intermittently in our Romulus QEMU runs. Captured the first service to fail after mapper discovery completed: xyz.openbmc_project.Syslog.Config.service

Oct 22 01:36:19 romulus avahi-daemon[913]: Service "romulus" (/services/ssh.service) successfully established.
Oct 22 01:36:19 romulus avahi-daemon[913]: Service "romulus" (/services/sftp-ssh.service) successfully established.
Oct 22 01:36:20 romulus systemd[1]: Started Phosphor Console Muxer listening on device /dev/ttyVUART0.
Oct 22 01:36:19 romulus avahi-daemon[913]: Service "phosphor_rest" (/services/phosphor_rest.service) successfully established.
Oct 22 01:36:20 romulus systemd[1]: Starting Phosphor Chassis0 Control...
Oct 22 01:36:20 romulus avahi-daemon[913]: Service "obmc_console" (/services/obmc_console.service) successfully established.
Oct 22 01:36:21 romulus hb_settings[949]: Hostboot serial output on VUART
Oct 22 01:36:21 romulus systemd[1]: Started Phosphor sysfs LED controller.
Oct 22 01:36:21 romulus systemd[1]: Starting Phosphor Download Manager...
Oct 22 01:36:22 romulus systemd[1]: Started Phosphor sysfs LED controller.
Oct 22 01:36:22 romulus systemd[1]: Started Daily Cleanup of Temporary Directories.
Oct 22 01:36:22 romulus systemd[1]: Reached target Timers.
Oct 22 01:36:29 romulus nginx[927]: Generating a 2048 bit RSA private key
Oct 22 01:36:30 romulus nslcd[966]: version 0.9.8 starting
Oct 22 01:36:30 romulus nslcd[966]: accepting connections
Oct 22 01:36:31 romulus systemd[1]: Started System Logging Service.
Oct 22 01:36:31 romulus button_power.exe[933]: GPIO Lookup:  POWER_BUTTON = 347,both
Oct 22 01:36:31 romulus dbus-daemon[909]: [system] Failed to activate service 'org.freedesktop.hostname1': timed out (service_start_timeout=25000ms)
Oct 22 01:36:32 romulus systemd-networkd[697]: Could not set hostname: Method call timed out
Oct 22 01:36:32 romulus button_reset.exe[952]: GPIO Lookup:  RESET_BUTTON = 353,both
Oct 22 01:36:32 romulus power_control.exe[920]: Power GPIO power good input: SYS_PWROK_BUFF
Oct 22 01:36:32 romulus power_control.exe[920]: Power GPIO 2 power_up outputs
Oct 22 01:36:32 romulus power_control.exe[920]: Power GPIO power_up[0] = SOFTWARE_PGOOD active HIGH
Oct 22 01:36:32 romulus power_control.exe[920]: Power GPIO power_up[1] = BMC_POWER_UP active HIGH
Oct 22 01:36:32 romulus power_control.exe[920]: Power GPIO 0 reset outputs
Oct 22 01:36:32 romulus power_control.exe[920]: Power GPIO 0 pci reset outputs
Oct 22 01:36:32 romulus power_control.exe[920]: GPIO Lookup:  SYS_PWROK_BUFF = 306,in
Oct 22 01:36:32 romulus power_control.exe[920]: GPIO Lookup:  SOFTWARE_PGOOD = 417,out
Oct 22 01:36:32 romulus power_control.exe[920]: GPIO Lookup:  BMC_POWER_UP = 305,out
Oct 22 01:36:32 romulus power_control.exe[920]: Pgood state: 0
Oct 22 01:36:33 romulus systemd[1]: Started LDAP daemon.
Oct 22 01:36:33 romulus systemd[1]: Started Name Service Cache Daemon.
Oct 22 01:36:34 romulus systemd[1]: Started Permit User Sessions.
Oct 22 01:36:35 romulus systemd[1]: Started Hostboot Boot Settings.
Oct 22 01:36:35 romulus systemd[1]: Started Clear one time boot overrides.
Oct 22 01:36:37 romulus systemd[1]: Started Serial Getty on ttyS4.
Oct 22 01:36:37 romulus systemd[1]: Reached target Login Prompts.
Oct 22 01:36:39 romulus nginx[927]: ...........+++++
Oct 22 01:36:46 romulus nginx[927]: .............+++++
Oct 22 01:36:46 romulus nginx[927]: writing new private key to '/etc/ssl/certs/nginx/cert.pem'
Oct 22 01:36:46 romulus login[982]: PAM unable to dlopen(/lib/security/pam_ipmicheck.so): /lib/security/pam_ipmicheck.so: cannot open shared object file: No such file or directory
Oct 22 01:36:46 romulus login[982]: PAM adding faulty module: /lib/security/pam_ipmicheck.so
Oct 22 01:36:46 romulus nginx[927]: -----
Oct 22 01:36:47 romulus login[982]: PAM unable to dlopen(/lib/security/pam_ipmisave.so): /lib/security/pam_ipmisave.so: cannot open shared object file: No such file or directory
Oct 22 01:36:47 romulus login[982]: PAM adding faulty module: /lib/security/pam_ipmisave.so
Oct 22 01:36:47 romulus login[982]: PAM unable to dlopen(/lib/security/pam_systemd.so): /lib/security/pam_systemd.so: cannot open shared object file: No such file or directory
Oct 22 01:36:47 romulus login[982]: PAM adding faulty module: /lib/security/pam_systemd.so
Oct 22 01:36:47 romulus rngd[926]: too many FIPS failures, disabling entropy source
Oct 22 01:36:47 romulus rngd[926]: No entropy sources working, exiting rngd
Oct 22 01:36:48 romulus login[982]: pam_lastlog(login:session): file /var/log/lastlog created
Oct 22 01:36:48 romulus login[982]: pam_unix(login:session): session opened for user root by LOGIN(uid=0)
Oct 22 01:36:48 romulus login[985]: ROOT LOGIN  on '/dev/ttyS4'
Oct 22 01:36:50 romulus nginx[984]: nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
Oct 22 01:36:50 romulus chassis_control.py[955]: Running ChassisControlService
Oct 22 01:36:50 romulus nginx[984]: nginx: configuration file /etc/nginx/nginx.conf test is successful
Oct 22 01:36:51 romulus download_manager.py[957]: Running Download Manager
Oct 22 01:36:51 romulus systemd[1]: Started The NGINX HTTP and reverse proxy server.
Oct 22 01:36:52 romulus phosphor-mapper[936]: ObjectMapper startup complete.  Discovery in progress...
Oct 22 01:36:53 romulus avahi-daemon[924]: ../../avahi-0.7/avahi-daemon/chroot.c: open() failed: No such file or directory
Oct 22 01:36:53 romulus avahi-daemon[913]: Failed to open /usr/share/dbus-1/interfaces/org.freedesktop.Avahi.Server.xml: Invalid argument
Oct 22 01:36:53 romulus avahi-daemon[913]: Failed to load introspection data.
Oct 22 01:36:56 romulus systemd-udevd[572]: seq 620 '/devices/platform/ahb/ahb:apb/1e6e9000.adc/iio:device0' is taking a long time
Oct 22 01:36:56 romulus systemd-udevd[572]: seq 636 '/devices/platform/ahb/ahb:apb/1e786000.pwm-tacho-controller/hwmon/hwmon0' is taking a long time
Oct 22 01:36:58 romulus systemd-udevd[572]: seq 696 '/devices/platform/iio-hwmon-battery/hwmon/hwmon1' is taking a long time
Oct 22 01:36:58 romulus sh[906]: Public key portion is:
Oct 22 01:36:58 romulus sh[906]: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCs3EnMRjNEZxDXzboT45iFs37dloQFERVRFuCz4eWYriYqX3bbunYvOFkVHqSn5Yu76r1TDcz8AJl8XWuRDZ03ib9tpZWDdPtJNDUhVjyZuz2INd8vFY3smZCOMWbya1Atbo9blKFdrF+pbiyoo80oT7qsHaZ70gFOsGpC80+WlwRyQsuqNLXVEfkNTjrsOxdWUuj7LvTieUklL3Acax0EoXTxPsC+V4a9q+OYE+S6uLOmUEyD7J2BillrGUT2kFeURY78SHevxt3CsoZVtySOLLe41qyTJ39T7+EsGPf+ddOhnUDkFhCrxiYJGde7GyAxigvWhXArQqe1+6EGaO0j root@romulus
Oct 22 01:36:58 romulus sh[906]: Fingerprint: md5 48:06:ef:00:09:01:cf:48:58:c9:d8:ac:f7:ef:19:e6
Oct 22 01:36:58 romulus systemd[1]: Started SSH Key Generation.
Oct 22 01:37:00 romulus phosphor-mapper[936]: ObjectMapper discovery complete
Oct 22 01:37:16 romulus systemd[1]: Created slice system-dropbear.slice.
Oct 22 01:37:16 romulus systemd[1]: Started SSH Per-Connection Server (172.17.0.3:58158).
Oct 22 01:37:16 romulus dropbear[993]: Child connection from ::ffff:172.17.0.3:58158
Oct 22 01:37:17 romulus dropbear[993]: PAM password auth succeeded for 'root' from ::ffff:172.17.0.3:58158
Oct 22 01:37:35 romulus systemd[1]: xyz.openbmc_project.Syslog.Config.service: Start operation timed out. Terminating.
Oct 22 01:37:36 romulus systemd[1]: xyz.openbmc_project.Syslog.Config.service: Failed with result 'timeout'.
Oct 22 01:37:36 romulus systemd[1]: Failed to start Rsyslog config updater.
Oct 22 01:37:37 romulus systemd[1]: xyz.openbmc_project.Syslog.Config.service: Service hold-off time over, scheduling restart.
Oct 22 01:37:37 romulus systemd[1]: xyz.openbmc_project.Syslog.Config.service: Scheduled restart job, restart counter is at 1.
Oct 22 01:37:37 romulus systemd[1]: Stopped Rsyslog config updater.
Oct 22 01:37:37 romulus systemd[1]: Starting Rsyslog config updater...
Oct 22 01:37:39 romulus systemd[1]: xyz.openbmc_project.Logging.service: Start operation timed out. Terminating.
Oct 22 01:37:39 romulus systemd[1]: xyz.openbmc_project.Logging.service: Failed with result 'timeout'.
Oct 22 01:37:39 romulus systemd[1]: Failed to start Phosphor Log Manager.
Oct 22 01:37:39 romulus systemd[1]: Dependency failed for IBM Log Manager.
Oct 22 01:37:39 romulus systemd[1]: com.ibm.Logging.service: Job com.ibm.Logging.service/start failed with result 'dependency'.
Oct 22 01:37:39 romulus systemd[1]: xyz.openbmc_project.Inventory.Manager.service: Start operation timed out. Terminating.
Oct 22 01:37:39 romulus systemd[1]: xyz.openbmc_project.Inventory.Manager.service: Failed with result 'timeout'.
Oct 22 01:37:39 romulus systemd[1]: Failed to start Phosphor Inventory Manager.
Oct 22 01:37:39 romulus systemd[1]: avahi-daemon.service: Start operation timed out. Terminating.
Oct 22 01:37:39 romulus avahi-daemon[913]: Got SIGTERM, quitting.
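One experiment worth trying on the QEMU runs is giving the slowest units more start-up headroom with a drop-in, so a congested first boot doesn't turn into a restart storm; a hypothetical sketch (unit and value are just examples):

# /etc/systemd/system/xyz.openbmc_project.Syslog.Config.service.d/timeout.conf
[Service]
TimeoutStartSec=3min

followed by a systemctl daemon-reload. It wouldn't fix the underlying CPU contention, but it would tell us whether the timeouts are purely a throughput problem.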
geissonator commented 6 years ago

Got a change in recently to disable core dumps, which were causing this problem to snowball. Managed to get some FFDC from a failure: https://openpower.xyz/job/openbmc-test-qemu-ci/5848/artifact/logs/20181029144041159347_TestsTestRestInterfaces/20181029144041159347_VerifyRESTPutMessageJSONCompliant/

I'm not sure if this is a symptom or a cause, but dbus-daemon runs into issues activating systemd-backed services when this failure hits:

Oct 29 14:26:15 romulus dbus-daemon[916]: [system] Failed to activate service 'org.freedesktop.hostname1': timed out (service_start_timeout=25000ms)
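To tell whether the dbus activation failures lead or trail the unit start timeouts, pulling both streams out of the journal for the failing boot should be enough (sketch):

# dbus-daemon's own messages for this boot
journalctl -b -t dbus-daemon --no-pager

# everything logged at err or worse, in order
journalctl -b -p err --no-pager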
geissonator commented 5 years ago

The going theory is that this is a performance-related issue. Recent changes have resulted in more applications being started during the initial boot of the BMC. Also, when booting QEMU, it's treated as a genesis IPL, so things like nginx and dropbear run very CPU-intensive workloads to generate their keys.

https://svgshare.com/i/9DF.svg shows a systemd-bootchart of the boot to standby. Between time 45 and 95, the CPU is pegged (purple) and the CPU wait is pegged (yellow).

The top CPU consumers during this time are:

8883.3ms /usr/sbin/dropbearkey -t rsa -f /etc/dropbear/dropbear_rsa_host_key

8325.0ms python /usr/sbin/phosphor-mapper --service_namespaces=org.openbmc com.ibm xyz.openbmc_project org.open_power --interface_namespaces=org.openbmc com.ibm xyz.openbmc_project org.open_power --service_blacklists=

7503.3ms openssl req -x509 -sha256 -newkey rsa:2048 -keyout /etc/ssl/certs/nginx/cert.pem -out /etc/ssl/certs/nginx/cert.pem -days 3650 -subj /O=openbmc-project.xyz/CN=localhost -nodes

4651.6ms /lib/systemd/systemd-journald

3873.1ms [kworker/0:1

3503.7ms /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only

2433.6ms python /usr/sbin/chassis_control.py

2034.8ms rngd
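For completeness, and assuming systemd-bootchart is built into the image: a chart like the one linked above is normally captured by adding init=/lib/systemd/systemd-bootchart to the kernel command line (path may differ per image), but it can also be run after the fact to sample a window like the pegged stretch, since it charts whatever happens while it samples:

# sample for ~60 s at 10 Hz and write the SVG under /tmp
systemd-bootchart -r -f 10 -n 600 -o /tmp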
geissonator commented 5 years ago

The good news is that the C++ mapper and the removal of nginx should really help out here. I could make some changes to try to artificially serialize some of these services, but that seems to go against the grain of doing things as efficiently as possible.
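For the record, the artificial serialization would just be ordering drop-ins along these lines (unit names are illustrative, not a proposal):

# /etc/systemd/system/xyz.openbmc_project.Inventory.Manager.service.d/serialize.conf
# hold a heavy daemon back until the mapper has been started
[Unit]
After=xyz.openbmc_project.ObjectMapper.service

plus a systemctl daemon-reload, repeated for whichever services we wanted to stagger.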

geissonator commented 5 years ago

https://github.com/openbmc/qemu/issues/14 tracks an offshoot of this issue. The primary problem has been fixed by creating a dedicated QEMU job, i.e. https://openpower.xyz/view/CI/job/run-qemu-ci/