This hit again today in a test which was verifying master:
[w54]# cat /etc/os-release
ID="openbmc-phosphor"
NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)"
VERSION="v2.3-204"
VERSION_ID="v2.3-204-g627b36608"
PRETTY_NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) v2.3-204"
BUILD_ID="v2.3"
I still don't see what's happening, just lots of timeouts. It fails like this:
Aug 28 16:13:28 w54 systemd[1]: Starting OpenBMC Software Update Manager...
Aug 28 16:13:29 w54 phosphor-image-updater[7992]: Error in mapper GetSubTreePath
Aug 28 16:13:31 w54 phosphor-image-updater[7992]: terminate called after throwing an instance of 'sdbusplus::exception::SdBusError'
Aug 28 16:13:31 w54 phosphor-image-updater[7992]: what(): sd_bus_call noreply: org.freedesktop.DBus.Error.TimedOut: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
I manually stopped the services, then re-enabled them, and everything was fine. So it seems something goes wrong early in the boot that causes all the services to get backed up.
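For context, the mapper call the updater is making when it dies looks roughly like the following. This is only a Python sketch of the equivalent D-Bus traffic (phosphor-image-updater itself is C++/sdbusplus), and the software subtree path and interface name are illustrative assumptions; the point is that the same TimedOut/NoReply errors show up on any blocking mapper call when the bus is backed up:

```python
# Sketch only: approximates the mapper GetSubTreePaths call the (C++)
# image updater makes at startup.  The subtree path and interface below
# are illustrative assumptions, not the updater's exact arguments.
import dbus

MAPPER_BUS = 'xyz.openbmc_project.ObjectMapper'
MAPPER_PATH = '/xyz/openbmc_project/object_mapper'
MAPPER_IFACE = 'xyz.openbmc_project.ObjectMapper'

bus = dbus.SystemBus()
mapper = dbus.Interface(
    bus.get_object(MAPPER_BUS, MAPPER_PATH), MAPPER_IFACE)

try:
    # Use a longer reply timeout than the defaults that are expiring
    # in the journal above, so a slow boot doesn't immediately throw.
    paths = mapper.GetSubTreePaths(
        '/xyz/openbmc_project/software', 0,
        ['xyz.openbmc_project.Software.Version'], timeout=60)
    print('\n'.join(str(p) for p in paths))
except dbus.exceptions.DBusException as e:
    # On the failing boots this is where TimedOut/NoReply surfaces.
    print('mapper GetSubTreePaths failed: %s' % e.get_dbus_name())
```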
Hit it again; something is up with the mapper:
Sep 04 14:25:02 w54 phosphor-mapper[11923]: org.openbmc.managers.System discovery failure on /org - retry
Sep 04 14:25:02 w54 phosphor-mapper[11923]: org.openbmc.managers.System discovery failure on /
Sep 04 14:25:02 w54 phosphor-mapper[11923]: Traceback (most recent call last):
Sep 04 14:25:02 w54 phosphor-mapper[11923]: File "/usr/lib/python2.7/site-packages/obmc/mapper/server.py", line 65, in __call__
Sep 04 14:25:02 w54 phosphor-mapper[11923]: self._find_interfaces(path)
Sep 04 14:25:02 w54 phosphor-mapper[11923]: File "/usr/lib/python2.7/site-packages/obmc/mapper/server.py", line 162, in _find_interfaces
Sep 04 14:25:02 w54 phosphor-mapper[11923]: obj = conn.get_object(service, path, introspect=False)
Sep 04 14:25:02 w54 phosphor-mapper[11923]: File "/usr/lib/python2.7/site-packages/dbus/bus.py", line 241, in get_object
Sep 04 14:25:02 w54 phosphor-mapper[11923]: follow_name_owner_changes=follow_name_owner_changes)
Sep 04 14:25:02 w54 phosphor-mapper[11923]: File "/usr/lib/python2.7/site-packages/dbus/proxies.py", line 248, in __init__
Sep 04 14:25:02 w54 phosphor-mapper[11923]: self._named_service = conn.activate_name_owner(bus_name)
Sep 04 14:25:02 w54 phosphor-mapper[11923]: File "/usr/lib/python2.7/site-packages/dbus/bus.py", line 180, in activate_name_owner
Sep 04 14:25:02 w54 phosphor-mapper[11923]: self.start_service_by_name(bus_name)
Sep 04 14:25:02 w54 phosphor-mapper[11923]: File "/usr/lib/python2.7/site-packages/dbus/bus.py", line 278, in start_service_by_name
Sep 04 14:25:02 w54 phosphor-mapper[11923]: 'su', (bus_name, flags)))
Sep 04 14:25:02 w54 phosphor-mapper[11923]: File "/usr/lib/python2.7/site-packages/dbus/connection.py", line 651, in call_blocking
Sep 04 14:25:02 w54 phosphor-mapper[11923]: message, timeout)
Sep 04 14:25:02 w54 phosphor-mapper[11923]: DBusException: org.freedesktop.DBus.Error.ServiceUnknown: The name org.openbmc.managers.System was not provided by any .service files
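The traceback shows the mapper's code path: get_object(..., introspect=False) triggers activate_name_owner(), which calls StartServiceByName and raises ServiceUnknown because the legacy org.openbmc names have no D-Bus .service activation files (they are started by systemd). A minimal sketch of that retry-on-discovery-failure pattern, with illustrative names and retry counts:

```python
# Sketch of the retry behaviour around the mapper's get_object() call.
# Names, retry count, and delay are illustrative; the real logic lives
# in obmc/mapper/server.py.
import time
import dbus

RETRYABLE = (
    'org.freedesktop.DBus.Error.ServiceUnknown',
    'org.freedesktop.DBus.Error.NoReply',
    'org.freedesktop.DBus.Error.TimedOut',
)

def get_object_with_retry(conn, service, path, retries=5, delay=1.0):
    """Mirror the 'discovery failure ... - retry' messages above."""
    for _ in range(retries):
        try:
            return conn.get_object(service, path, introspect=False)
        except dbus.exceptions.DBusException as e:
            if e.get_dbus_name() not in RETRYABLE:
                raise
            print('%s discovery failure on %s - retry' % (service, path))
            time.sleep(delay)
    # Final attempt without swallowing the exception, matching the
    # non-retry failure logged above.
    return conn.get_object(service, path, introspect=False)

if __name__ == '__main__':
    bus = dbus.SystemBus()
    get_object_with_retry(bus, 'org.openbmc.managers.System', '/org')
```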
Hit this a few more times over the last few days. One interesting thing when it hits:
# uptime
19:20:13 up 7 min, load average: 19.17, 7.81, 3.02
Something seems to be hammering the CPU, which could be what's leading to all the service timeouts. It's tough to tell; there was nothing obvious in "top" showing what was taking all the cycles.
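One way to catch the culprit next time, rather than eyeballing "top", is a small watcher that logs the load average and dumps a per-process CPU-time snapshot from /proc when the 1-minute load spikes. A rough sketch; the threshold and interval values are arbitrary:

```python
# Rough debugging aid: sample the load average and, when the 1-minute
# load crosses a threshold, dump the processes with the most CPU time.
# Threshold/interval/TOP_N values are arbitrary picks for illustration.
import os
import time

THRESHOLD = 10.0   # 1-minute load average that counts as "thrashing"
INTERVAL = 5       # seconds between samples
TOP_N = 10

def cpu_times():
    """Return [(cpu_ticks, pid, comm)] for all processes in /proc."""
    procs = []
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/stat' % pid) as f:
                data = f.read()
        except IOError:
            continue  # process exited between listdir() and open()
        rpar = data.rfind(')')
        comm = data[data.find('(') + 1:rpar]
        fields = data[rpar + 2:].split()
        utime, stime = int(fields[11]), int(fields[12])  # stat fields 14/15
        procs.append((utime + stime, pid, comm))
    return sorted(procs, reverse=True)

while True:
    load1, load5, load15 = os.getloadavg()
    print('load average: %.2f %.2f %.2f' % (load1, load5, load15))
    if load1 > THRESHOLD:
        for ticks, pid, comm in cpu_times()[:TOP_N]:
            print('  %8d ticks  pid %-6s %s' % (ticks, pid, comm))
    time.sleep(INTERVAL)
```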
Now we see this on master too for the recent build https://openpower.xyz/job/openbmc-build/1649/
"v2.3-284-g95949a1"
When a system gets into the failed state, running "systemctl disable systemd-coredump@.service" seems to make the CPU thrashing stop.
One theory is that some early (and critical) service hits an error or timeout, which causes a cascading effect of core dumps from the other services; that in turn causes the systemd-coredump service to start hammering the filesystem and CPU, which just perpetuates the CPU thrashing and service crashing.
The recent addition of REST server logging and the BPF logging issue may be contributing to the initial failure by adding stress and CPU cycles to the systemd-journald process. Both have fixes in flight.
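A quick way to test the cascade theory on a failed boot is to count what systemd-coredump has actually captured, and for which executables. A sketch along those lines, assuming coredumpctl is in the image (the column parsing is approximate):

```python
# Sketch: summarize captured core dumps to see whether one early crash
# is fanning out into many.  Assumes coredumpctl is present; the column
# layout parsing below is approximate.
import subprocess
from collections import Counter

try:
    out = subprocess.check_output(['coredumpctl', 'list', '--no-pager'])
except (OSError, subprocess.CalledProcessError):
    # coredumpctl exits non-zero when there are no core dumps (or it
    # isn't installed), which is itself useful data here.
    out = b''

exes = Counter()
for line in out.decode('utf-8', 'replace').splitlines()[1:]:  # skip header
    cols = line.split()
    if cols:
        exes[cols[-1]] += 1  # last column is the executable path

for exe, count in exes.most_common():
    print('%3d core dumps from %s' % (count, exe))
```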
Some mapper data from this failure:
Sep 25 19:14:42 witherspoon phosphor-mapper[1063]: ObjectMapper startup complete. Discovery in progress...
Sep 25 19:15:06 witherspoon phosphor-mapper[1063]: org.freedesktop.systemd1 discovery failure on / - retry
Sep 25 19:15:31 witherspoon phosphor-mapper[1063]: ObjectMapper discovery complete
Sep 25 19:19:50 witherspoon phosphor-mapper[1254]: ObjectMapper startup complete. Discovery in progress...
Sep 25 19:20:23 witherspoon phosphor-mapper[1254]: xyz.openbmc_project.Network discovery failure on / - retry
Sep 25 19:20:27 witherspoon phosphor-mapper[1254]: xyz.openbmc_project.Control.Host discovery failure on / - retry
Sep 25 19:20:58 witherspoon phosphor-mapper[1254]: xyz.openbmc_project.Network discovery failure on /
Sep 25 19:20:58 witherspoon phosphor-mapper[1254]: DBusException: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Sep 25 19:20:58 witherspoon phosphor-mapper[1254]: Traceback (most recent call last):
Sep 25 19:20:58 witherspoon phosphor-mapper[1254]: File "/usr/lib/python2.7/site-packages/dbus/connection.py", line 607, in msg_reply_handler
Sep 25 19:20:58 witherspoon phosphor-mapper[1254]: *message.get_args_list()))
We don't see this on hardware anymore, but we do still see it intermittently in our Romulus QEMU runs. Captured the first service to fail after mapper discovery completed - xyz.openbmc_project.Syslog.Config.service
Oct 22 01:36:19 romulus avahi-daemon[913]: Service "romulus" (/services/ssh.service) successfully established.
Oct 22 01:36:19 romulus avahi-daemon[913]: Service "romulus" (/services/sftp-ssh.service) successfully established.
Oct 22 01:36:20 romulus systemd[1]: Started Phosphor Console Muxer listening on device /dev/ttyVUART0.
Oct 22 01:36:19 romulus avahi-daemon[913]: Service "phosphor_rest" (/services/phosphor_rest.service) successfully established.
Oct 22 01:36:20 romulus systemd[1]: Starting Phosphor Chassis0 Control...
Oct 22 01:36:20 romulus avahi-daemon[913]: Service "obmc_console" (/services/obmc_console.service) successfully established.
Oct 22 01:36:21 romulus hb_settings[949]: Hostboot serial output on VUART
Oct 22 01:36:21 romulus systemd[1]: Started Phosphor sysfs LED controller.
Oct 22 01:36:21 romulus systemd[1]: Starting Phosphor Download Manager...
Oct 22 01:36:22 romulus systemd[1]: Started Phosphor sysfs LED controller.
Oct 22 01:36:22 romulus systemd[1]: Started Daily Cleanup of Temporary Directories.
Oct 22 01:36:22 romulus systemd[1]: Reached target Timers.
Oct 22 01:36:29 romulus nginx[927]: Generating a 2048 bit RSA private key
Oct 22 01:36:30 romulus nslcd[966]: version 0.9.8 starting
Oct 22 01:36:30 romulus nslcd[966]: accepting connections
Oct 22 01:36:31 romulus systemd[1]: Started System Logging Service.
Oct 22 01:36:31 romulus button_power.exe[933]: GPIO Lookup: POWER_BUTTON = 347,both
Oct 22 01:36:31 romulus dbus-daemon[909]: [system] Failed to activate service 'org.freedesktop.hostname1': timed out (service_start_timeout=25000ms)
Oct 22 01:36:32 romulus systemd-networkd[697]: Could not set hostname: Method call timed out
Oct 22 01:36:32 romulus button_reset.exe[952]: GPIO Lookup: RESET_BUTTON = 353,both
Oct 22 01:36:32 romulus power_control.exe[920]: Power GPIO power good input: SYS_PWROK_BUFF
Oct 22 01:36:32 romulus power_control.exe[920]: Power GPIO 2 power_up outputs
Oct 22 01:36:32 romulus power_control.exe[920]: Power GPIO power_up[0] = SOFTWARE_PGOOD active HIGH
Oct 22 01:36:32 romulus power_control.exe[920]: Power GPIO power_up[1] = BMC_POWER_UP active HIGH
Oct 22 01:36:32 romulus power_control.exe[920]: Power GPIO 0 reset outputs
Oct 22 01:36:32 romulus power_control.exe[920]: Power GPIO 0 pci reset outputs
Oct 22 01:36:32 romulus power_control.exe[920]: GPIO Lookup: SYS_PWROK_BUFF = 306,in
Oct 22 01:36:32 romulus power_control.exe[920]: GPIO Lookup: SOFTWARE_PGOOD = 417,out
Oct 22 01:36:32 romulus power_control.exe[920]: GPIO Lookup: BMC_POWER_UP = 305,out
Oct 22 01:36:32 romulus power_control.exe[920]: Pgood state: 0
Oct 22 01:36:33 romulus systemd[1]: Started LDAP daemon.
Oct 22 01:36:33 romulus systemd[1]: Started Name Service Cache Daemon.
Oct 22 01:36:34 romulus systemd[1]: Started Permit User Sessions.
Oct 22 01:36:35 romulus systemd[1]: Started Hostboot Boot Settings.
Oct 22 01:36:35 romulus systemd[1]: Started Clear one time boot overrides.
Oct 22 01:36:37 romulus systemd[1]: Started Serial Getty on ttyS4.
Oct 22 01:36:37 romulus systemd[1]: Reached target Login Prompts.
Oct 22 01:36:39 romulus nginx[927]: ...........+++++
Oct 22 01:36:46 romulus nginx[927]: .............+++++
Oct 22 01:36:46 romulus nginx[927]: writing new private key to '/etc/ssl/certs/nginx/cert.pem'
Oct 22 01:36:46 romulus login[982]: PAM unable to dlopen(/lib/security/pam_ipmicheck.so): /lib/security/pam_ipmicheck.so: cannot open shared object file: No such file or directory
Oct 22 01:36:46 romulus login[982]: PAM adding faulty module: /lib/security/pam_ipmicheck.so
Oct 22 01:36:46 romulus nginx[927]: -----
Oct 22 01:36:47 romulus login[982]: PAM unable to dlopen(/lib/security/pam_ipmisave.so): /lib/security/pam_ipmisave.so: cannot open shared object file: No such file or directory
Oct 22 01:36:47 romulus login[982]: PAM adding faulty module: /lib/security/pam_ipmisave.so
Oct 22 01:36:47 romulus login[982]: PAM unable to dlopen(/lib/security/pam_systemd.so): /lib/security/pam_systemd.so: cannot open shared object file: No such file or directory
Oct 22 01:36:47 romulus login[982]: PAM adding faulty module: /lib/security/pam_systemd.so
Oct 22 01:36:47 romulus rngd[926]: too many FIPS failures, disabling entropy source
Oct 22 01:36:47 romulus rngd[926]: No entropy sources working, exiting rngd
Oct 22 01:36:48 romulus login[982]: pam_lastlog(login:session): file /var/log/lastlog created
Oct 22 01:36:48 romulus login[982]: pam_unix(login:session): session opened for user root by LOGIN(uid=0)
Oct 22 01:36:48 romulus login[985]: ROOT LOGIN on '/dev/ttyS4'
Oct 22 01:36:50 romulus nginx[984]: nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
Oct 22 01:36:50 romulus chassis_control.py[955]: Running ChassisControlService
Oct 22 01:36:50 romulus nginx[984]: nginx: configuration file /etc/nginx/nginx.conf test is successful
Oct 22 01:36:51 romulus download_manager.py[957]: Running Download Manager
Oct 22 01:36:51 romulus systemd[1]: Started The NGINX HTTP and reverse proxy server.
Oct 22 01:36:52 romulus phosphor-mapper[936]: ObjectMapper startup complete. Discovery in progress...
Oct 22 01:36:53 romulus avahi-daemon[924]: ../../avahi-0.7/avahi-daemon/chroot.c: open() failed: No such file or directory
Oct 22 01:36:53 romulus avahi-daemon[913]: Failed to open /usr/share/dbus-1/interfaces/org.freedesktop.Avahi.Server.xml: Invalid argument
Oct 22 01:36:53 romulus avahi-daemon[913]: Failed to load introspection data.
Oct 22 01:36:56 romulus systemd-udevd[572]: seq 620 '/devices/platform/ahb/ahb:apb/1e6e9000.adc/iio:device0' is taking a long time
Oct 22 01:36:56 romulus systemd-udevd[572]: seq 636 '/devices/platform/ahb/ahb:apb/1e786000.pwm-tacho-controller/hwmon/hwmon0' is taking a long time
Oct 22 01:36:58 romulus systemd-udevd[572]: seq 696 '/devices/platform/iio-hwmon-battery/hwmon/hwmon1' is taking a long time
Oct 22 01:36:58 romulus sh[906]: Public key portion is:
Oct 22 01:36:58 romulus sh[906]: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCs3EnMRjNEZxDXzboT45iFs37dloQFERVRFuCz4eWYriYqX3bbunYvOFkVHqSn5Yu76r1TDcz8AJl8XWuRDZ03ib9tpZWDdPtJNDUhVjyZuz2INd8vFY3smZCOMWbya1Atbo9blKFdrF+pbiyoo80oT7qsHaZ70gFOsGpC80+WlwRyQsuqNLXVEfkNTjrsOxdWUuj7LvTieUklL3Acax0EoXTxPsC+V4a9q+OYE+S6uLOmUEyD7J2BillrGUT2kFeURY78SHevxt3CsoZVtySOLLe41qyTJ39T7+EsGPf+ddOhnUDkFhCrxiYJGde7GyAxigvWhXArQqe1+6EGaO0j root@romulus
Oct 22 01:36:58 romulus sh[906]: Fingerprint: md5 48:06:ef:00:09:01:cf:48:58:c9:d8:ac:f7:ef:19:e6
Oct 22 01:36:58 romulus systemd[1]: Started SSH Key Generation.
Oct 22 01:37:00 romulus phosphor-mapper[936]: ObjectMapper discovery complete
Oct 22 01:37:16 romulus systemd[1]: Created slice system-dropbear.slice.
Oct 22 01:37:16 romulus systemd[1]: Started SSH Per-Connection Server (172.17.0.3:58158).
Oct 22 01:37:16 romulus dropbear[993]: Child connection from ::ffff:172.17.0.3:58158
Oct 22 01:37:17 romulus dropbear[993]: PAM password auth succeeded for 'root' from ::ffff:172.17.0.3:58158
Oct 22 01:37:35 romulus systemd[1]: xyz.openbmc_project.Syslog.Config.service: Start operation timed out. Terminating.
Oct 22 01:37:36 romulus systemd[1]: xyz.openbmc_project.Syslog.Config.service: Failed with result 'timeout'.
Oct 22 01:37:36 romulus systemd[1]: Failed to start Rsyslog config updater.
Oct 22 01:37:37 romulus systemd[1]: xyz.openbmc_project.Syslog.Config.service: Service hold-off time over, scheduling restart.
Oct 22 01:37:37 romulus systemd[1]: xyz.openbmc_project.Syslog.Config.service: Scheduled restart job, restart counter is at 1.
Oct 22 01:37:37 romulus systemd[1]: Stopped Rsyslog config updater.
Oct 22 01:37:37 romulus systemd[1]: Starting Rsyslog config updater...
Oct 22 01:37:39 romulus systemd[1]: xyz.openbmc_project.Logging.service: Start operation timed out. Terminating.
Oct 22 01:37:39 romulus systemd[1]: xyz.openbmc_project.Logging.service: Failed with result 'timeout'.
Oct 22 01:37:39 romulus systemd[1]: Failed to start Phosphor Log Manager.
Oct 22 01:37:39 romulus systemd[1]: Dependency failed for IBM Log Manager.
Oct 22 01:37:39 romulus systemd[1]: com.ibm.Logging.service: Job com.ibm.Logging.service/start failed with result 'dependency'.
Oct 22 01:37:39 romulus systemd[1]: xyz.openbmc_project.Inventory.Manager.service: Start operation timed out. Terminating.
Oct 22 01:37:39 romulus systemd[1]: xyz.openbmc_project.Inventory.Manager.service: Failed with result 'timeout'.
Oct 22 01:37:39 romulus systemd[1]: Failed to start Phosphor Inventory Manager.
Oct 22 01:37:39 romulus systemd[1]: avahi-daemon.service: Start operation timed out. Terminating.
Oct 22 01:37:39 romulus avahi-daemon[913]: Got SIGTERM, quitting.
A change went in recently to disable core dumps, which were causing this problem to snowball. Managed to get some FFDC from a failure - https://openpower.xyz/job/openbmc-test-qemu-ci/5848/artifact/logs/20181029144041159347_TestsTestRestInterfaces/20181029144041159347_VerifyRESTPutMessageJSONCompliant/
I'm not sure if this is a symptom or a cause, but dbus-daemon runs into issues activating systemd units when this failure hits:
Oct 29 14:26:15 romulus dbus-daemon[916]: [system] Failed to activate service 'org.freedesktop.hostname1': timed out (service_start_timeout=25000ms)
The going theory is that this is a performance-related issue. Recent changes have resulted in more applications being started during the initial boot of the BMC. Also, when booting QEMU, it's treated as a genesis IPL, so things like nginx and dropbear run very CPU-intensive workloads to generate their keys.
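As a rough sanity check of how expensive that genesis key/certificate generation is on the emulated CPU, the nginx certificate command from the bootchart data below can be timed in isolation; a hedged sketch that writes to a throwaway temp directory:

```python
# Sketch: time the self-signed certificate generation that nginx does on
# a genesis boot, in isolation, to see how much of the boot window it
# eats on the emulated CPU.  Paths here are throwaway temp files.
import os
import subprocess
import tempfile
import time

tmpdir = tempfile.mkdtemp()
cert = os.path.join(tmpdir, 'cert.pem')

start = time.time()
subprocess.check_call([
    'openssl', 'req', '-x509', '-sha256', '-newkey', 'rsa:2048',
    '-keyout', cert, '-out', cert, '-days', '3650',
    '-subj', '/O=openbmc-project.xyz/CN=localhost', '-nodes'])
print('openssl rsa:2048 cert generation took %.1fs' % (time.time() - start))
```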
https://svgshare.com/i/9DF.svg shows a systemd-bootchart of the boot to standby. Between time 45 and 95, the CPU is pegged (purple) and the CPU wait is pegged (yellow).
The top CPU consumers during this time are:
8883.3ms /usr/sbin/dropbearkey -t rsa -f /etc/dropbear/dropbear_rsa_host_key
8325.0ms python /usr/sbin/phosphor-mapper --service_namespaces=org.openbmc com.ibm xyz.openbmc_project org.open_power --interface_namespaces=org.openbmc com.ibm xyz.openbmc_project org.open_power --service_blacklists=
7503.3ms openssl req -x509 -sha256 -newkey rsa:2048 -keyout /etc/ssl/certs/nginx/cert.pem -out /etc/ssl/certs/nginx/cert.pem -days 3650 -subj /O=openbmc-project.xyz/CN=localhost -nodes
4651.6ms /lib/systemd/systemd-journald
3873.1ms [kworker/0:1
3503.7ms /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
2433.6ms python /usr/sbin/chassis_control.py
2034.8ms rngd
The good news is that the C++ mapper and the removal of nginx should really help out here. I could make some changes to try to artificially serialize some of these services, but that seems to go against the grain of doing things as efficiently as possible.
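If we do experiment with serializing or re-ordering services, systemd already reports per-unit startup times, so before/after comparisons are cheap. A small sketch that just shells out to systemd-analyze (assuming it is present in the image):

```python
# Sketch: print the slowest-starting units on the current boot using
# systemd-analyze, to compare before/after any re-ordering experiments.
# Assumes systemd-analyze is available on the BMC image.
import subprocess

TOP_N = 15

out = subprocess.check_output(['systemd-analyze', 'blame'])
lines = out.decode('utf-8', 'replace').splitlines()

# 'blame' output is already sorted with the slowest unit first.
for line in lines[:TOP_N]:
    print(line.strip())
```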
https://github.com/openbmc/qemu/issues/14 tracks an offshoot of this issue. The primary problem has been fixed by creating a dedicated QEMU job: https://openpower.xyz/view/CI/job/run-qemu-ci/
We've seen this a few times in CI. Something causes the BMC to get hung up during startup. Here's some data from one failure; a BMC reset fixes the issue: