Closed sivassrr closed 7 years ago
@sivassrr this typically means that there is no server running on that witherspoon listening on port-22. If you have a serial console, what is it telling ?
ping is responding, but no ssh
serialoutputw37.txt Here is the screen capture
dmesgbmchang.txt journalctlbmchangfailure.txt systemctl.txt
All are from another system (Barreleye) with similar failure. This failed observed in the first iteration itself.
BMC level as follows:
root@barreleye:/var/lib/systemd/coredump# cat /etc/os-release
ID="openbmc-phosphor"
NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)"
VERSION="v1.99.2-89"
VERSION_ID="v1.99.2-89-gf327ed7-dirty"
PRETTY_NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) v1.99.2-89"
BUILD_ID="v1.99.2"
root@barreleye:/var/lib/systemd/coredump# cat /etc/version
20170221021702
The console output looks like it was doing a shutdown? Did we run into a case where the shutdown hung?
Yes, For both Witherspoon and Barreleye, It was during the warm reset.
But data is there for Witherspoon, Based on the data available from the combined report the failure started from the Test Case: Test events after openbmc reboot. Failure Description as follows:
Full Name: witherspoonCombinedReport.Tests.Test Eventlog.Test events after openbmc reboot
Documentation:
This is to test event can be deleted created prior to openbmc reboot Steps: Create event, Power off if ON else no-op Reboot openbmc, Wait for BMC to READY or Powered OFF state Events should exist post reboot, Create two more events, Delete old and new event
===========================================================
This is to test event can be deleted created prior to openbmc reboot Steps: Create event, Power off if ON else no-op Reboot openbmc, Wait for BMC to READY or Powered OFF state Events should exist post reboot, Create two more events, Delete old and new event
Tags: bmcreboot, Event_Logging
Start / End / Elapsed: 20170220 12:11:10.052 / 20170220 12:12:01.075 / 00:00:51.023
Status: FAIL (critical)
Message: ConnectionError: HTTPSConnectionPool(host='DUMMYIP', port=443): Max retries exceeded with url: /xyz/openbmc_project/state/BMC0//attr/RequestedBMCTransition (Caused by ProtocolError('Connection aborted.', error(104, 'Connection reset by peer')))
Also teardown failed:
Several failures occurred:
1) SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer
2) SSHException: SSH session not active
3) SSHException: SSH session not active
4) SSHException: SSH session not active
5) SSHException: SSH session not active
6) SSHException: SSH session not active
7) SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer
8) SSHException: SSH session not active
[ Message content over the limit has been removed. ]
23) ConnectionError: HTTPSConnectionPool(host='DUMMYIP', port=443): Max retries exceeded with url: /org/openbmc/records/events/enumerate (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f0181765950>: Failed to establish a new connection: [Errno 111] Connection refused',))
24) ConnectionError: HTTPSConnectionPool(host='DUMMYIP', port=443): Max retries exceeded with url: /login (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f018210d450>: Failed to establish a new connection: [Errno 111] Connection refused',)): FAIL != PASS
25) Resolving variable '${resp.status_code}' failed: AttributeError: 'unicode' object has no attribute 'status_code'
26) ConnectionError: HTTPSConnectionPool(host='DUMMYIP', port=443): Max retries exceeded with url: /org/openbmc/inventory/system/enumerate (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f0182103b10>: Failed to establish a new connection: [Errno 111] Connection refused',))
As there is no SSH, We couldnt collect any FFDC data. If any further data needed, kindly look the combined pdf report.
Let know what are all the data need to be collected in case if we hit similar failure again. So that we can collect and provide useful debug data.
With the latest BMC / PNOR sandbox, Could hit it again. This time failure TC is as follows:
Full Name: witherspoonCombinedReport.Tests.Test Association.Association Unchanged After Reboot
Documentation:
**GOOD PATH** This test case is to verify that error log association does not change after open bmc reboot.
Tags: Association, Association_Unchanged_After_Reboot, bmcreboot
Start / End / Elapsed: 20170224 14:24:03.073 / 20170224 14:24:44.126 / 00:00:41.053
Status: FAIL (critical)
Message: ConnectionError: HTTPSConnectionPool(host='DUMMYIP', port=443): Max retries exceeded with url: /xyz/openbmc_project/state/BMC0//attr/RequestedBMCTransition (Caused by ProtocolError('Connection aborted.', error(104, 'Connection reset by peer')))
Also teardown failed:
Several failures occurred:
1) SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer
2) SSHException: SSH session not active
Found out openstack were also hitting with this issue some days back. Here is the details. https://bugs.launchpad.net/grenade/+bug/1349617 Some of the important information from the above link as follows:
4. The error “Exception: Error reading SSH protocol banner[Errno 104] Connection reset by peer” implies a corruption of some kind (being that it seems communication wasn’t a problem otherwise, there’s a route) -- this seems consistent with too low of an mtu and data getting truncated “occasionally.” In the log [3], the attempt to connect begins with connection refused (before sshd starts), then changes to authentication failure (likely before the guest has tried to pull the key from the metadata service), then changes to the ssh protocol banner read error. Which sounds like the key was retrieved but it’s corrupted (truncated?).
5. Web search for the same error yielded others having problems with mtu setting in the guest, where they can ping but not ssh with key pair, openstack [4] and cirros [5].
Is it at all possible that there’s an issue with the mtu of the guest sometimes? It would explain the randomness and the protocol banner errors, if data is getting truncated sometimes. I’m not sure where to go from here, I didn’t think anything like this would show up in the guest kernel logs.
This could be issue with opening lot of SSH connections and not closing it. We are fixing it in https://gerrit.openbmc-project.xyz/#/c/2690/
At the same time, Can we have a check in BMC for no. of connections threshold and issue a warning etc. Another thing is most of the time it observed during reboot. Is reboot fails because of more no. of connections left it open?
Good to see from Driver 587 on wards, I could see this on every reboot. [ OK ] Stopped Phosphor Hwmon Poller. [ OK ] Stopped Phosphor Time Manager daemon. [ OK ] Stopped SSH Per-Connection Server (9.3.62.89:49346). [ OK ] Stopped Phosphor Webserver. [ OK ] Stopped SSH Per-Connection Server (9.126.171.88:42108). [ OK ] Stopped SSH Per-Connection Server (9.126.171.88:42114). [ OK ] Stopped SSH Per-Connection Server (9.126.171.88:42116). [ OK ] Stopped SSH Per-Connection Server (9.126.171.88:42118). [ OK ] Stopped SSH Per-Connection Server (9.126.171.88:42120).
With Driver 588, Could see this issue again. This time we are ensure that we close all open connections from test as well. Dr588Witherspoon10IterationsCombinedReport.pdf
Here is the serial connection output:
[ OK ] Reached target System Time Synchronized.
[ OK ] Stopped target Slices.
[ OK ] Stopped target Paths.
[ OK ] Stopped Forward Password Requests to Wall Directory Watch.
[ OK ] Started Dispatch Password Requests to Console Directory Watch.
[ OK ] Closed D-Bus System Message Bus Socket.
[ OK ] Reached target System Initialization.
Starting Network Service...
[FAILED] Failed to start Network Service.
See 'systemctl status systemd-networkd.service' for details.
[ OK ] Stopped Network Service.
Starting Network Service...
[FAILED] Failed to start Network Service.
See 'systemctl status systemd-networkd.service' for details.
[ OK ] Stopped Network Service.
Starting Network Service...
[FAILED] Failed to start Network Service.
See 'systemctl status systemd-networkd.service' for details.
[ OK ] Stopped Network Service.
Starting Network Service...
[FAILED] Failed to start Network Service.
See 'systemctl status systemd-networkd.service' for details.
[ OK ] Stopped Network Service.
Starting Network Service...
[FAILED] Failed to start Network Service.
See 'systemctl status systemd-networkd.service' for details.
[ OK ] Stopped Network Service.
[FAILED] Failed to listen on Network Service Bus Name.
See 'systemctl status org.freedesktop.network1.busname' for details.
[FAILED] Failed to start Network Service.
See 'systemctl status systemd-networkd.service' for details.
[ OK ] Reached target Network.
Nothing after this.
This indicates the network is not coming up. Might be a kernel problem. We need to get the 'dmesg' output and this:
[FAILED] Failed to listen on Network Service Bus Name.
See 'systemctl status org.freedesktop.network1.busname' for details.
[FAILED] Failed to start Network Service.
See 'systemctl status systemd-networkd.service' for details.
What about the two systemctl status commands?
systemdnetwordservice.txt dmesgw33Mar1.txt
root@witherspoon:# cat /etc/os-release
ID="openbmc-phosphor"
NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)"
VERSION="v1.99.3-6"
VERSION_ID="v1.99.3-6-g1603890-dirty"
PRETTY_NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) v1.99.3-6"
BUILD_ID="v1.99.3"
root@witherspoon:# cat /etc/version
20170228221548
Yesterday, I dont know the system who used after hang might have rebooted. So today decided to capture the dmesg and systemctl, right after the power cycle.
Seems there's some issue with the network configuration, but I'm not sure it's relevant. From the systemd-networkd
log:
Feb 22 06:05:31 witherspoon systemd-networkd[801]: eth0: Starting DHCPv6 client after NDisc timeout failed: Invalid argument
Feb 22 06:05:31 witherspoon systemd-networkd[801]: eth0: Configured
The dmesg
output looks normal wrt networking.
Had enabled data to collect systemctl, systemd-netword, dmesg and journalctl before the failure hit. In the first iteration itself it failed.
Some highlights from Journalctl:
Feb 22 06:11:59 witherspoon systemd[1]: Started SSH Per-Connection Server (9.126.171.88:39432).
Feb 22 06:11:59 witherspoon dropbear[891]: Failed loading /etc/dropbear/dropbear_dss_host_key
Feb 22 06:11:59 witherspoon dropbear[891]: Failed loading /etc/dropbear/dropbear_ecdsa_host_key
Feb 22 06:12:04 witherspoon dropbear[892]: lastlog_perform_login: Couldn't stat /var/log/lastlog: No such file or directory
Feb 22 06:12:04 witherspoon dropbear[892]: lastlog_openseek: /var/log/lastlog is not a file or directory!
Feb 22 06:12:10 witherspoon dropbear[891]: Exit (root): Disconnect received
Feb 22 06:12:48 witherspoon systemd[1]: Started SSH Per-Connection Server (9.126.171.88:39514).
Feb 22 06:12:48 witherspoon dropbear[898]: Failed loading /etc/dropbear/dropbear_dss_host_key
Feb 22 06:12:48 witherspoon dropbear[898]: Failed loading /etc/dropbear/dropbear_ecdsa_host_key
Feb 22 06:12:51 witherspoon dropbear[899]: lastlog_perform_login: Couldn't stat /var/log/lastlog: No such file or directory
Feb 22 06:12:51 witherspoon dropbear[899]: lastlog_openseek: /var/log/lastlog is not a file or directory!
Feb 22 06:12:56 witherspoon systemd[1]: Started SSH Per-Connection Server (9.126.171.88:39516).
Feb 22 06:12:56 witherspoon dropbear[904]: Failed loading /etc/dropbear/dropbear_dss_host_key
Feb 22 06:12:56 witherspoon dropbear[904]: Failed loading /etc/dropbear/dropbear_ecdsa_host_key
Feb 22 06:12:56 witherspoon dropbear[904]: Child connection from ::ffff:9.126.171.88:39516
Below may not be causing much problem, I guess:
Feb 22 06:05:39 witherspoon avahi-daemon[764]: Failed to open /usr/share/dbus-1/interfaces/org.freedesktop.Avahi.Server.xml: Invalid argument
http://svn.dd-wrt.com/ticket/2279 - Related issue
Journald is current in /tmp and it is not persistance, down the line there is a plan to move journald to be persistance.
This pattern (Ping Succees, SSH/REST - No), Could see anywhere from BMC boot to even runtime.
There seems to MAC address related issues on the witherspoon system in the lab.. Uma will take a look at the system and rectify the MAC's .
Because of this the PING/SSH /REST wont respond to the calls made..
@gkeishin do you think the failure "Failed loading /etc/dropbear/dropbear_dss_host_key" could be related to the MAC address related issue.
Yeah.. that's one probably theory for now until Uma verifies the BMC MAC configuration in the lab...
Seeing similar issue with Palmetto after BMC reboot. System is pingable but SSH and REST login is failing. Attached are the serial logs from the system.
Serial console logs: Palmetto_ssh_issue_serial_logs.txt
If MAC configuration is the issue then why should serial port also should go freeze? We are hitting multiple times now.
We keep seeing this issue in multiple systems and on other flavor as well. Checked with Uma and he thinks it is not related to Mac. He agreed to double check them.
The latest palmetto serial log shows we attempted to reboot the BMC. Is this expected?
If you left it running for a while at that point, it appears that the BMC is not going through a reboot correctly.
@shenki @amboar - This is one of the kernel hang issues we talked about the other day.
With latest driver 647 hit with this issue. Here is the serial console output. [ OK ] Removed slice system-org.openbmc.buttons.Power.slice. [ OK ] Stopped Clear one time boot overrides. [ OK ] Stopped Wait for /org/openbmc/settings/host0. Stopping Phosphor Settings Manager... [ OK ] Stopped Wait for /org/openbmc/control/bmc0. [ OK ] Stopped Wait for /org/openbmc/NetworkManager/Interface. [ OK ] Stopped Wait for /org/openbmc/managers/Download. [ OK ] Stopped target Fans Ready. Stopping Phosphor Hwmon Poller... [ OK ] Removed slice system-serial\x2dgetty.slice. Stopping Permit User Sessions... [ OK ] Removed slice system-xyz.openbmc_project.Hwmon.slice. [ OK ] Stopped target Network. Stopping Network Service... [ OK ] Removed slice system-xyz.openbmc_project.led.controller.slice. [ OK ] Removed slice system-obmc\x2dconsole.slice. [ OK ] Removed slice system-org.openbmc.control.Chassis.slice. [ OK ] Removed slice system-org.openbmc.control.Bmc.slice. [ OK ] Removed slice system-org.openbmc.watchdog.Host.slice. [ OK ] Stopped Phosphor Settings Manager. [ OK ] Stopped Phosphor Hwmon Poller. [ OK ] Stopped Permit User Sessions. [ OK ] Stopped Network Service.
I also observed the same issue on Barreleye with today's build #649/20170324001736. This is a blocking issue. I have seen this issue 2 times in last three days while running CT.
Stopping Avahi mDNS/DNS-SD Stack...
Stopping Phosphor Version Software Manager...
Stopping Phosphor Chassis State Manager...
Stopping Phosphor Inventory Manager...
Stopping Phosphor Console Muxer listening on device /dev/ttyVUART0...
Stopping Phosphor Host State Manager...
Stopping Phosphor Power Button0...
Stopping Phosphor Log Manager...
Stopping Phosphor Sensor Manager...
Stopping Phoshpor System Manager...
Stopping Phosphor User Manager...
Stopping Phosphor Chassis0 Control...
Stopping Phosphor MBOX Daemon...
Stopping Phosphor Power0 Control...
Stopping Phosphor LED Group Management Daemon...
Stopping Lightweight SLP Server...
Stopping Phosphor Watchdog0...
Stopping Phosphor Host0 Checkstop Monitor...
Stopping Phosphor BIOS Code Update...
Stopping Phosphor BMC0 Control...
Stopping Phosphor Fan Control...
Stopping Phosphor Inband IPMI...
Stopping Phosphor BMC Code Update...
Stopping Phosphor Reset Button0...
Stopping Phosphor Download Manager...
Stopping Phosphor BMC State Manager...
Stopping Phosphor IPMI BT DBus Bridge...
Stopping Phosphor Inventory Manager...
Stopping Phosphor Network Manager...
Stopping Phosphor Time Manager daemon...
[ OK ] Removed slice system-vcs_off_poweroff.slice. [ OK ] Stopped target System Time Synchronized. Stopping System Logging Service... [ OK ] Removed slice system-op\x2dreset\x2dset\x2dpower\x2don.slice. [ OK ] Stopped Wait for /org/openbmc/watchdog/host0. Stopping SSH Per-Connection Server (9.126.171.88:57582)... [ OK ] Stopped Phosphor BMC0 Control. [ OK ] Stopped Phosphor Version Software Manager. [ OK ] Stopped Phosphor Console Muxer listening on device /dev/ttyVUART0. [ OK ] Stopped System Logging Service. [ OK ] Stopped Phosphor sysfs LED controller. [ OK ] Stopped Phosphor Network Manager. [ OK ] Stopped Phosphor IPMI BT DBus Bridge. [ OK ] Stopped Phosphor Watchdog0. [ OK ] Stopped Phosphor sysfs LED controller. [ OK ] Stopped Phosphor Hwmon Poller. [ OK ] Stopped Kernel Logging Service. [ OK ] Stopped Phosphor Chassis0 Control. [ OK ] Stopped Phosphor sysfs LED controller. [ OK ] Stopped Phosphor Sensor Manager. [ OK ] Stopped Phosphor Log Manager. [ OK ] Stopped Phosphor Inventory Manager. [ OK ] Stopped Phosphor User Manager. [ OK ] Stopped Phosphor Inventory Manager. [ OK ] Stopped Phosphor Download Manager. [ OK ] Stopped Avahi mDNS/DNS-SD Stack. [ OK ] Stopped Phosphor MBOX Daemon. [ OK ] Stopped Phosphor Event Manager. [ OK ] Stopped Phosphor LED Group Management Daemon. [ OK ] Stopped Serial Getty on ttyS4. [ OK ] Stopped Lightweight SLP Server. [ OK ] Stopped Phoshpor System Manager. [ OK ] Stopped Phosphor Webserver. [ OK ] Stopped Phosphor DNS Settings Cache Manager. [ OK ] Stopped Phosphor Fan Control. [ OK ] Stopped Phosphor BIOS Code Update. [ OK ] Stopped Phosphor BMC Code Update. [ OK ] Stopped Phosphor Inband IPMI. [ OK ] Stopped Phosphor Reset Button0. [ OK ] Stopped Phosphor Host0 Checkstop Monitor. [ OK ] Stopped Phosphor Host State Manager. [ OK ] Stopped Phosphor Power0 Control. [ OK ] Stopped Phosphor Chassis State Manager. [ OK ] Stopped Phosphor Power Button0. [ OK ] Stopped Phosphor BMC State Manager. [ OK ] Stopped Phosphor Time Manager daemon. [ OK ] Stopped SSH Per-Connection Server (9.126.171.88:57574). [ OK ] Stopped SSH Per-Connection Server (9.126.171.88:57582). [ OK ] Stopped SSH Per-Connection Server (9.126.171.88:57584). [ OK ] Stopped SSH Per-Connection Server (9.126.171.88:57610). [ OK ] Stopped Reload mboxd after power off. [ OK ] Removed slice system-mboxd\x2dreload.slice. [ OK ] Stopped Wait for Power0 to turn off. [ OK ] Removed slice system-op\x2dwait\x2dpower\x2doff.slice. [ OK ] Removed slice system-dropbear.slice. [ OK ] Stopped Wait for /org/openbmc/control/power0. [ OK ] Removed slice system-org.openbmc.buttons.Power.slice. [ OK ] Removed slice system-org.openbmc.control.Power.slice. [ OK ] Removed slice system-org.openbmc.control.Checkstop.slice. [ OK ] Stopped Wait for /org/openbmc/managers/System. [ OK ] Removed slice system-org.openbmc.buttons.reset.slice. [ OK ] Stopped Clear one time boot overrides. [ OK ] Stopped Wait for /org/openbmc/settings/host0. Stopping Phosphor Settings Manager... [ OK ] Stopped Wait for /org/openbmc/control/bmc0. [ OK ] Stopped Wait for /org/openbmc/managers/Download. [ OK ] Stopped target Fans Ready. Stopping Phosphor Hwmon Poller... [ OK ] Stopped Wait for /org/openbmc/NetworkManager/Interface. [ OK ] Stopped target Webserver (pre). [ OK ] Stopped target Network. Stopping Network Service... Stopping Permit User Sessions... [ OK ] Removed slice system-serial\x2dgetty.slice. [ OK ] Removed slice system-org.openbmc.control.Chassis.slice. [ OK ] Removed slice system-xyz.openbmc_project.Hwmon.slice. [ OK ] Removed slice system-org.openbmc.watchdog.Host.slice. [ OK ] Removed slice system-xyz.openbmc_project.led.controller.slice. [ OK ] Removed slice system-obmc\x2dconsole.slice. [ OK ] Removed slice system-org.openbmc.control.Bmc.slice. [ OK ] Stopped Phosphor Settings Manager. [ OK ] Stopped Network Service. [ OK ] Stopped Phosphor Hwmon Poller. [ OK ] Stopped Permit User Sessions.
Driver 655 (Latest) - Hanged after BMC reset. Above is the serial output. Ping fine, No SSH, Serial Connection freeze
One hang failure.. last seen message on serial console
slave@01:00: engine[20]: type 1b, version 1, addr 70000 size 8000
slave@01:00: engine[21]: type 1b, version 1, addr 78000 size 8000
[ OK ] Removed slice system-op\x2dreset\x2dset\x2dpower\x2don.slice.
Stopping Phosphor sysfs LED controller...
Stopping Phosphor sysfs LED controller...
[ OK ] Stopped target Timers.
[ OK ] Removed slice system-op\x2dwait\x2dpower\x2doff.slice.
[ OK ] Stopped target Chassis0 (Start).
[ OK ] Stopped Start Watchdog0.
[ OK ] Stopped target Chassis0 (Power On).
Stopping SSH Per-Connection Server (9.41.165.94:47800)...
Stopping SSH Per-Connection Server (9.41.165.94:48182)...
[ OK ] Removed slice system-fsi\x2dbind.slice.
[ OK ] Removed slice system-obmc\x2dstart\x2dwatchdog.slice.
[ OK ] Stopped Wait for /org/openbmc/watchdog/host0.
[ OK ] Removed slice system-vcs_off_poweroff.slice.
[ OK ] Stopped Wait for /xyz/openbmc_project/state/chassis0.
[ OK ] Stopped Daily Cleanup of Temporary Directories.
Stopping Phosphor sysfs LED controller...
[ OK ] Removed slice system-mboxd\x2dreload.slice.
Stopping Phosphor Webserver...
[ OK ] Removed slice system-op\x2dreset\x2dpgood\x2dcheck.slice.
[ OK ] Stopped target Host0 (Started).
[ OK ] Stopped target Start Host0.
[ OK ] Stopped Start host0 SBE.
[ OK ] Stopped target Start Host0 (Pre).
[ OK ] Stopped Set fans to a default speed.
[ OK ] Removed slice system-fan\x2ddefault\x2dspeed.slice.
[ OK ] Stopped target Restore Last State.
Stopping Phosphor Hwmon Poller...
[ OK ] Removed slice system-op\x2dpower\x2dstop.slice.
[ OK ] Removed slice system-getty.slice.
[ OK ] Stopped Wait for /org/openbmc/control/chassis0.
[ OK ] Removed slice system-op\x2dreset\x2dchassis\x2don.slice.
[ OK ] Stopped Wait for /org/openbmc/control/flash/bios.
[ OK ] Removed slice system-start_host.slice.
[ OK ] Stopped target System Time Synchronized.
[ OK ] Stopped Turn VCS on.
[ OK ] Stopped Run VCS workaround on host0.
[ OK ] Stopped Scan FSI devices.
[ OK ] Removed slice system-fsi\x2dscan.slice.
[ OK ] Removed slice system-vcs_on.slice.
[ OK ] Removed slice system-vcs_workaround.slice.
[ OK ] Removed slice system-obmc\x2dstop\x2dwatchdog.slice.
[ OK ] Stopped PCIE Slot Presence Detection For Chassis0.
[ OK ] Removed slice system-pcie\x2dslot\x2ddetect.slice.
[ OK ] Stopped target Power0 (On).
[ OK ] Stopped target Power0 On.
[ OK ] Stopped Wait for Power0 to turn on.
[ OK ] Removed slice system-op\x2dwait\x2dpower\x2don.slice.
[ OK ] Stopped Start Power0.
[ OK ] Stopped target Fan Control.
[ OK ] Removed slice system-op\x2dpower\x2dstart.slice.
[ OK ] Stopped target Power0 On (Pre).
[ OK ] Stopped target Phosphor System - Standby.
Stopping Phosphor Download Manager...
Stopping Phosphor Chassis0 Control...
Stopping Lightweight SLP Server...
Stopping Phosphor Fan Control...
Stopping Phosphor Network Manager...
Stopping Phosphor DNS Settings Cache Manager...
Stopping Phosphor BMC State Manager...
Stopping Phosphor Power0 Control...
Stopping Phosphor Inband IPMI...
Stopping Phosphor User Manager...
Stopping Phosphor Inventory Manager...
Stopping Phosphor LED Group Management Daemon...
Stopping Phosphor Sensor Manager...
Stopping Phosphor Power Button0...
Stopping Phoshpor System Manager...
Stopping Phosphor Log Manager...
Stopping Phosphor Event Manager...
Stopping Phosphor BIOS Code Update...
Stopping Phosphor Watchdog0...
[ OK ] Stopped target Multi-User System.
Stopping Kernel Logging Service...
Stopping Avahi mDNS/DNS-SD Stack...
Stopping System Logging S Stopping Serial Getty on ttyS4...
Stopping Phosphor Host State Manager...
Stopping Phosphor MBOX Daemon...
Stopping Phosphor Host0 Checkstop Monitor...
Stopping Phosphor Console Muxer listening on device /dev/ttyVUART0...
Stopping Phosphor IPMI BT DBus Bridge...
Stopping Phosphor Time Manager daemon...
Stopping Phosphor BMC0 Control...
Stopping Phosphor Reset Button0...
Stopping Phosphor BMC Code Update...
Stopping Phosphor Version Software Manager...
[ OK ] Stopped Disable VCS before Witherspoon power on.
[ OK ] Stopped Enable the AVS bus on VRMs.
[ OK ] Removed slice system-avsbus\x2denable.slice.
[ OK ] Stopped Apply voltage overrides to VRMs.
[ OK ] Stopped Disable the AVS bus on the VRMs.
[ OK ] Removed slice system-avsbus\x2ddisable.slice.
[ OK ] Stopped Perform AVS bus workaround on VRMs.
[ OK ] Removed slice system-avsbus\x2dworkaround.slice.
[ OK ] Removed slice system-vrm\x2dcontrol.slice.
[ OK ] Removed slice system-vcs_off.slice.
[ OK ] Stopped Disable UCD VCS rails.
[ OK ] Removed slice system-ucd_disable_vcs.slice.
Stopping Phosphor Chassis State Manager...
[ OK ] Stopped Wait for /xyz/openbmc_project/inventory.
Stopping Phosphor Inventory Manager...
[ OK ] Stopped Phosphor Inventory Manager.
[ OK ] Stopped Phosphor Version Software Manager.
[ OK ] Stopped Phosphor BMC0 Control.
[ OK ] Stopped Phosphor Event Manager.
[ OK ] Stopped Phosphor IPMI BT DBus Bridge.
[ OK ] Stopped Phosphor Sensor Manager.
[ OK ] Stopped Kernel Logging Service.
[ OK ] Stopped Phosphor LED Group Management Daemon.
[ OK ] Stopped Phosphor Chassis0 Control.
[ OK ] Stopped System Logging Service.
[ OK ] Stopped Phosphor Download Manager.
[ OK ] Stopped Phosphor User Manager.
[ OK ] Stopped Phosphor Watchdog0.
[ OK ] Stopped Phosphor Log Manager.
[ OK ] Stopped Avahi mDNS/DNS-SD Stack.
[ OK ] Stopped Phosphor Network Manager.
[ OK ] Stopped Phosphor Inventory Manager.
[ OK ] Stopped Lightweight SLP Server.
[ OK ] Stopped Phosphor sysfs LED controller.
[ OK ] Stopped Phosphor sysfs LED controller.
[ OK ] Stopped Phosphor sysfs LED controller.
[ OK ] Stopped Phosphor Console Muxer listening on device /dev/ttyVUART0.
[ OK ] Stopped Phosphor MBOX Daemon.
[ OK ] Stopped Serial Getty on ttyS4.
[ OK ] Stopped Phoshpor System Manager.
[ OK ] Stopped Phosphor Webserver.
[ OK ] Stopped Phosphor DNS Settings Cache Manager.
[ OK ] Stopped Phosphor Fan Control.
[ OK ] Stopped Phosphor BMC Code Update.
[ OK ] Stopped Phosphor BIOS Code Update.
[ OK ] Stopped Phosphor Host0 Checkstop Monitor.
[ OK ] Stopped Phosphor Power0 Control.
[ OK ] Stopped Phosphor BMC State Manager.
[ OK ] Stopped Phosphor Host State Manager.
[ OK ] Stopped Phosphor Reset Button0.
[ OK ] Stopped Phosphor Chassis State Manager.
[ OK ] Stopped Phosphor Power Button0.
[ OK ] Stopped Phosphor Inband IPMI.
[ OK ] Stopped Phosphor Time Manager daemon.
[ OK ] Stopped Phosphor Hwmon Poller.
[ OK ] Stopped SSH Per-Connection Server (9.41.165.94:47800).
[ OK ] Stopped SSH Per-Connection Server (9.41.165.94:48182).
[ OK ] Removed slice system-dropbear.slice.
[ OK ] Removed slice system-xyz.openbmc_project.Hwmon.slice.
[ OK ] Stopped Wait for /org/openbmc/control/power0.
[ OK ] Stopped Wait for /org/openbmc/settings/host0.
Stopping Phosphor Settings Manager...
[ OK ] Stopped Clear one time boot overrides.
[ OK ] Removed slice system-org.openbmc.buttons.Power.slice.
[ OK ] Removed slice system-org.openbmc.buttons.reset.slice.
[ OK ] Removed slice system-org.openbmc.control.Power.slice.
[ OK ] Removed slice system-org.openbmc.control.Checkstop.slice.
[ OK ] Stopped Wait for /org/openbmc/managers/System.
[ OK ] Stopped Wait for /org/openbmc/control/bmc0.
[ OK ] Stopped Wait for /org/openbmc/managers/Download.
[ OK ] Stopped target Fans Ready.
Stopping Phosphor Hwmon Poller...
[ OK ] Stopped Wait for /org/openbmc/NetworkManager/Interface.
[ OK ] Stopped target Webserver (pre).
[ OK ] Removed slice system-serial\x2dgetty.slice.
Stopping Permit User Sessions...
[ OK ] Removed slice system-obmc\x2dconsole.slice.
[ OK ] Removed slice system-xyz.openbmc_project.led.controller.slice.
[ OK ] Removed slice system-org.openbmc.watchdog.Host.slice.
[ OK ] Removed slice system-org.openbmc.control.Chassis.slice.
[ OK ] Removed slice system-org.openbmc.control.Bmc.slice.
[ OK ] Stopped Phosphor Settings Manager.
[ OK ] Stopped Phosphor Hwmon Poller.
[ OK ] Stopped Permit User Sessions.
[ OK ] Stopped target Remote File Systems.
[ OK ] Stopped target Network.
Stopping Network Service...
[ OK ] Stopped Wait for /org/openbmc/sensors.
[ OK ] Removed slice system-mapper\x2dwait.slice.
[ OK ] Stopped target Phosphor Object Mapper.
Stopping Phosphor DBus Service Discovery Manager...
[ OK ] Stopped Network Service.
[ OK ] Stopped Phosphor DBus Service Discovery Manager.
ftgmac100 1e660000.ethernet eth0: [ISR] = 0x57: NO_RXBUF
ftgmac100 1e660000.ethernet eth0: [ISR] = 0x57: NO_RXBUF
ftgmac100 1e660000.ethernet eth0: [ISR] = 0x57: NO_RXBUF
ftgmac100 1e660000.ethernet eth0: [ISR] = 0x57: NO_RXBUF
I think this problem went back when the BMC reboot was implemented https://github.com/openbmc/openbmc/issues/1070
so we may want to look at how the BMC reboot is actually forcing BMC to reset @andrewg
If you search for "systemd reboot.target hangs" you'll see quite a few hits. @saqibkh is working on updating systemd next for us in openbmc, I'm hoping we maybe pull in a fix there.
Didnt observe much difference in serial connection output during the normal IPL and BMC hang time.
@sivassrr systemd was upgraded to version 232 today. Please test the master branch for related failures.
Could able to recreate with driver 682. Here is the combined test report. Last 3 iteration complete failure. At-least till 7th iteration some test cases passed. logwitherspoonCombinedIterations10Report.pdf
With the latest BMC driver , And latest PNOR, Still hit the same issue. BMC failed to ssh, Ping success.
root@witherspoon:~# cat /etc/timestamp 20170519212253 root@witherspoon:~# cat /etc/os-release ID="openbmc-phosphor" NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)" VERSION="v1.99.5-60" VERSION_ID="v1.99.5-60-gecfa662" PRETTY_NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) v1.99. 5-60" BUILD_ID="v1.99.5" root@witherspoon:~#
• BUILD ID : https://openpower.xyz/job/openbmc-build/766/
• VERSION_ID : v1.99.6-10-g1d22c22
• BMC Kernel : Linux witherspoon 4.10.5-8736136e756ec127de0bbe2c7e2de683204d4512 #1 Fri May 26 05:22:23 UTC 2017 armv6l GNU/Linux
we are hit on new kernel 4.10
Here's the same section in the shutdown, but on one that works, anyone see an obvious service that's not stopping correctly in the fail conditions?
[ OK ] Stopped Phosphor LED Group Management Daemon.
[ OK ] Stopped OpenBMC Software Update Manager.
[ OK ] Stopped OpenPower Software Update Manager.
[ OK ] Stopped Phosphor Settings Manager.
[ OK ] Stopped Phosphor Hwmon Poller.
[ OK ] Stopped Permit User Sessions.
[ OK ] Stopped target Remote File Systems.
[ OK ] Stopped target Network.
Stopping Network Service...
[ OK ] Stopped Wait for /org/openbmc/sensors.
[ OK ] Removed slice system-mapper\x2dwait.slice.
[ OK ] Stopped target Phosphor Object Mapper.
Stopping Phosphor DBus Service Discovery Manager...
[ OK ] Stopped Phosphor DBus Service Discovery Manager.
[ OK ] Stopped Network Service.
Stopping D-Bus System Message Bus...
[ OK ] Stopped D-Bus System Message Bus.
[ OK ] Stopped target Basic System.
[ OK ] Stopped target Paths.
[ OK ] Stopped Forward Password Requests to Wall Directory Watch.
[ OK ] Stopped Dispatch Password Requests to Console Directory Watch.
[ OK ] Stopped target Slices.
[ OK ] Stopped target Sockets.
[ OK ] Closed Phosphor Host Console SSH Per-Connection socket.
[ OK ] Closed Phosphor DBUS Browser socket.
[ OK ] Closed Phosphor Webserver socket.
[ OK ] Closed phosphor-ipmi-net.socket.
[ OK ] Closed dropbear.socket.
[ OK ] Closed Avahi mDNS/DNS-SD Stack Activation Socket.
[ OK ] Closed Syslog Socket.
[ OK ] Closed D-Bus System Message Bus Socket.
[ OK ] Stopped target System Initialization.
Spent some time on this last week and today. It seems like the issue here is that systemd is getting hung up on the reboot. Patrick and I found a few interesting reads in this area:
I tried the solution proposed in one but I think our dbus.service is required by basic.target so making it run Before= caused a cyclical dependency.
One theory is that the automated test software is coming in via REST during the reboot process and starting up the REST/network service. I've asked @gkeishin to try a test where he adds a delay between the reboot request and the first SSH/REST interaction as a data point.
Been running a recreate attempt with the following:
vi /etc/systemd/journald.conf
ForwardToConsole=yes
So far no luck, getting the journal to the console on the fail should really help us debug the problem, but if it's a race condition, this may end up changing the timings.
I also had a bash loop going in parallel doing curl command logins.
BMC Driver is from Andrew: witherspoon-20170218231615.all.tar PNOR from Uma: /witherspoon_20170218n.pnor
First basic 10 iterations went fine.
Enabled some more test and run power on / power off.
First iteration went fine and 2nd iteration on wards timeout and all remaining test failed. Point of failure was on reboot observed this.
Error Details as follows: