openbmc / openbmc

OpenBMC Distribution
Other
1.93k stars 916 forks source link

System: bmc intermittently hangs on reboot, serial console hangs #1161

Closed sivassrr closed 7 years ago

sivassrr commented 7 years ago

BMC Driver is from Andrew: witherspoon-20170218231615.all.tar PNOR from Uma: /witherspoon_20170218n.pnor

First basic 10 iterations went fine.

Enabled some more test and run power on / power off.

First iteration went fine and 2nd iteration on wards timeout and all remaining test failed. Point of failure was on reboot observed this.

Error Details as follows:

ConnectionError: HTTPSConnectionPool(host='DUMMYIP', port=443): Max retries exceeded with url: /login (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f01817bf210>: Failed to establish a new connection: [Errno 111] Connection refused',)): FAIL != PASS Also teardown failed: Several failures occurred: 
1) NoValidConnectionsError: [Errno None] Unable to connect to port 22 on DUMMYIP 
2) Connection not open 
3) Connection not open 
4) Connection not open
 5) Connection not open 
6) Connection not open
 7) NoValidConnectionsError: [Errno None] Unable to connect to port 22 on DUMMYIP 8) Connection not open [ Message content over the limit has been removed. ] 

ConnectionError: HTTPSConnectionPool(host='DUMMYIP', port=443): Max retries exceeded with url: /org/openbmc/records/events/enumerate (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f018215c750>: Failed to establish a new connection: [Errno 111] Connection refused',)) 24) ConnectionError: HTTPSConnectionPool(host='DUMMYIP', port=443): Max retries exceeded with url: /login (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f018211b310>: Failed to establish a new connection: [Errno 111] Connection refused',)): FAIL != PASS 25) Resolving variable '${resp.status_code}' failed: AttributeError: 'unicode' object has no attribute 'status_code' 26) ConnectionError: HTTPSConnectionPool(host='DUMMYIP', port=443): Max retries exceeded with url: /org/openbmc/inventory/system/enumerate (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f01820b43d0>: Failed to establish a new connection: [Errno 111] Connection refused',))
sivassrr commented 7 years ago

MFGStage2BMC1823161PNOR20170218n.pdf

vishwabmc commented 7 years ago

@sivassrr this typically means that there is no server running on that witherspoon listening on port-22. If you have a serial console, what is it telling ?

sivassrr commented 7 years ago

ping is responding, but no ssh

sivassrr commented 7 years ago

serialoutputw37.txt Here is the screen capture

sivassrr commented 7 years ago

dmesgbmchang.txt journalctlbmchangfailure.txt systemctl.txt

All are from another system (Barreleye) with similar failure. This failed observed in the first iteration itself.

sivassrr commented 7 years ago

BMC level as follows:

root@barreleye:/var/lib/systemd/coredump# cat /etc/os-release
ID="openbmc-phosphor"
NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)"
VERSION="v1.99.2-89"
VERSION_ID="v1.99.2-89-gf327ed7-dirty"
PRETTY_NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) v1.99.2-89"
BUILD_ID="v1.99.2"
root@barreleye:/var/lib/systemd/coredump# cat /etc/version
20170221021702
williamspatrick commented 7 years ago

The console output looks like it was doing a shutdown? Did we run into a case where the shutdown hung?

sivassrr commented 7 years ago

Yes, For both Witherspoon and Barreleye, It was during the warm reset.

But data is there for Witherspoon, Based on the data available from the combined report the failure started from the Test Case: Test events after openbmc reboot. Failure Description as follows:

Full Name:  witherspoonCombinedReport.Tests.Test Eventlog.Test events after openbmc reboot
Documentation:  
This is to test event can be deleted created prior to openbmc reboot Steps: Create event, Power off if ON else no-op Reboot openbmc, Wait for BMC to READY or Powered OFF state Events should exist post reboot, Create two more events, Delete old and new event

===========================================================

This is to test event can be deleted created prior to openbmc reboot Steps: Create event, Power off if ON else no-op Reboot openbmc, Wait for BMC to READY or Powered OFF state Events should exist post reboot, Create two more events, Delete old and new event
Tags:   bmcreboot, Event_Logging
Start / End / Elapsed:  20170220 12:11:10.052 / 20170220 12:12:01.075 / 00:00:51.023
Status:     FAIL (critical)
Message:    ConnectionError: HTTPSConnectionPool(host='DUMMYIP', port=443): Max retries exceeded with url: /xyz/openbmc_project/state/BMC0//attr/RequestedBMCTransition (Caused by ProtocolError('Connection aborted.', error(104, 'Connection reset by peer')))

Also teardown failed:
Several failures occurred:

1) SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer

2) SSHException: SSH session not active

3) SSHException: SSH session not active

4) SSHException: SSH session not active

5) SSHException: SSH session not active

6) SSHException: SSH session not active

7) SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer

8) SSHException: SSH session not active

    [ Message content over the limit has been removed. ]
23) ConnectionError: HTTPSConnectionPool(host='DUMMYIP', port=443): Max retries exceeded with url: /org/openbmc/records/events/enumerate (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f0181765950>: Failed to establish a new connection: [Errno 111] Connection refused',))

24) ConnectionError: HTTPSConnectionPool(host='DUMMYIP', port=443): Max retries exceeded with url: /login (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f018210d450>: Failed to establish a new connection: [Errno 111] Connection refused',)): FAIL != PASS

25) Resolving variable '${resp.status_code}' failed: AttributeError: 'unicode' object has no attribute 'status_code'

26) ConnectionError: HTTPSConnectionPool(host='DUMMYIP', port=443): Max retries exceeded with url: /org/openbmc/inventory/system/enumerate (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f0182103b10>: Failed to establish a new connection: [Errno 111] Connection refused',))

As there is no SSH, We couldnt collect any FFDC data. If any further data needed, kindly look the combined pdf report.

sivassrr commented 7 years ago

Let know what are all the data need to be collected in case if we hit similar failure again. So that we can collect and provide useful debug data.

sivassrr commented 7 years ago

With the latest BMC / PNOR sandbox, Could hit it again. This time failure TC is as follows:

Full Name:  witherspoonCombinedReport.Tests.Test Association.Association Unchanged After Reboot
Documentation:  

**GOOD PATH** This test case is to verify that error log association does not change after open bmc reboot.
Tags:   Association, Association_Unchanged_After_Reboot, bmcreboot
Start / End / Elapsed:  20170224 14:24:03.073 / 20170224 14:24:44.126 / 00:00:41.053
Status:     FAIL (critical)
Message:    ConnectionError: HTTPSConnectionPool(host='DUMMYIP', port=443): Max retries exceeded with url: /xyz/openbmc_project/state/BMC0//attr/RequestedBMCTransition (Caused by ProtocolError('Connection aborted.', error(104, 'Connection reset by peer')))

Also teardown failed:
Several failures occurred:

1) SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer

2) SSHException: SSH session not active
sivassrr commented 7 years ago

Found out openstack were also hitting with this issue some days back. Here is the details. https://bugs.launchpad.net/grenade/+bug/1349617 Some of the important information from the above link as follows:

4. The error “Exception: Error reading SSH protocol banner[Errno 104] Connection reset by peer” implies a corruption of some kind (being that it seems communication wasn’t a problem otherwise, there’s a route) -- this seems consistent with too low of an mtu and data getting truncated “occasionally.” In the log [3], the attempt to connect begins with connection refused (before sshd starts), then changes to authentication failure (likely before the guest has tried to pull the key from the metadata service), then changes to the ssh protocol banner read error. Which sounds like the key was retrieved but it’s corrupted (truncated?).

5. Web search for the same error yielded others having problems with mtu setting in the guest, where they can ping but not ssh with key pair, openstack [4] and cirros [5].

Is it at all possible that there’s an issue with the mtu of the guest sometimes? It would explain the randomness and the protocol banner errors, if data is getting truncated sometimes. I’m not sure where to go from here, I didn’t think anything like this would show up in the guest kernel logs.

https://github.com/paramiko/paramiko/issues/673

sivassrr commented 7 years ago

This could be issue with opening lot of SSH connections and not closing it. We are fixing it in https://gerrit.openbmc-project.xyz/#/c/2690/

At the same time, Can we have a check in BMC for no. of connections threshold and issue a warning etc. Another thing is most of the time it observed during reboot. Is reboot fails because of more no. of connections left it open?

sivassrr commented 7 years ago

w37Witherspoon10iterationDr587.pdf w42Witherspoon5IterationDr587.pdf

sivassrr commented 7 years ago

Good to see from Driver 587 on wards, I could see this on every reboot. [ OK ] Stopped Phosphor Hwmon Poller. [ OK ] Stopped Phosphor Time Manager daemon. [ OK ] Stopped SSH Per-Connection Server (9.3.62.89:49346). [ OK ] Stopped Phosphor Webserver. [ OK ] Stopped SSH Per-Connection Server (9.126.171.88:42108). [ OK ] Stopped SSH Per-Connection Server (9.126.171.88:42114). [ OK ] Stopped SSH Per-Connection Server (9.126.171.88:42116). [ OK ] Stopped SSH Per-Connection Server (9.126.171.88:42118). [ OK ] Stopped SSH Per-Connection Server (9.126.171.88:42120).

sivassrr commented 7 years ago

With Driver 588, Could see this issue again. This time we are ensure that we close all open connections from test as well. Dr588Witherspoon10IterationsCombinedReport.pdf

sivassrr commented 7 years ago

Here is the serial connection output:

[  OK  ] Reached target System Time Synchronized.
[  OK  ] Stopped target Slices.
[  OK  ] Stopped target Paths.
[  OK  ] Stopped Forward Password Requests to Wall Directory Watch.
[  OK  ] Started Dispatch Password Requests to Console Directory Watch.
[  OK  ] Closed D-Bus System Message Bus Socket.
[  OK  ] Reached target System Initialization.
         Starting Network Service...
[FAILED] Failed to start Network Service.
See 'systemctl status systemd-networkd.service' for details.
[  OK  ] Stopped Network Service.
         Starting Network Service...
[FAILED] Failed to start Network Service.
See 'systemctl status systemd-networkd.service' for details.
[  OK  ] Stopped Network Service.
         Starting Network Service...
[FAILED] Failed to start Network Service.
See 'systemctl status systemd-networkd.service' for details.
[  OK  ] Stopped Network Service.
         Starting Network Service...
[FAILED] Failed to start Network Service.
See 'systemctl status systemd-networkd.service' for details.
[  OK  ] Stopped Network Service.
         Starting Network Service...
[FAILED] Failed to start Network Service.
See 'systemctl status systemd-networkd.service' for details.
[  OK  ] Stopped Network Service.
[FAILED] Failed to listen on Network Service Bus Name.
See 'systemctl status org.freedesktop.network1.busname' for details.
[FAILED] Failed to start Network Service.
See 'systemctl status systemd-networkd.service' for details.
[  OK  ] Reached target Network.

Nothing after this.

williamspatrick commented 7 years ago

This indicates the network is not coming up. Might be a kernel problem. We need to get the 'dmesg' output and this:

[FAILED] Failed to listen on Network Service Bus Name.
See 'systemctl status org.freedesktop.network1.busname' for details.
[FAILED] Failed to start Network Service.
See 'systemctl status systemd-networkd.service' for details.
sivassrr commented 7 years ago

dmesgw37.txt

williamspatrick commented 7 years ago

What about the two systemctl status commands?

sivassrr commented 7 years ago

systemdnetwordservice.txt dmesgw33Mar1.txt

root@witherspoon:# cat /etc/os-release
ID="openbmc-phosphor"
NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)"
VERSION="v1.99.3-6"
VERSION_ID="v1.99.3-6-g1603890-dirty"
PRETTY_NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) v1.99.3-6"
BUILD_ID="v1.99.3"
root@witherspoon:# cat /etc/version
20170228221548
sivassrr commented 7 years ago

Yesterday, I dont know the system who used after hang might have rebooted. So today decided to capture the dmesg and systemctl, right after the power cycle.

amboar commented 7 years ago

Seems there's some issue with the network configuration, but I'm not sure it's relevant. From the systemd-networkd log:

Feb 22 06:05:31 witherspoon systemd-networkd[801]: eth0: Starting DHCPv6 client after NDisc timeout failed: Invalid argument
Feb 22 06:05:31 witherspoon systemd-networkd[801]: eth0: Configured

The dmesg output looks normal wrt networking.

sivassrr commented 7 years ago

Had enabled data to collect systemctl, systemd-netword, dmesg and journalctl before the failure hit. In the first iteration itself it failed.

sivassrr commented 7 years ago

openbmc-test-automation1journalctl0.txt openbmc-test-automation1dmesg0.txt openbmc-test-automation1systemctl0.txt openbmc-test-automation1NS0.txt

sivassrr commented 7 years ago

Some highlights from Journalctl:

Feb 22 06:11:59 witherspoon systemd[1]: Started SSH Per-Connection Server (9.126.171.88:39432).
Feb 22 06:11:59 witherspoon dropbear[891]: Failed loading /etc/dropbear/dropbear_dss_host_key
Feb 22 06:11:59 witherspoon dropbear[891]: Failed loading /etc/dropbear/dropbear_ecdsa_host_key

Feb 22 06:12:04 witherspoon dropbear[892]: lastlog_perform_login: Couldn't stat /var/log/lastlog: No such file or directory
Feb 22 06:12:04 witherspoon dropbear[892]: lastlog_openseek: /var/log/lastlog is not a file or directory!
Feb 22 06:12:10 witherspoon dropbear[891]: Exit (root): Disconnect received
Feb 22 06:12:48 witherspoon systemd[1]: Started SSH Per-Connection Server (9.126.171.88:39514).
Feb 22 06:12:48 witherspoon dropbear[898]: Failed loading /etc/dropbear/dropbear_dss_host_key
Feb 22 06:12:48 witherspoon dropbear[898]: Failed loading /etc/dropbear/dropbear_ecdsa_host_key

Feb 22 06:12:51 witherspoon dropbear[899]: lastlog_perform_login: Couldn't stat /var/log/lastlog: No such file or directory
Feb 22 06:12:51 witherspoon dropbear[899]: lastlog_openseek: /var/log/lastlog is not a file or directory!
Feb 22 06:12:56 witherspoon systemd[1]: Started SSH Per-Connection Server (9.126.171.88:39516).
Feb 22 06:12:56 witherspoon dropbear[904]: Failed loading /etc/dropbear/dropbear_dss_host_key
Feb 22 06:12:56 witherspoon dropbear[904]: Failed loading /etc/dropbear/dropbear_ecdsa_host_key
Feb 22 06:12:56 witherspoon dropbear[904]: Child connection from ::ffff:9.126.171.88:39516

Below may not be causing much problem, I guess:
Feb 22 06:05:39 witherspoon avahi-daemon[764]: Failed to open /usr/share/dbus-1/interfaces/org.freedesktop.Avahi.Server.xml: Invalid argument
sivassrr commented 7 years ago

http://svn.dd-wrt.com/ticket/2279 - Related issue

sivassrr commented 7 years ago

Journald is current in /tmp and it is not persistance, down the line there is a plan to move journald to be persistance.

sivassrr commented 7 years ago

This pattern (Ping Succees, SSH/REST - No), Could see anywhere from BMC boot to even runtime.

gkeishin commented 7 years ago

There seems to MAC address related issues on the witherspoon system in the lab.. Uma will take a look at the system and rectify the MAC's .

Because of this the PING/SSH /REST wont respond to the calls made..

sivassrr commented 7 years ago

@gkeishin do you think the failure "Failed loading /etc/dropbear/dropbear_dss_host_key" could be related to the MAC address related issue.

gkeishin commented 7 years ago

Yeah.. that's one probably theory for now until Uma verifies the BMC MAC configuration in the lab...

swe12345 commented 7 years ago

Seeing similar issue with Palmetto after BMC reboot. System is pingable but SSH and REST login is failing. Attached are the serial logs from the system.

Serial console logs: Palmetto_ssh_issue_serial_logs.txt

sivassrr commented 7 years ago

If MAC configuration is the issue then why should serial port also should go freeze? We are hitting multiple times now.

sivassrr commented 7 years ago

We keep seeing this issue in multiple systems and on other flavor as well. Checked with Uma and he thinks it is not related to Mac. He agreed to double check them.

williamspatrick commented 7 years ago

The latest palmetto serial log shows we attempted to reboot the BMC. Is this expected?

If you left it running for a while at that point, it appears that the BMC is not going through a reboot correctly.

williamspatrick commented 7 years ago

@shenki @amboar - This is one of the kernel hang issues we talked about the other day.

sivassrr commented 7 years ago

With latest driver 647 hit with this issue. Here is the serial console output. [ OK ] Removed slice system-org.openbmc.buttons.Power.slice. [ OK ] Stopped Clear one time boot overrides. [ OK ] Stopped Wait for /org/openbmc/settings/host0. Stopping Phosphor Settings Manager... [ OK ] Stopped Wait for /org/openbmc/control/bmc0. [ OK ] Stopped Wait for /org/openbmc/NetworkManager/Interface. [ OK ] Stopped Wait for /org/openbmc/managers/Download. [ OK ] Stopped target Fans Ready. Stopping Phosphor Hwmon Poller... [ OK ] Removed slice system-serial\x2dgetty.slice. Stopping Permit User Sessions... [ OK ] Removed slice system-xyz.openbmc_project.Hwmon.slice. [ OK ] Stopped target Network. Stopping Network Service... [ OK ] Removed slice system-xyz.openbmc_project.led.controller.slice. [ OK ] Removed slice system-obmc\x2dconsole.slice. [ OK ] Removed slice system-org.openbmc.control.Chassis.slice. [ OK ] Removed slice system-org.openbmc.control.Bmc.slice. [ OK ] Removed slice system-org.openbmc.watchdog.Host.slice. [ OK ] Stopped Phosphor Settings Manager. [ OK ] Stopped Phosphor Hwmon Poller. [ OK ] Stopped Permit User Sessions. [ OK ] Stopped Network Service.

rahulmah commented 7 years ago

I also observed the same issue on Barreleye with today's build #649/20170324001736. This is a blocking issue. I have seen this issue 2 times in last three days while running CT.

sivassrr commented 7 years ago
     Stopping Avahi mDNS/DNS-SD Stack...
     Stopping Phosphor Version Software Manager...
     Stopping Phosphor Chassis State Manager...
     Stopping Phosphor Inventory Manager...
     Stopping Phosphor Console Muxer listening on device /dev/ttyVUART0...
     Stopping Phosphor Host State Manager...
     Stopping Phosphor Power Button0...
     Stopping Phosphor Log Manager...
     Stopping Phosphor Sensor Manager...
     Stopping Phoshpor System Manager...
     Stopping Phosphor User Manager...
     Stopping Phosphor Chassis0 Control...
     Stopping Phosphor MBOX Daemon...
     Stopping Phosphor Power0 Control...
     Stopping Phosphor LED Group Management Daemon...
     Stopping Lightweight SLP Server...
     Stopping Phosphor Watchdog0...
     Stopping Phosphor Host0 Checkstop Monitor...
     Stopping Phosphor BIOS Code Update...
     Stopping Phosphor BMC0 Control...
     Stopping Phosphor Fan Control...
     Stopping Phosphor Inband IPMI...
     Stopping Phosphor BMC Code Update...
     Stopping Phosphor Reset Button0...
     Stopping Phosphor Download Manager...
     Stopping Phosphor BMC State Manager...
     Stopping Phosphor IPMI BT DBus Bridge...
     Stopping Phosphor Inventory Manager...
     Stopping Phosphor Network Manager...
     Stopping Phosphor Time Manager daemon...

[ OK ] Removed slice system-vcs_off_poweroff.slice. [ OK ] Stopped target System Time Synchronized. Stopping System Logging Service... [ OK ] Removed slice system-op\x2dreset\x2dset\x2dpower\x2don.slice. [ OK ] Stopped Wait for /org/openbmc/watchdog/host0. Stopping SSH Per-Connection Server (9.126.171.88:57582)... [ OK ] Stopped Phosphor BMC0 Control. [ OK ] Stopped Phosphor Version Software Manager. [ OK ] Stopped Phosphor Console Muxer listening on device /dev/ttyVUART0. [ OK ] Stopped System Logging Service. [ OK ] Stopped Phosphor sysfs LED controller. [ OK ] Stopped Phosphor Network Manager. [ OK ] Stopped Phosphor IPMI BT DBus Bridge. [ OK ] Stopped Phosphor Watchdog0. [ OK ] Stopped Phosphor sysfs LED controller. [ OK ] Stopped Phosphor Hwmon Poller. [ OK ] Stopped Kernel Logging Service. [ OK ] Stopped Phosphor Chassis0 Control. [ OK ] Stopped Phosphor sysfs LED controller. [ OK ] Stopped Phosphor Sensor Manager. [ OK ] Stopped Phosphor Log Manager. [ OK ] Stopped Phosphor Inventory Manager. [ OK ] Stopped Phosphor User Manager. [ OK ] Stopped Phosphor Inventory Manager. [ OK ] Stopped Phosphor Download Manager. [ OK ] Stopped Avahi mDNS/DNS-SD Stack. [ OK ] Stopped Phosphor MBOX Daemon. [ OK ] Stopped Phosphor Event Manager. [ OK ] Stopped Phosphor LED Group Management Daemon. [ OK ] Stopped Serial Getty on ttyS4. [ OK ] Stopped Lightweight SLP Server. [ OK ] Stopped Phoshpor System Manager. [ OK ] Stopped Phosphor Webserver. [ OK ] Stopped Phosphor DNS Settings Cache Manager. [ OK ] Stopped Phosphor Fan Control. [ OK ] Stopped Phosphor BIOS Code Update. [ OK ] Stopped Phosphor BMC Code Update. [ OK ] Stopped Phosphor Inband IPMI. [ OK ] Stopped Phosphor Reset Button0. [ OK ] Stopped Phosphor Host0 Checkstop Monitor. [ OK ] Stopped Phosphor Host State Manager. [ OK ] Stopped Phosphor Power0 Control. [ OK ] Stopped Phosphor Chassis State Manager. [ OK ] Stopped Phosphor Power Button0. [ OK ] Stopped Phosphor BMC State Manager. [ OK ] Stopped Phosphor Time Manager daemon. [ OK ] Stopped SSH Per-Connection Server (9.126.171.88:57574). [ OK ] Stopped SSH Per-Connection Server (9.126.171.88:57582). [ OK ] Stopped SSH Per-Connection Server (9.126.171.88:57584). [ OK ] Stopped SSH Per-Connection Server (9.126.171.88:57610). [ OK ] Stopped Reload mboxd after power off. [ OK ] Removed slice system-mboxd\x2dreload.slice. [ OK ] Stopped Wait for Power0 to turn off. [ OK ] Removed slice system-op\x2dwait\x2dpower\x2doff.slice. [ OK ] Removed slice system-dropbear.slice. [ OK ] Stopped Wait for /org/openbmc/control/power0. [ OK ] Removed slice system-org.openbmc.buttons.Power.slice. [ OK ] Removed slice system-org.openbmc.control.Power.slice. [ OK ] Removed slice system-org.openbmc.control.Checkstop.slice. [ OK ] Stopped Wait for /org/openbmc/managers/System. [ OK ] Removed slice system-org.openbmc.buttons.reset.slice. [ OK ] Stopped Clear one time boot overrides. [ OK ] Stopped Wait for /org/openbmc/settings/host0. Stopping Phosphor Settings Manager... [ OK ] Stopped Wait for /org/openbmc/control/bmc0. [ OK ] Stopped Wait for /org/openbmc/managers/Download. [ OK ] Stopped target Fans Ready. Stopping Phosphor Hwmon Poller... [ OK ] Stopped Wait for /org/openbmc/NetworkManager/Interface. [ OK ] Stopped target Webserver (pre). [ OK ] Stopped target Network. Stopping Network Service... Stopping Permit User Sessions... [ OK ] Removed slice system-serial\x2dgetty.slice. [ OK ] Removed slice system-org.openbmc.control.Chassis.slice. [ OK ] Removed slice system-xyz.openbmc_project.Hwmon.slice. [ OK ] Removed slice system-org.openbmc.watchdog.Host.slice. [ OK ] Removed slice system-xyz.openbmc_project.led.controller.slice. [ OK ] Removed slice system-obmc\x2dconsole.slice. [ OK ] Removed slice system-org.openbmc.control.Bmc.slice. [ OK ] Stopped Phosphor Settings Manager. [ OK ] Stopped Network Service. [ OK ] Stopped Phosphor Hwmon Poller. [ OK ] Stopped Permit User Sessions.

Driver 655 (Latest) - Hanged after BMC reset. Above is the serial output. Ping fine, No SSH, Serial Connection freeze

gkeishin commented 7 years ago

One hang failure.. last seen message on serial console

 slave@01:00: engine[20]: type 1b, version 1, addr 70000 size 8000
 slave@01:00: engine[21]: type 1b, version 1, addr 78000 size 8000
[  OK  ] Removed slice system-op\x2dreset\x2dset\x2dpower\x2don.slice.
         Stopping Phosphor sysfs LED controller...
         Stopping Phosphor sysfs LED controller...
[  OK  ] Stopped target Timers.
[  OK  ] Removed slice system-op\x2dwait\x2dpower\x2doff.slice.
[  OK  ] Stopped target Chassis0 (Start).
[  OK  ] Stopped Start Watchdog0.
[  OK  ] Stopped target Chassis0 (Power On).
         Stopping SSH Per-Connection Server (9.41.165.94:47800)...
         Stopping SSH Per-Connection Server (9.41.165.94:48182)...
[  OK  ] Removed slice system-fsi\x2dbind.slice.
[  OK  ] Removed slice system-obmc\x2dstart\x2dwatchdog.slice.
[  OK  ] Stopped Wait for /org/openbmc/watchdog/host0.
[  OK  ] Removed slice system-vcs_off_poweroff.slice.
[  OK  ] Stopped Wait for /xyz/openbmc_project/state/chassis0.
[  OK  ] Stopped Daily Cleanup of Temporary Directories.
         Stopping Phosphor sysfs LED controller...
[  OK  ] Removed slice system-mboxd\x2dreload.slice.
         Stopping Phosphor Webserver...
[  OK  ] Removed slice system-op\x2dreset\x2dpgood\x2dcheck.slice.
[  OK  ] Stopped target Host0 (Started).
[  OK  ] Stopped target Start Host0.
[  OK  ] Stopped Start host0 SBE.
[  OK  ] Stopped target Start Host0 (Pre).
[  OK  ] Stopped Set fans to a default speed.
[  OK  ] Removed slice system-fan\x2ddefault\x2dspeed.slice.
[  OK  ] Stopped target Restore Last State.
         Stopping Phosphor Hwmon Poller...
[  OK  ] Removed slice system-op\x2dpower\x2dstop.slice.
[  OK  ] Removed slice system-getty.slice.
[  OK  ] Stopped Wait for /org/openbmc/control/chassis0.
[  OK  ] Removed slice system-op\x2dreset\x2dchassis\x2don.slice.
[  OK  ] Stopped Wait for /org/openbmc/control/flash/bios.
[  OK  ] Removed slice system-start_host.slice.
[  OK  ] Stopped target System Time Synchronized.
[  OK  ] Stopped Turn VCS on.
[  OK  ] Stopped Run VCS workaround on host0.
[  OK  ] Stopped Scan FSI devices.
[  OK  ] Removed slice system-fsi\x2dscan.slice.
[  OK  ] Removed slice system-vcs_on.slice.
[  OK  ] Removed slice system-vcs_workaround.slice.
[  OK  ] Removed slice system-obmc\x2dstop\x2dwatchdog.slice.
[  OK  ] Stopped PCIE Slot Presence Detection For Chassis0.
[  OK  ] Removed slice system-pcie\x2dslot\x2ddetect.slice.
[  OK  ] Stopped target Power0 (On).
[  OK  ] Stopped target Power0 On.
[  OK  ] Stopped Wait for Power0 to turn on.
[  OK  ] Removed slice system-op\x2dwait\x2dpower\x2don.slice.
[  OK  ] Stopped Start Power0.
[  OK  ] Stopped target Fan Control.
[  OK  ] Removed slice system-op\x2dpower\x2dstart.slice.
[  OK  ] Stopped target Power0 On (Pre).
[  OK  ] Stopped target Phosphor System - Standby.
         Stopping Phosphor Download Manager...
         Stopping Phosphor Chassis0 Control...
         Stopping Lightweight SLP Server...
         Stopping Phosphor Fan Control...
         Stopping Phosphor Network Manager...
         Stopping Phosphor DNS Settings Cache Manager...
         Stopping Phosphor BMC State Manager...
         Stopping Phosphor Power0 Control...
         Stopping Phosphor Inband IPMI...
         Stopping Phosphor User Manager...
         Stopping Phosphor Inventory Manager...
         Stopping Phosphor LED Group Management Daemon...
         Stopping Phosphor Sensor Manager...
         Stopping Phosphor Power Button0...
         Stopping Phoshpor System Manager...
         Stopping Phosphor Log Manager...
         Stopping Phosphor Event Manager...
         Stopping Phosphor BIOS Code Update...
         Stopping Phosphor Watchdog0...
[  OK  ] Stopped target Multi-User System.
         Stopping Kernel Logging Service...
         Stopping Avahi mDNS/DNS-SD Stack...
         Stopping System Logging S         Stopping Serial Getty on ttyS4...
         Stopping Phosphor Host State Manager...
         Stopping Phosphor MBOX Daemon...
         Stopping Phosphor Host0 Checkstop Monitor...
         Stopping Phosphor Console Muxer listening on device /dev/ttyVUART0...
         Stopping Phosphor IPMI BT DBus Bridge...
         Stopping Phosphor Time Manager daemon...
         Stopping Phosphor BMC0 Control...
         Stopping Phosphor Reset Button0...
         Stopping Phosphor BMC Code Update...
         Stopping Phosphor Version Software Manager...
[  OK  ] Stopped Disable VCS before Witherspoon power on.
[  OK  ] Stopped Enable the AVS bus on VRMs.
[  OK  ] Removed slice system-avsbus\x2denable.slice.
[  OK  ] Stopped Apply voltage overrides to VRMs.
[  OK  ] Stopped Disable the AVS bus on the VRMs.
[  OK  ] Removed slice system-avsbus\x2ddisable.slice.
[  OK  ] Stopped Perform AVS bus workaround on VRMs.
[  OK  ] Removed slice system-avsbus\x2dworkaround.slice.
[  OK  ] Removed slice system-vrm\x2dcontrol.slice.
[  OK  ] Removed slice system-vcs_off.slice.
[  OK  ] Stopped Disable UCD VCS rails.
[  OK  ] Removed slice system-ucd_disable_vcs.slice.
         Stopping Phosphor Chassis State Manager...
[  OK  ] Stopped Wait for /xyz/openbmc_project/inventory.
         Stopping Phosphor Inventory Manager...
[  OK  ] Stopped Phosphor Inventory Manager.
[  OK  ] Stopped Phosphor Version Software Manager.
[  OK  ] Stopped Phosphor BMC0 Control.
[  OK  ] Stopped Phosphor Event Manager.
[  OK  ] Stopped Phosphor IPMI BT DBus Bridge.
[  OK  ] Stopped Phosphor Sensor Manager.
[  OK  ] Stopped Kernel Logging Service.
[  OK  ] Stopped Phosphor LED Group Management Daemon.
[  OK  ] Stopped Phosphor Chassis0 Control.
[  OK  ] Stopped System Logging Service.
[  OK  ] Stopped Phosphor Download Manager.
[  OK  ] Stopped Phosphor User Manager.
[  OK  ] Stopped Phosphor Watchdog0.
[  OK  ] Stopped Phosphor Log Manager.
[  OK  ] Stopped Avahi mDNS/DNS-SD Stack.
[  OK  ] Stopped Phosphor Network Manager.
[  OK  ] Stopped Phosphor Inventory Manager.
[  OK  ] Stopped Lightweight SLP Server.
[  OK  ] Stopped Phosphor sysfs LED controller.
[  OK  ] Stopped Phosphor sysfs LED controller.
[  OK  ] Stopped Phosphor sysfs LED controller.
[  OK  ] Stopped Phosphor Console Muxer listening on device /dev/ttyVUART0.
[  OK  ] Stopped Phosphor MBOX Daemon.
[  OK  ] Stopped Serial Getty on ttyS4.
[  OK  ] Stopped Phoshpor System Manager.
[  OK  ] Stopped Phosphor Webserver.
[  OK  ] Stopped Phosphor DNS Settings Cache Manager.
[  OK  ] Stopped Phosphor Fan Control.
[  OK  ] Stopped Phosphor BMC Code Update.
[  OK  ] Stopped Phosphor BIOS Code Update.
[  OK  ] Stopped Phosphor Host0 Checkstop Monitor.
[  OK  ] Stopped Phosphor Power0 Control.
[  OK  ] Stopped Phosphor BMC State Manager.
[  OK  ] Stopped Phosphor Host State Manager.
[  OK  ] Stopped Phosphor Reset Button0.
[  OK  ] Stopped Phosphor Chassis State Manager.
[  OK  ] Stopped Phosphor Power Button0.
[  OK  ] Stopped Phosphor Inband IPMI.
[  OK  ] Stopped Phosphor Time Manager daemon.
[  OK  ] Stopped Phosphor Hwmon Poller.
[  OK  ] Stopped SSH Per-Connection Server (9.41.165.94:47800).
[  OK  ] Stopped SSH Per-Connection Server (9.41.165.94:48182).
[  OK  ] Removed slice system-dropbear.slice.
[  OK  ] Removed slice system-xyz.openbmc_project.Hwmon.slice.
[  OK  ] Stopped Wait for /org/openbmc/control/power0.
[  OK  ] Stopped Wait for /org/openbmc/settings/host0.
         Stopping Phosphor Settings Manager...
[  OK  ] Stopped Clear one time boot overrides.
[  OK  ] Removed slice system-org.openbmc.buttons.Power.slice.
[  OK  ] Removed slice system-org.openbmc.buttons.reset.slice.
[  OK  ] Removed slice system-org.openbmc.control.Power.slice.
[  OK  ] Removed slice system-org.openbmc.control.Checkstop.slice.
[  OK  ] Stopped Wait for /org/openbmc/managers/System.
[  OK  ] Stopped Wait for /org/openbmc/control/bmc0.
[  OK  ] Stopped Wait for /org/openbmc/managers/Download.
[  OK  ] Stopped target Fans Ready.
         Stopping Phosphor Hwmon Poller...
[  OK  ] Stopped Wait for /org/openbmc/NetworkManager/Interface.
[  OK  ] Stopped target Webserver (pre).
[  OK  ] Removed slice system-serial\x2dgetty.slice.
         Stopping Permit User Sessions...
[  OK  ] Removed slice system-obmc\x2dconsole.slice.
[  OK  ] Removed slice system-xyz.openbmc_project.led.controller.slice.
[  OK  ] Removed slice system-org.openbmc.watchdog.Host.slice.
[  OK  ] Removed slice system-org.openbmc.control.Chassis.slice.
[  OK  ] Removed slice system-org.openbmc.control.Bmc.slice.
[  OK  ] Stopped Phosphor Settings Manager.
[  OK  ] Stopped Phosphor Hwmon Poller.
[  OK  ] Stopped Permit User Sessions.
[  OK  ] Stopped target Remote File Systems.
[  OK  ] Stopped target Network.
         Stopping Network Service...
[  OK  ] Stopped Wait for /org/openbmc/sensors.
[  OK  ] Removed slice system-mapper\x2dwait.slice.
[  OK  ] Stopped target Phosphor Object Mapper.
         Stopping Phosphor DBus Service Discovery Manager...
[  OK  ] Stopped Network Service.
[  OK  ] Stopped Phosphor DBus Service Discovery Manager.
ftgmac100 1e660000.ethernet eth0: [ISR] = 0x57: NO_RXBUF
ftgmac100 1e660000.ethernet eth0: [ISR] = 0x57: NO_RXBUF
ftgmac100 1e660000.ethernet eth0: [ISR] = 0x57: NO_RXBUF
ftgmac100 1e660000.ethernet eth0: [ISR] = 0x57: NO_RXBUF
gkeishin commented 7 years ago

I think this problem went back when the BMC reboot was implemented https://github.com/openbmc/openbmc/issues/1070

so we may want to look at how the BMC reboot is actually forcing BMC to reset @andrewg

geissonator commented 7 years ago

If you search for "systemd reboot.target hangs" you'll see quite a few hits. @saqibkh is working on updating systemd next for us in openbmc, I'm hoping we maybe pull in a fix there.

sivassrr commented 7 years ago

Didnt observe much difference in serial connection output during the normal IPL and BMC hang time.

saqibkh commented 7 years ago

@sivassrr systemd was upgraded to version 232 today. Please test the master branch for related failures.

sivassrr commented 7 years ago

Could able to recreate with driver 682. Here is the combined test report. Last 3 iteration complete failure. At-least till 7th iteration some test cases passed. logwitherspoonCombinedIterations10Report.pdf

sivassrr commented 7 years ago

With the latest BMC driver , And latest PNOR, Still hit the same issue. BMC failed to ssh, Ping success.

root@witherspoon:~# cat /etc/timestamp 20170519212253 root@witherspoon:~# cat /etc/os-release ID="openbmc-phosphor" NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)" VERSION="v1.99.5-60" VERSION_ID="v1.99.5-60-gecfa662" PRETTY_NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) v1.99. 5-60" BUILD_ID="v1.99.5" root@witherspoon:~#

gkeishin commented 7 years ago
•   BUILD ID   : https://openpower.xyz/job/openbmc-build/766/
•   VERSION_ID : v1.99.6-10-g1d22c22
•   BMC Kernel : Linux witherspoon 4.10.5-8736136e756ec127de0bbe2c7e2de683204d4512 #1 Fri May 26 05:22:23 UTC 2017 armv6l GNU/Linux

we are hit on new kernel 4.10

geissonator commented 7 years ago

Here's the same section in the shutdown, but on one that works, anyone see an obvious service that's not stopping correctly in the fail conditions?

[  OK  ] Stopped Phosphor LED Group Management Daemon.
[  OK  ] Stopped OpenBMC Software Update Manager.
[  OK  ] Stopped OpenPower Software Update Manager.
[  OK  ] Stopped Phosphor Settings Manager.
[  OK  ] Stopped Phosphor Hwmon Poller.
[  OK  ] Stopped Permit User Sessions.
[  OK  ] Stopped target Remote File Systems.
[  OK  ] Stopped target Network.
         Stopping Network Service...
[  OK  ] Stopped Wait for /org/openbmc/sensors.
[  OK  ] Removed slice system-mapper\x2dwait.slice.
[  OK  ] Stopped target Phosphor Object Mapper.
         Stopping Phosphor DBus Service Discovery Manager...
[  OK  ] Stopped Phosphor DBus Service Discovery Manager.
[  OK  ] Stopped Network Service.
         Stopping D-Bus System Message Bus...
[  OK  ] Stopped D-Bus System Message Bus.
[  OK  ] Stopped target Basic System.
[  OK  ] Stopped target Paths.
[  OK  ] Stopped Forward Password Requests to Wall Directory Watch.
[  OK  ] Stopped Dispatch Password Requests to Console Directory Watch.
[  OK  ] Stopped target Slices.
[  OK  ] Stopped target Sockets.
[  OK  ] Closed Phosphor Host Console SSH Per-Connection socket.
[  OK  ] Closed Phosphor DBUS Browser socket.
[  OK  ] Closed Phosphor Webserver socket.
[  OK  ] Closed phosphor-ipmi-net.socket.
[  OK  ] Closed dropbear.socket.
[  OK  ] Closed Avahi mDNS/DNS-SD Stack Activation Socket.
[  OK  ] Closed Syslog Socket.
[  OK  ] Closed D-Bus System Message Bus Socket.
[  OK  ] Stopped target System Initialization.
geissonator commented 7 years ago

Spent some time on this last week and today. It seems like the issue here is that systemd is getting hung up on the reboot. Patrick and I found a few interesting reads in this area:

I tried the solution proposed in one but I think our dbus.service is required by basic.target so making it run Before= caused a cyclical dependency.

One theory is that the automated test software is coming in via REST during the reboot process and starting up the REST/network service. I've asked @gkeishin to try a test where he adds a delay between the reboot request and the first SSH/REST interaction as a data point.

geissonator commented 7 years ago

Been running a recreate attempt with the following:

vi /etc/systemd/journald.conf
ForwardToConsole=yes

So far no luck, getting the journal to the console on the fail should really help us debug the problem, but if it's a race condition, this may end up changing the timings.

I also had a bash loop going in parallel doing curl command logins.