openbmc / openbmc

OpenBMC Distribution
Other
1.85k stars 904 forks source link

WARN in ncsi_start_channel_monitor code at boot #564

Closed gkeishin closed 7 years ago

gkeishin commented 8 years ago

The base issue https://github.com/openbmc/openbmc/issues/563

Problem : The NCSI crashed and the BMC connection went offline from network.

We also see the serial console hung eventually while running test which may not be related to this.. Was able to get back online via PDU power cycle but its consistently landing into problem.

From Serial console:

barreleye login: ftgmac100 1e660000.ethernet eth0: NCSI interface up random: nonblocking pool is initialized ------------[ cut here ]------------ WARNING: CPU: 0 PID: 4 at /var/lib/jenkins/workspace/openbmc-build-gerrit-trigger-multi/distro/ubuntu/target/barreleye/openbmc/build/tmp/work-shared/barreleye/kernel-source/net/ncsi/ncsi-manage.c:221 ncsi_start_channel_monitor+0x44/0x88 CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 4.7.1-6d93f625315a1e9b51d017a78f8c0938a097ea85 #1 Hardware name: ASpeed SoC Workqueue: events ncsi_dev_work from from from from from from from from from from ---[ end trace 2f3901588ebb5c52 ]--- ftgmac100 1e660000.ethernet eth0: NCSI interface up

Another post PDU cycle log

root@barreleye:~# ------------[ cut here ]------------ WARNING: CPU: 0 PID: 742 at /var/lib/jenkins/workspace/openbmc-build-gerrit-trigger-multi/distro/ubuntu/target/barreleye/openbmc/build/tmp/work-shared/barreleye/kernel-source/net/ncsi/ncsi-manage.c:221 ncsi_start_channel_monitor+0x44/0x88 CPU: 0 PID: 742 Comm: kworker/0:5 Not tainted 4.7.1-6d93f625315a1e9b51d017a78f8c0938a097ea85 #1 Hardware name: ASpeed SoC Workqueue: events ncsi_dev_work from from from from from from from from from from ---[ end trace 98a124f244ab4360 ]--- ftgmac100 1e660000.ethernet eth0: NCSI interface up

gkeishin commented 8 years ago

Since the crash trace is not showing properly above.. attaching the text version of it here

---> NCSI_LOG.txt

shenki commented 8 years ago

WARN at boot time:

barreleye login: ftgmac100 1e660000.ethernet eth0: NCSI interface up
random: nonblocking pool is initialized
------------[ cut here ]------------
WARNING: CPU: 0 PID: 4 at /var/lib/jenkins/workspace/openbmc-build-gerrit-trigger-multi/distro/ubuntu/target/barreleye/openbmc/build/tmp/work-shared/barreleye/kernel-source/net/ncsi/ncsi-manage.c:221 ncsi_start_channel_monitor+0x44/0x88
CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 4.7.1-6d93f625315a1e9b51d017a78f8c0938a097ea85 #1
Hardware name: ASpeed SoC
Workqueue: events ncsi_dev_work
[<c0106cbc>] (unwind_backtrace) from [<c0104f04>] (show_stack+0x10/0x14)
[<c0104f04>] (show_stack) from [<c010ec3c>] (__warn+0xc4/0xf0)
[<c010ec3c>] (__warn) from [<c010ed5c>] (warn_slowpath_null+0x1c/0x24)
[<c010ed5c>] (warn_slowpath_null) from [<c03f4174>] (ncsi_start_channel_monitor+0x44/0x88)
[<c03f4174>] (ncsi_start_channel_monitor) from [<c03f4e08>] (ncsi_configure_channel+0x258/0x2b8)
[<c03f4e08>] (ncsi_configure_channel) from [<c03f5558>] (ncsi_dev_work+0x39c/0x3e8)
[<c03f5558>] (ncsi_dev_work) from [<c011fee4>] (process_one_work+0x1b8/0x2fc)
[<c011fee4>] (process_one_work) from [<c0120318>] (worker_thread+0x2c0/0x3f8)
[<c0120318>] (worker_thread) from [<c012476c>] (kthread+0xd0/0xe8)
[<c012476c>] (kthread) from [<c0102250>] (ret_from_fork+0x14/0x24)
---[ end trace 2f3901588ebb5c52 ]---
ftgmac100 1e660000.ethernet eth0: NCSI interface up

Sep 08 21:21:09 barreleye kernel: Linux version 4.7.1-6d93f625315a1e9b51d017a78f8c0938a097ea85 (jenkins@hudson) (gcc version 5.3.0 (GCC) ) #1 Wed Aug 17 10:17:55 UTC 2016

A second WARN at boot. This time the network interface appears to have been brought down and up a a few times

Sep 08 21:23:00 barreleye systemd[1]: Started Read system/chassis/motherboard EEPROM.
Sep 08 21:23:00 barreleye systemd[1]: Starting Sync system/chassis/motherboard...
Sep 08 21:23:06 barreleye netman.py[675]: Restarting networkd service...
Sep 08 21:23:06 barreleye inarp[695]: updating interface: eth0, [00:01:6c:80:02:78]
Sep 08 21:23:06 barreleye systemd-networkd[691]: eth0: Lost carrier
Sep 08 21:23:06 barreleye inarp[695]: updating interface: eth0, [00:01:6c:80:02:78]
Sep 08 21:23:06 barreleye inarp[695]: updating interface: eth0, [00:01:6c:80:02:78]
Sep 08 21:23:06 barreleye systemd-networkd[691]: eth0: Gained carrier
Sep 08 21:23:06 barreleye systemd-timesyncd[669]: Network configuration changed, trying to establish connection.
Sep 08 21:23:06 barreleye systemd[1]: Stopping Network Service...
Sep 08 21:23:06 barreleye systemd[1]: Closed Phosphor Webserver socket.
Sep 08 21:23:06 barreleye systemd[1]: Stopping Phosphor Webserver socket.
Sep 08 21:23:06 barreleye systemd[1]: Listening on Phosphor Webserver socket.
Sep 08 21:23:06 barreleye systemd[1]: Stopped Network Service.
Sep 08 21:23:06 barreleye systemd[1]: Starting Network Service...
Sep 08 21:23:06 barreleye systemd-timesyncd[669]: Network configuration changed, trying to establish connection.
Sep 08 21:23:06 barreleye systemd-networkd[842]: Enumeration completed
Sep 08 21:23:06 barreleye systemd-timesyncd[669]: Network configuration changed, trying to establish connection.
Sep 08 21:23:06 barreleye systemd-timesyncd[669]: Network configuration changed, trying to establish connection.
Sep 08 21:23:06 barreleye systemd-timesyncd[669]: Network configuration changed, trying to establish connection.
Sep 08 21:23:07 barreleye kernel: ------------[ cut here ]------------
Sep 08 21:23:07 barreleye kernel: WARNING: CPU: 0 PID: 742 at /var/lib/jenkins/workspace/openbmc-build-gerrit-trigger-multi/distro/ubuntu/target/barreleye/openbmc/build/tmp/work-shared/barreleye/kernel-source/net/ncsi/ncsi-manage.c:221 ncsi_start_channel_monitor+0x44/0x88
Sep 08 21:23:07 barreleye kernel: CPU: 0 PID: 742 Comm: kworker/0:5 Not tainted 4.7.1-6d93f625315a1e9b51d017a78f8c0938a097ea85 #1
Sep 08 21:23:07 barreleye kernel: Hardware name: ASpeed SoC
Sep 08 21:23:07 barreleye kernel: Workqueue: events ncsi_dev_work
Sep 08 21:23:07 barreleye kernel: [<c0106cbc>] (unwind_backtrace) from [<c0104f04>] (show_stack+0x10/0x14)
Sep 08 21:23:07 barreleye kernel: [<c0104f04>] (show_stack) from [<c010ec3c>] (__warn+0xc4/0xf0)
Sep 08 21:23:07 barreleye kernel: [<c010ec3c>] (__warn) from [<c010ed5c>] (warn_slowpath_null+0x1c/0x24)
Sep 08 21:23:07 barreleye kernel: [<c010ed5c>] (warn_slowpath_null) from [<c03f4174>] (ncsi_start_channel_monitor+0x44/0x88)
Sep 08 21:23:07 barreleye kernel: [<c03f4174>] (ncsi_start_channel_monitor) from [<c03f4e08>] (ncsi_configure_channel+0x258/0x2b8)
Sep 08 21:23:07 barreleye kernel: [<c03f4e08>] (ncsi_configure_channel) from [<c03f5558>] (ncsi_dev_work+0x39c/0x3e8)
Sep 08 21:23:07 barreleye kernel: [<c03f5558>] (ncsi_dev_work) from [<c011fee4>] (process_one_work+0x1b8/0x2fc)
Sep 08 21:23:07 barreleye kernel: [<c011fee4>] (process_one_work) from [<c0120318>] (worker_thread+0x2c0/0x3f8)
Sep 08 21:23:07 barreleye kernel: [<c0120318>] (worker_thread) from [<c012476c>] (kthread+0xd0/0xe8)
Sep 08 21:23:07 barreleye kernel: [<c012476c>] (kthread) from [<c0102250>] (ret_from_fork+0x14/0x24)
Sep 08 21:23:07 barreleye kernel: ---[ end trace 98a124f244ab4360 ]---
Sep 08 21:23:07 barreleye kernel: ftgmac100 1e660000.ethernet eth0: NCSI interface up
Sep 08 21:23:07 barreleye kernel[679]: ------------[ cut here ]------------
Sep 08 21:23:07 barreleye systemd-timesyncd[669]: Network configuration changed, trying to establish connection.
Sep 08 21:23:07 barreleye kernel[679]: WARNING: CPU: 0 PID: 742 at /var/lib/jenkins/workspace/openbmc-build-gerrit-trigger-multi/distro/ubuntu/target/barreleye/openbmc/build/tmp/work-shared/barreleye/kernel-source/net/ncsi/ncsi-manage.c:221 ncsi_start_channel_monitor+0x44/0x88
Sep 08 21:23:07 barreleye kernel[679]: CPU: 0 PID: 742 Comm: kworker/0:5 Not tainted 4.7.1-6d93f625315a1e9b51d017a78f8c0938a097ea85 #1
Sep 08 21:23:07 barreleye kernel[679]: Hardware name: ASpeed SoC
Sep 08 21:23:07 barreleye kernel[679]: Workqueue: events ncsi_dev_work
Sep 08 21:23:07 barreleye kernel[679]: [<c0106cbc>] (unwind_backtrace) from [<c0104f04>] (show_stack+0x10/0x14)
Sep 08 21:23:07 barreleye kernel[679]: [<c0104f04>] (show_stack) from [<c010ec3c>] (__warn+0xc4/0xf0)
Sep 08 21:23:07 barreleye kernel[679]: [<c010ec3c>] (__warn) from [<c010ed5c>] (warn_slowpath_null+0x1c/0x24)
Sep 08 21:23:07 barreleye kernel[679]: [<c010ed5c>] (warn_slowpath_null) from [<c03f4174>] (ncsi_start_channel_monitor+0x44/0x88)
Sep 08 21:23:07 barreleye kernel[679]: [<c03f4174>] (ncsi_start_channel_monitor) from [<c03f4e08>] (ncsi_configure_channel+0x258/0x2b8)
Sep 08 21:23:07 barreleye kernel[679]: [<c03f4e08>] (ncsi_configure_channel) from [<c03f5558>] (ncsi_dev_work+0x39c/0x3e8)
Sep 08 21:23:07 barreleye kernel[679]: [<c03f5558>] (ncsi_dev_work) from [<c011fee4>] (process_one_work+0x1b8/0x2fc)
Sep 08 21:23:07 barreleye kernel[679]: [<c011fee4>] (process_one_work) from [<c0120318>] (worker_thread+0x2c0/0x3f8)
Sep 08 21:23:07 barreleye kernel[679]: [<c0120318>] (worker_thread) from [<c012476c>] (kthread+0xd0/0xe8)
Sep 08 21:23:07 barreleye kernel[679]: [<c012476c>] (kthread) from [<c0102250>] (ret_from_fork+0x14/0x24)
Sep 08 21:23:07 barreleye kernel[679]: ---[ end trace 98a124f244ab4360 ]---
Sep 08 21:23:07 barreleye kernel[679]: ftgmac100 1e660000.ethernet eth0: NCSI interface up
Sep 08 21:23:08 barreleye sync_inventory_items.py[832]: Rebooting BMC to set uuid
shenki commented 8 years ago

This is not a crash as originally stated. The warning is to tell developers about an unexpected condition in the driver.

I also notice that the interface was up, and then you rebooted the BMC. Was this reboot initiated over the network, or from the serial console?

shenki commented 8 years ago

@gwshan this looks similar to the issues you are chasing.

https://github.com/openbmc/openbmc/issues/512#issuecomment-246270147

gkeishin commented 8 years ago

@shenki First from post BMC code update and second from PDU power cycle reboot

shenki commented 8 years ago

Thanks.

@gwshan this fits the pattern that I saw where I could trigger the WARN though using pflash to update the host pnor.

I think that certain flash related loads are starving the ncsi monitoring kernel thread.

geissonator commented 8 years ago

We had a barreleye (BarE01) just sitting doing nothing, suddenly it didn't ping and I saw this in the console output:

systemd[1]: Started Apply Kernel Variables.
systemd[1]: Started Journal Service.
systemd-journald[407]: Received request to flush runtime journal from PID 1
ftgmac100 1e660000.ethernet eth0: NCSI interface up
random: nonblocking pool is initialized
------------[ cut here ]------------
WARNING: CPU: 0 PID: 406 at /var/lib/jenkins/workspace/openbmc-build-gerrit-trigger-multi/distro/ubuntu/target/barreleye/openbmc/build/tmp/work-shared/barreleye/kernel-source/net/ncsi/ncsi-manage.c:221 ncsi_start_channel_monitor+0x44/0x88
CPU: 0 PID: 406 Comm: kworker/0:2 Not tainted 4.7.3-be2001133835be80ba4657232b823ba2daaac0d6 #1
Hardware name: ASpeed SoC
Workqueue: events ncsi_dev_work
[<c0106cbc>] (unwind_backtrace) from [<c0104f04>] (show_stack+0x10/0x14)
[<c0104f04>] (show_stack) from [<c010eda4>] (__warn+0xc4/0xf0)
[<c010eda4>] (__warn) from [<c010eec4>] (warn_slowpath_null+0x1c/0x24)
[<c010eec4>] (warn_slowpath_null) from [<c03f454c>] (ncsi_start_channel_monitor+0x44/0x88)
[<c03f454c>] (ncsi_start_channel_monitor) from [<c03f51e0>] (ncsi_configure_channel+0x258/0x2b8)
[<c03f51e0>] (ncsi_configure_channel) from [<c03f5930>] (ncsi_dev_work+0x39c/0x3e8)
[<c03f5930>] (ncsi_dev_work) from [<c012004c>] (process_one_work+0x1b8/0x2fc)
[<c012004c>] (process_one_work) from [<c0120480>] (worker_thread+0x2c0/0x3f8)
[<c0120480>] (worker_thread) from [<c01248d4>] (kthread+0xd0/0xe8)
[<c01248d4>] (kthread) from [<c0102250>] (ret_from_fork+0x14/0x24)
---[ end trace f17a720d2ca946dd ]---
ftgmac100 1e660000.ethernet eth0: NCSI interface up

Running fairly new stuff

root@barreleye:~# uname -a Linux barreleye 4.7.3-be2001133835be80ba4657232b823ba2daaac0d6 #1 Wed Sep 14 12:12:05 UTC 2016 armv5tejl GNU/Linux root@barreleye:~# cat /etc/os-release ID="openbmc-phosphor" NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)" VERSION="v1.99.0-114" VERSION_ID="v1.99.0-114-ged52dc1-dirty" PRETTY_NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) v1.99.0-114" BUILD_ID="v1.99.0"

geissonator commented 8 years ago

BarD04 also hit this today. Network stopped working for a bit, and then suddenly came back:

systemd[1]: Started Journal Service.
systemd-journald[410]: Received request to flush runtime journal from PID 1
ftgmac100 1e660000.ethernet eth0: NCSI interface up
random: nonblocking pool is initialized
------------[ cut here ]------------
WARNING: CPU: 0 PID: 401 at /tmp/anoo/obmc/barreleye-master/build/tmp/work-shared/barreleye/kernel-source/net/ncsi/ncsi-manage.c:221 ncsi_start_channel_monitor+0x44/0x88
CPU: 0 PID: 401 Comm: kworker/0:2 Not tainted 4.7.1-6d93f625315a1e9b51d017a78f8c0938a097ea85 #1
Hardware name: ASpeed SoC
Workqueue: events ncsi_dev_work
[<c0106cbc>] (unwind_backtrace) from [<c0104f04>] (show_stack+0x10/0x14)
[<c0104f04>] (show_stack) from [<c010ec3c>] (__warn+0xc4/0xf0)
[<c010ec3c>] (__warn) from [<c010ed5c>] (warn_slowpath_null+0x1c/0x24)
[<c010ed5c>] (warn_slowpath_null) from [<c03f4174>] (ncsi_start_channel_monitor+0x44/0x88)
[<c03f4174>] (ncsi_start_channel_monitor) from [<c03f4e08>] (ncsi_configure_channel+0x258/0x2b8)
[<c03f4e08>] (ncsi_configure_channel) from [<c03f5558>] (ncsi_dev_work+0x39c/0x3e8)
[<c03f5558>] (ncsi_dev_work) from [<c011fee4>] (process_one_work+0x1b8/0x2fc)
[<c011fee4>] (process_one_work) from [<c0120318>] (worker_thread+0x2c0/0x3f8)
[<c0120318>] (worker_thread) from [<c012476c>] (kthread+0xd0/0xe8)
[<c012476c>] (kthread) from [<c0102250>] (ret_from_fork+0x14/0x24)
---[ end trace b366fbedac91fb3c ]---
ftgmac100 1e660000.ethernet eth0: NCSI interface up

root@barreleye:~# uname -a Linux barreleye 4.7.1-6d93f625315a1e9b51d017a78f8c0938a097ea85 #1 Sun Sep 11 10:48:54 CDT 2016 armv5tejl GNU/Linux

gwshan commented 8 years ago

There is one issue I didn't considered when introducing channel monitor. When the IID (Instance ID) in the GLS (Get Link State) command changes from 255 to 1. The BCM5718 never gives response for the command who has 1 for its IID. I already had a patch fixing it and will post it together with other NCSI patches later.

geissonator commented 8 years ago

One interesting tidbit, is once we hit the fail (i.e. you can't ping the BMC externally), if you go into the BMC serial console, and do a "traceroute" to a system within the lab, the ping to the BMC suddenly starts working.

We continue to hit this consistently across all of our systems here in Austin and are eagerly anticipating a fix.

gwshan commented 7 years ago

Copy to internal email exchange to make this complete:

Andrew, thanks a lot for sharing the machines for debugging. it's not surprise to see non-functional network as two BMCs have same MAC addresses. Please ask the network admin to assign the correct (non-conflicting) MAC addresses to the BMCs.

With same MAC address assigned to two BMC, the gateway (switch) is confused: Which BMC I should forward the ingress packet to? So the phenomenon is: one BMC's network works and another BMC's network should stop working.

BarD-04:

ifconfig

eth0 Link encap:Ethernet HWaddr 00:01:6C:80:02:78
inet addr:9.3.23.32 Bcast:9.3.23.255 Mask:255.255.255.0

BarE-0: root@barreleye:~# ifconfig eth0 eth0 Link encap:Ethernet HWaddr 00:01:6C:80:02:78
inet addr:9.3.29.119 Bcast:9.3.29.255 Mask:255.255.255.0

shenki commented 7 years ago

Fixed as of c4fb73e744d4eb7dc845031dec9aa0543652c02f

gkeishin commented 7 years ago
root@witherspoon:~# cat /etc/os-release
ID="openbmc-phosphor"
NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)"
VERSION="v1.99.5-28"
VERSION_ID="v1.99.5-28-g60b4452"
PRETTY_NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) v1.99.5-28"
BUILD_ID="v1.99.5"
root@witherspoon:~# uname -a
Linux witherspoon 4.7.10-15c30d4deb0a01f398230395dc0d9af4be92d856 #1 Thu May 11 07:21:01 UTC 2017 armv6l GNU/Linux
root@witherspoon:~#

seeing here again this one

------------[ cut here ]------------
WARNING: CPU: 0 PID: 847 at /var/lib/jenkins-slave/workspace/openbmc-build/distro/ubuntu/target/witherspoon/openbmc/build/tmp/work-shared/witherspoon/kernel-source/net/ncsi/ncsi-manage.c:240 ncsi_start_channel_monitor+0x40/0x78
CPU: 0 PID: 847 Comm: kworker/0:6 Not tainted 4.7.10-15c30d4deb0a01f398230395dc0d9af4be92d856 #1
Hardware name: ASpeed SoC
Workqueue: events ncsi_dev_work
[<8010723c>] (unwind_backtrace) from [<801054c4>] (show_stack+0x10/0x14)
[<801054c4>] (show_stack) from [<80110be0>] (__warn+0xd8/0xf4)
[<80110be0>] (__warn) from [<80110cf0>] (warn_slowpath_null+0x1c/0x24)
[<80110cf0>] (warn_slowpath_null) from [<80435a34>] (ncsi_start_channel_monitor+0x40/0x78)
[<80435a34>] (ncsi_start_channel_monitor) from [<804366c0>] (ncsi_configure_channel+0x294/0x2d4)
[<804366c0>] (ncsi_configure_channel) from [<80436e88>] (ncsi_dev_work+0x3a8/0x3d8)
[<80436e88>] (ncsi_dev_work) from [<80122008>] (process_one_work+0x18c/0x2f0)
[<80122008>] (process_one_work) from [<8012241c>] (worker_thread+0x280/0x3cc)
[<8012241c>] (worker_thread) from [<801267ec>] (kthread+0xdc/0xf0)
[<801267ec>] (kthread) from [<80102278>] (ret_from_fork+0x14/0x3c)
---[ end trace c8b731184dbdbd8e ]---