Closed gkeishin closed 7 years ago
Since the crash trace is not showing properly above.. attaching the text version of it here
---> NCSI_LOG.txt
WARN at boot time:
barreleye login: ftgmac100 1e660000.ethernet eth0: NCSI interface up
random: nonblocking pool is initialized
------------[ cut here ]------------
WARNING: CPU: 0 PID: 4 at /var/lib/jenkins/workspace/openbmc-build-gerrit-trigger-multi/distro/ubuntu/target/barreleye/openbmc/build/tmp/work-shared/barreleye/kernel-source/net/ncsi/ncsi-manage.c:221 ncsi_start_channel_monitor+0x44/0x88
CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 4.7.1-6d93f625315a1e9b51d017a78f8c0938a097ea85 #1
Hardware name: ASpeed SoC
Workqueue: events ncsi_dev_work
[<c0106cbc>] (unwind_backtrace) from [<c0104f04>] (show_stack+0x10/0x14)
[<c0104f04>] (show_stack) from [<c010ec3c>] (__warn+0xc4/0xf0)
[<c010ec3c>] (__warn) from [<c010ed5c>] (warn_slowpath_null+0x1c/0x24)
[<c010ed5c>] (warn_slowpath_null) from [<c03f4174>] (ncsi_start_channel_monitor+0x44/0x88)
[<c03f4174>] (ncsi_start_channel_monitor) from [<c03f4e08>] (ncsi_configure_channel+0x258/0x2b8)
[<c03f4e08>] (ncsi_configure_channel) from [<c03f5558>] (ncsi_dev_work+0x39c/0x3e8)
[<c03f5558>] (ncsi_dev_work) from [<c011fee4>] (process_one_work+0x1b8/0x2fc)
[<c011fee4>] (process_one_work) from [<c0120318>] (worker_thread+0x2c0/0x3f8)
[<c0120318>] (worker_thread) from [<c012476c>] (kthread+0xd0/0xe8)
[<c012476c>] (kthread) from [<c0102250>] (ret_from_fork+0x14/0x24)
---[ end trace 2f3901588ebb5c52 ]---
ftgmac100 1e660000.ethernet eth0: NCSI interface up
Sep 08 21:21:09 barreleye kernel: Linux version 4.7.1-6d93f625315a1e9b51d017a78f8c0938a097ea85 (jenkins@hudson) (gcc version 5.3.0 (GCC) ) #1 Wed Aug 17 10:17:55 UTC 2016
A second WARN at boot. This time the network interface appears to have been brought down and up a a few times
Sep 08 21:23:00 barreleye systemd[1]: Started Read system/chassis/motherboard EEPROM.
Sep 08 21:23:00 barreleye systemd[1]: Starting Sync system/chassis/motherboard...
Sep 08 21:23:06 barreleye netman.py[675]: Restarting networkd service...
Sep 08 21:23:06 barreleye inarp[695]: updating interface: eth0, [00:01:6c:80:02:78]
Sep 08 21:23:06 barreleye systemd-networkd[691]: eth0: Lost carrier
Sep 08 21:23:06 barreleye inarp[695]: updating interface: eth0, [00:01:6c:80:02:78]
Sep 08 21:23:06 barreleye inarp[695]: updating interface: eth0, [00:01:6c:80:02:78]
Sep 08 21:23:06 barreleye systemd-networkd[691]: eth0: Gained carrier
Sep 08 21:23:06 barreleye systemd-timesyncd[669]: Network configuration changed, trying to establish connection.
Sep 08 21:23:06 barreleye systemd[1]: Stopping Network Service...
Sep 08 21:23:06 barreleye systemd[1]: Closed Phosphor Webserver socket.
Sep 08 21:23:06 barreleye systemd[1]: Stopping Phosphor Webserver socket.
Sep 08 21:23:06 barreleye systemd[1]: Listening on Phosphor Webserver socket.
Sep 08 21:23:06 barreleye systemd[1]: Stopped Network Service.
Sep 08 21:23:06 barreleye systemd[1]: Starting Network Service...
Sep 08 21:23:06 barreleye systemd-timesyncd[669]: Network configuration changed, trying to establish connection.
Sep 08 21:23:06 barreleye systemd-networkd[842]: Enumeration completed
Sep 08 21:23:06 barreleye systemd-timesyncd[669]: Network configuration changed, trying to establish connection.
Sep 08 21:23:06 barreleye systemd-timesyncd[669]: Network configuration changed, trying to establish connection.
Sep 08 21:23:06 barreleye systemd-timesyncd[669]: Network configuration changed, trying to establish connection.
Sep 08 21:23:07 barreleye kernel: ------------[ cut here ]------------
Sep 08 21:23:07 barreleye kernel: WARNING: CPU: 0 PID: 742 at /var/lib/jenkins/workspace/openbmc-build-gerrit-trigger-multi/distro/ubuntu/target/barreleye/openbmc/build/tmp/work-shared/barreleye/kernel-source/net/ncsi/ncsi-manage.c:221 ncsi_start_channel_monitor+0x44/0x88
Sep 08 21:23:07 barreleye kernel: CPU: 0 PID: 742 Comm: kworker/0:5 Not tainted 4.7.1-6d93f625315a1e9b51d017a78f8c0938a097ea85 #1
Sep 08 21:23:07 barreleye kernel: Hardware name: ASpeed SoC
Sep 08 21:23:07 barreleye kernel: Workqueue: events ncsi_dev_work
Sep 08 21:23:07 barreleye kernel: [<c0106cbc>] (unwind_backtrace) from [<c0104f04>] (show_stack+0x10/0x14)
Sep 08 21:23:07 barreleye kernel: [<c0104f04>] (show_stack) from [<c010ec3c>] (__warn+0xc4/0xf0)
Sep 08 21:23:07 barreleye kernel: [<c010ec3c>] (__warn) from [<c010ed5c>] (warn_slowpath_null+0x1c/0x24)
Sep 08 21:23:07 barreleye kernel: [<c010ed5c>] (warn_slowpath_null) from [<c03f4174>] (ncsi_start_channel_monitor+0x44/0x88)
Sep 08 21:23:07 barreleye kernel: [<c03f4174>] (ncsi_start_channel_monitor) from [<c03f4e08>] (ncsi_configure_channel+0x258/0x2b8)
Sep 08 21:23:07 barreleye kernel: [<c03f4e08>] (ncsi_configure_channel) from [<c03f5558>] (ncsi_dev_work+0x39c/0x3e8)
Sep 08 21:23:07 barreleye kernel: [<c03f5558>] (ncsi_dev_work) from [<c011fee4>] (process_one_work+0x1b8/0x2fc)
Sep 08 21:23:07 barreleye kernel: [<c011fee4>] (process_one_work) from [<c0120318>] (worker_thread+0x2c0/0x3f8)
Sep 08 21:23:07 barreleye kernel: [<c0120318>] (worker_thread) from [<c012476c>] (kthread+0xd0/0xe8)
Sep 08 21:23:07 barreleye kernel: [<c012476c>] (kthread) from [<c0102250>] (ret_from_fork+0x14/0x24)
Sep 08 21:23:07 barreleye kernel: ---[ end trace 98a124f244ab4360 ]---
Sep 08 21:23:07 barreleye kernel: ftgmac100 1e660000.ethernet eth0: NCSI interface up
Sep 08 21:23:07 barreleye kernel[679]: ------------[ cut here ]------------
Sep 08 21:23:07 barreleye systemd-timesyncd[669]: Network configuration changed, trying to establish connection.
Sep 08 21:23:07 barreleye kernel[679]: WARNING: CPU: 0 PID: 742 at /var/lib/jenkins/workspace/openbmc-build-gerrit-trigger-multi/distro/ubuntu/target/barreleye/openbmc/build/tmp/work-shared/barreleye/kernel-source/net/ncsi/ncsi-manage.c:221 ncsi_start_channel_monitor+0x44/0x88
Sep 08 21:23:07 barreleye kernel[679]: CPU: 0 PID: 742 Comm: kworker/0:5 Not tainted 4.7.1-6d93f625315a1e9b51d017a78f8c0938a097ea85 #1
Sep 08 21:23:07 barreleye kernel[679]: Hardware name: ASpeed SoC
Sep 08 21:23:07 barreleye kernel[679]: Workqueue: events ncsi_dev_work
Sep 08 21:23:07 barreleye kernel[679]: [<c0106cbc>] (unwind_backtrace) from [<c0104f04>] (show_stack+0x10/0x14)
Sep 08 21:23:07 barreleye kernel[679]: [<c0104f04>] (show_stack) from [<c010ec3c>] (__warn+0xc4/0xf0)
Sep 08 21:23:07 barreleye kernel[679]: [<c010ec3c>] (__warn) from [<c010ed5c>] (warn_slowpath_null+0x1c/0x24)
Sep 08 21:23:07 barreleye kernel[679]: [<c010ed5c>] (warn_slowpath_null) from [<c03f4174>] (ncsi_start_channel_monitor+0x44/0x88)
Sep 08 21:23:07 barreleye kernel[679]: [<c03f4174>] (ncsi_start_channel_monitor) from [<c03f4e08>] (ncsi_configure_channel+0x258/0x2b8)
Sep 08 21:23:07 barreleye kernel[679]: [<c03f4e08>] (ncsi_configure_channel) from [<c03f5558>] (ncsi_dev_work+0x39c/0x3e8)
Sep 08 21:23:07 barreleye kernel[679]: [<c03f5558>] (ncsi_dev_work) from [<c011fee4>] (process_one_work+0x1b8/0x2fc)
Sep 08 21:23:07 barreleye kernel[679]: [<c011fee4>] (process_one_work) from [<c0120318>] (worker_thread+0x2c0/0x3f8)
Sep 08 21:23:07 barreleye kernel[679]: [<c0120318>] (worker_thread) from [<c012476c>] (kthread+0xd0/0xe8)
Sep 08 21:23:07 barreleye kernel[679]: [<c012476c>] (kthread) from [<c0102250>] (ret_from_fork+0x14/0x24)
Sep 08 21:23:07 barreleye kernel[679]: ---[ end trace 98a124f244ab4360 ]---
Sep 08 21:23:07 barreleye kernel[679]: ftgmac100 1e660000.ethernet eth0: NCSI interface up
Sep 08 21:23:08 barreleye sync_inventory_items.py[832]: Rebooting BMC to set uuid
This is not a crash as originally stated. The warning is to tell developers about an unexpected condition in the driver.
I also notice that the interface was up, and then you rebooted the BMC. Was this reboot initiated over the network, or from the serial console?
@gwshan this looks similar to the issues you are chasing.
https://github.com/openbmc/openbmc/issues/512#issuecomment-246270147
@shenki First from post BMC code update and second from PDU power cycle reboot
Thanks.
@gwshan this fits the pattern that I saw where I could trigger the WARN though using pflash to update the host pnor.
I think that certain flash related loads are starving the ncsi monitoring kernel thread.
We had a barreleye (BarE01) just sitting doing nothing, suddenly it didn't ping and I saw this in the console output:
systemd[1]: Started Apply Kernel Variables.
systemd[1]: Started Journal Service.
systemd-journald[407]: Received request to flush runtime journal from PID 1
ftgmac100 1e660000.ethernet eth0: NCSI interface up
random: nonblocking pool is initialized
------------[ cut here ]------------
WARNING: CPU: 0 PID: 406 at /var/lib/jenkins/workspace/openbmc-build-gerrit-trigger-multi/distro/ubuntu/target/barreleye/openbmc/build/tmp/work-shared/barreleye/kernel-source/net/ncsi/ncsi-manage.c:221 ncsi_start_channel_monitor+0x44/0x88
CPU: 0 PID: 406 Comm: kworker/0:2 Not tainted 4.7.3-be2001133835be80ba4657232b823ba2daaac0d6 #1
Hardware name: ASpeed SoC
Workqueue: events ncsi_dev_work
[<c0106cbc>] (unwind_backtrace) from [<c0104f04>] (show_stack+0x10/0x14)
[<c0104f04>] (show_stack) from [<c010eda4>] (__warn+0xc4/0xf0)
[<c010eda4>] (__warn) from [<c010eec4>] (warn_slowpath_null+0x1c/0x24)
[<c010eec4>] (warn_slowpath_null) from [<c03f454c>] (ncsi_start_channel_monitor+0x44/0x88)
[<c03f454c>] (ncsi_start_channel_monitor) from [<c03f51e0>] (ncsi_configure_channel+0x258/0x2b8)
[<c03f51e0>] (ncsi_configure_channel) from [<c03f5930>] (ncsi_dev_work+0x39c/0x3e8)
[<c03f5930>] (ncsi_dev_work) from [<c012004c>] (process_one_work+0x1b8/0x2fc)
[<c012004c>] (process_one_work) from [<c0120480>] (worker_thread+0x2c0/0x3f8)
[<c0120480>] (worker_thread) from [<c01248d4>] (kthread+0xd0/0xe8)
[<c01248d4>] (kthread) from [<c0102250>] (ret_from_fork+0x14/0x24)
---[ end trace f17a720d2ca946dd ]---
ftgmac100 1e660000.ethernet eth0: NCSI interface up
Running fairly new stuff
root@barreleye:~# uname -a Linux barreleye 4.7.3-be2001133835be80ba4657232b823ba2daaac0d6 #1 Wed Sep 14 12:12:05 UTC 2016 armv5tejl GNU/Linux root@barreleye:~# cat /etc/os-release ID="openbmc-phosphor" NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)" VERSION="v1.99.0-114" VERSION_ID="v1.99.0-114-ged52dc1-dirty" PRETTY_NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) v1.99.0-114" BUILD_ID="v1.99.0"
BarD04 also hit this today. Network stopped working for a bit, and then suddenly came back:
systemd[1]: Started Journal Service.
systemd-journald[410]: Received request to flush runtime journal from PID 1
ftgmac100 1e660000.ethernet eth0: NCSI interface up
random: nonblocking pool is initialized
------------[ cut here ]------------
WARNING: CPU: 0 PID: 401 at /tmp/anoo/obmc/barreleye-master/build/tmp/work-shared/barreleye/kernel-source/net/ncsi/ncsi-manage.c:221 ncsi_start_channel_monitor+0x44/0x88
CPU: 0 PID: 401 Comm: kworker/0:2 Not tainted 4.7.1-6d93f625315a1e9b51d017a78f8c0938a097ea85 #1
Hardware name: ASpeed SoC
Workqueue: events ncsi_dev_work
[<c0106cbc>] (unwind_backtrace) from [<c0104f04>] (show_stack+0x10/0x14)
[<c0104f04>] (show_stack) from [<c010ec3c>] (__warn+0xc4/0xf0)
[<c010ec3c>] (__warn) from [<c010ed5c>] (warn_slowpath_null+0x1c/0x24)
[<c010ed5c>] (warn_slowpath_null) from [<c03f4174>] (ncsi_start_channel_monitor+0x44/0x88)
[<c03f4174>] (ncsi_start_channel_monitor) from [<c03f4e08>] (ncsi_configure_channel+0x258/0x2b8)
[<c03f4e08>] (ncsi_configure_channel) from [<c03f5558>] (ncsi_dev_work+0x39c/0x3e8)
[<c03f5558>] (ncsi_dev_work) from [<c011fee4>] (process_one_work+0x1b8/0x2fc)
[<c011fee4>] (process_one_work) from [<c0120318>] (worker_thread+0x2c0/0x3f8)
[<c0120318>] (worker_thread) from [<c012476c>] (kthread+0xd0/0xe8)
[<c012476c>] (kthread) from [<c0102250>] (ret_from_fork+0x14/0x24)
---[ end trace b366fbedac91fb3c ]---
ftgmac100 1e660000.ethernet eth0: NCSI interface up
root@barreleye:~# uname -a Linux barreleye 4.7.1-6d93f625315a1e9b51d017a78f8c0938a097ea85 #1 Sun Sep 11 10:48:54 CDT 2016 armv5tejl GNU/Linux
There is one issue I didn't considered when introducing channel monitor. When the IID (Instance ID) in the GLS (Get Link State) command changes from 255 to 1. The BCM5718 never gives response for the command who has 1 for its IID. I already had a patch fixing it and will post it together with other NCSI patches later.
One interesting tidbit, is once we hit the fail (i.e. you can't ping the BMC externally), if you go into the BMC serial console, and do a "traceroute" to a system within the lab, the ping to the BMC suddenly starts working.
We continue to hit this consistently across all of our systems here in Austin and are eagerly anticipating a fix.
Copy to internal email exchange to make this complete:
Andrew, thanks a lot for sharing the machines for debugging. it's not surprise to see non-functional network as two BMCs have same MAC addresses. Please ask the network admin to assign the correct (non-conflicting) MAC addresses to the BMCs.
With same MAC address assigned to two BMC, the gateway (switch) is confused: Which BMC I should forward the ingress packet to? So the phenomenon is: one BMC's network works and another BMC's network should stop working.
BarD-04:
eth0 Link encap:Ethernet HWaddr 00:01:6C:80:02:78
inet addr:9.3.23.32 Bcast:9.3.23.255 Mask:255.255.255.0
BarE-0:
root@barreleye:~# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:01:6C:80:02:78
inet addr:9.3.29.119 Bcast:9.3.29.255 Mask:255.255.255.0
Fixed as of c4fb73e744d4eb7dc845031dec9aa0543652c02f
root@witherspoon:~# cat /etc/os-release
ID="openbmc-phosphor"
NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)"
VERSION="v1.99.5-28"
VERSION_ID="v1.99.5-28-g60b4452"
PRETTY_NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) v1.99.5-28"
BUILD_ID="v1.99.5"
root@witherspoon:~# uname -a
Linux witherspoon 4.7.10-15c30d4deb0a01f398230395dc0d9af4be92d856 #1 Thu May 11 07:21:01 UTC 2017 armv6l GNU/Linux
root@witherspoon:~#
seeing here again this one
------------[ cut here ]------------
WARNING: CPU: 0 PID: 847 at /var/lib/jenkins-slave/workspace/openbmc-build/distro/ubuntu/target/witherspoon/openbmc/build/tmp/work-shared/witherspoon/kernel-source/net/ncsi/ncsi-manage.c:240 ncsi_start_channel_monitor+0x40/0x78
CPU: 0 PID: 847 Comm: kworker/0:6 Not tainted 4.7.10-15c30d4deb0a01f398230395dc0d9af4be92d856 #1
Hardware name: ASpeed SoC
Workqueue: events ncsi_dev_work
[<8010723c>] (unwind_backtrace) from [<801054c4>] (show_stack+0x10/0x14)
[<801054c4>] (show_stack) from [<80110be0>] (__warn+0xd8/0xf4)
[<80110be0>] (__warn) from [<80110cf0>] (warn_slowpath_null+0x1c/0x24)
[<80110cf0>] (warn_slowpath_null) from [<80435a34>] (ncsi_start_channel_monitor+0x40/0x78)
[<80435a34>] (ncsi_start_channel_monitor) from [<804366c0>] (ncsi_configure_channel+0x294/0x2d4)
[<804366c0>] (ncsi_configure_channel) from [<80436e88>] (ncsi_dev_work+0x3a8/0x3d8)
[<80436e88>] (ncsi_dev_work) from [<80122008>] (process_one_work+0x18c/0x2f0)
[<80122008>] (process_one_work) from [<8012241c>] (worker_thread+0x280/0x3cc)
[<8012241c>] (worker_thread) from [<801267ec>] (kthread+0xdc/0xf0)
[<801267ec>] (kthread) from [<80102278>] (ret_from_fork+0x14/0x3c)
---[ end trace c8b731184dbdbd8e ]---
[x] Journald log
journald_v1.99.5-28-g60b4452_IPL_failure_stack_trace.txt
[x] dmsg log
journald_v1.99.5-28-g60b4452_IPL_failure_dmsg.txt
The base issue https://github.com/openbmc/openbmc/issues/563
Problem : The NCSI crashed and the BMC connection went offline from network.
We also see the serial console hung eventually while running test which may not be related to this.. Was able to get back online via PDU power cycle but its consistently landing into problem.
From Serial console:
barreleye login: ftgmac100 1e660000.ethernet eth0: NCSI interface up random: nonblocking pool is initialized ------------[ cut here ]------------ WARNING: CPU: 0 PID: 4 at /var/lib/jenkins/workspace/openbmc-build-gerrit-trigger-multi/distro/ubuntu/target/barreleye/openbmc/build/tmp/work-shared/barreleye/kernel-source/net/ncsi/ncsi-manage.c:221 ncsi_start_channel_monitor+0x44/0x88 CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 4.7.1-6d93f625315a1e9b51d017a78f8c0938a097ea85 #1 Hardware name: ASpeed SoC Workqueue: events ncsi_dev_work from
from
from
from
from
from
from
from
from
from
---[ end trace 2f3901588ebb5c52 ]---
ftgmac100 1e660000.ethernet eth0: NCSI interface up
Another post PDU cycle log
root@barreleye:~# ------------[ cut here ]------------ WARNING: CPU: 0 PID: 742 at /var/lib/jenkins/workspace/openbmc-build-gerrit-trigger-multi/distro/ubuntu/target/barreleye/openbmc/build/tmp/work-shared/barreleye/kernel-source/net/ncsi/ncsi-manage.c:221 ncsi_start_channel_monitor+0x44/0x88 CPU: 0 PID: 742 Comm: kworker/0:5 Not tainted 4.7.1-6d93f625315a1e9b51d017a78f8c0938a097ea85 #1 Hardware name: ASpeed SoC Workqueue: events ncsi_dev_work from
from
from
from
from
from
from
from
from
from
---[ end trace 98a124f244ab4360 ]---
ftgmac100 1e660000.ethernet eth0: NCSI interface up