openbmc / phosphor-networkd

Network Service becomes unavailable during AC-Cycle stress test #18

Closed johnhcwang closed 8 years ago

johnhcwang commented 8 years ago

I'm not familiar with network initialization, so I've attached the journal log and captured some strange messages for your information. Any suggestions that could help me debug this issue are welcome.

When this issue occurred, there were some error messages about networkd, shown below. lan_fail.txt

Jun 28 06:30:00 barreleye netman_watch_dns[982]: Error opening[/run/systemd/netif/state]
Jun 28 06:30:00 barreleye netman_watch_dns[982]: Error processing inotify event
Jun 28 06:30:01 barreleye systemd-networkd[968]: Enumeration completed
Jun 28 06:30:01 barreleye inarp[970]: updating interface: eth0, [52:a6:f6:0e:6c:02]
Jun 28 06:30:01 barreleye systemd[1]: Started Network Service.
Jun 28 06:30:01 barreleye systemd[1]: Reached target Network.
Jun 28 06:30:01 barreleye systemd-networkd[968]: eth0: Gained carrier
Jun 28 06:30:01 barreleye systemd-networkd[968]: eth0: Configured
Jun 28 06:30:01 barreleye systemd-timesyncd[929]: Network configuration changed, trying to establish connection.

I then tried to bring the eth0 device back up, but got the SIOCSIFFLAGS error. lan_fail_1.txt

root@barreleye:~# ifconfig eth0 up
ifconfig: SIOCSIFFLAGS: No such device or address
root@barreleye:~# ifconfig
lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:21126 errors:0 dropped:0 overruns:0 frame:0
          TX packets:21126 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1 
          RX bytes:1605792 (1.5 MiB)  TX bytes:1605792 (1.5 MiB)
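
For an automated AC-cycle stress loop, a check like the following could flag the cycles in which eth0 is missing from the `ifconfig` listing (plain `ifconfig` without `-a` lists only interfaces that are up). This is only a minimal sketch: the function names `up_interfaces` and `interface_is_up` are hypothetical, and it assumes the net-tools/busybox output layout shown above, where each interface name starts at column 0 and is followed by `Link encap:`.

```python
import re

# Sample text copied from the `ifconfig` output above (only `lo` is listed).
SAMPLE = """\
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
"""


def up_interfaces(ifconfig_output):
    """Return the interface names listed in plain `ifconfig` output.

    Assumption: each interface block starts at column 0 with the name,
    followed by whitespace and "Link encap:"; detail lines are indented.
    """
    return re.findall(r"^(\S+)\s+Link encap:", ifconfig_output, flags=re.MULTILINE)


def interface_is_up(ifconfig_output, name="eth0"):
    """True if `name` appears in the listing, i.e. the interface is up."""
    return name in up_interfaces(ifconfig_output)


if __name__ == "__main__":
    print(up_interfaces(SAMPLE))
    print(interface_is_up(SAMPLE, "eth0"))
```

In a stress loop, the text would come from running `ifconfig` on the BMC after each AC cycle; how that output is collected is left out here.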

The journal log shows that it can't bring up this interface.

root@barreleye:~# journalctl -f
Jun 28 08:42:07 barreleye systemd[1]: Starting Network Service...
Jun 28 08:42:08 barreleye systemd-networkd[1086]: Enumeration completed
Jun 28 08:42:08 barreleye systemd[1]: Started Network Service.
Jun 28 08:42:08 barreleye systemd-networkd[1086]: eth0: eth0            : could not bring up interface: No such device or address
Jun 28 08:42:08 barreleye systemd-networkd[1086]: eth0: eth0            : could not set route: Network is unreachable
Jun 28 08:42:08 barreleye systemd-networkd[1086]: eth0: Configured

^C

shenki commented 8 years ago

Can you please paste the output of dmesg |grep ftgmac100?

johnhcwang commented 8 years ago

dmesg |grep ftgmac100
ftgmac100 1e660000.ethernet: Using NCSI interface
ftgmac100 1e660000.ethernet: Read MAC address from chip 52:a6:f6:0e:6c:02
ftgmac100: NCSI interface down

shenki commented 8 years ago

I think you are hitting a race condition in the network driver. This has since been fixed in newer versions of the kernel.

I can provide a backport of the fix.

johnhcwang commented 8 years ago

That's good news. I'll pick up your fix and verify again. Thanks.

johnhcwang commented 8 years ago

Hi @shenki, I saw that you tagged openbmc-4.4-20160722-1 and moved to the stable kernel 4.4.15 on the v1.0-stable branch. Does it include the fix for this issue?

shenki commented 8 years ago

No, I did that before you opened this issue.

I made a proposed fix today but it was incorrect. Will try again tomorrow.

anoo1 commented 8 years ago

Hi @gwshan, I believe you're sending the correct fix to Joel. Could that be done this week? Thanks!

gwshan commented 8 years ago

Yeah, I'm working on this and will post the fix ASAP. I guess it's the issue reported against dev-4.4, as Joel mentioned to me on IRC.

gwshan commented 8 years ago

The fix requested by Joel has been sent to the openbmc mailing list and is awaiting Joel's review and pickup.

williamspatrick commented 8 years ago

Resolved with openbmc/openbmc@d470b1f154b11b86d23438eff1f58d7f592c6d2a

johnhcwang commented 8 years ago

Hi @shenki, I still see this issue on the latest obmc v1.0.4, and I can't restart eth0 because of the SIOCSIFFLAGS error. The dmesg output is the same as in my previous message; does that mean I'm still hitting the race condition?

root@barreleye:~# dmesg |grep ftgmac100
ftgmac100 1e660000.ethernet: Using NCSI interface
ftgmac100 1e660000.ethernet: Read MAC address from chip 32:c6:fa:0d:e6:ae
ftgmac100: NCSI interface down
root@barreleye:~# obmcutil state
  = HOST_BOOTED
root@barreleye:~# uname -a
Linux barreleye 4.4.16-openbmc-4.4-20160804-1 #1 Tue Aug 9 17:30:01 CST 2016 armv5tejl GNU/Linux

anoo1 commented 8 years ago

Hi @johnhcwang, please open a new issue in openbmc/openbmc and add a reference to this issue. Thanks.

gwshan commented 8 years ago

Hi @johnhcwang, please let me know the best way (IRC/timezone/email etc) to contact you so that I can understand more about the issue, thanks! Currently, I cannot access a barreleye.

gwshan commented 8 years ago

I had a talk with @johnhcwang. This is the likely story: no NCSI channels are probed from the BCM5719 when the network interface is brought up for the first time. Because NCSI channel enumeration is done only once, the network interface ends up with no workable NCSI device associated, meaning the network interface won't work later on.

I will provide a tentative patch for @johnhcwang to try as discussed, thanks!
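
The failure mode described above can be illustrated with a toy model. This is only a sketch for reasoning about the symptom, not the ftgmac100/NCSI driver code: the class `ToyNcsiInterface`, the `probe` callable, and the `reprobe_if_empty` flag are all hypothetical, and the re-probe variant is just one conceivable behaviour, not necessarily what the tentative patch does. It assumes that the first probe finds no channels and that later probes would succeed.

```python
import errno


class ToyNcsiInterface:
    """Toy model of an interface whose NCSI channels are enumerated lazily."""

    def __init__(self, probe, reprobe_if_empty=False):
        self._probe = probe                  # hypothetical: returns a list of channel ids
        self._reprobe_if_empty = reprobe_if_empty
        self._channels = None                # None means "never enumerated"

    def bring_up(self):
        # Enumerate on first use; optionally again if the cached list is empty.
        if self._channels is None or (self._reprobe_if_empty and not self._channels):
            self._channels = list(self._probe())
        if not self._channels:
            # ENXIO is the errno behind "No such device or address".
            raise OSError(errno.ENXIO, "No such device or address")
        return self._channels[0]


def make_flaky_probe():
    """Probe that finds nothing on the first call and channel 0 afterwards."""
    state = {"calls": 0}

    def probe():
        state["calls"] += 1
        return [] if state["calls"] == 1 else [0]

    return probe, state


if __name__ == "__main__":
    iface = ToyNcsiInterface(make_flaky_probe()[0])
    for attempt in (1, 2):
        try:
            print(attempt, "channel", iface.bring_up())
        except OSError as exc:
            print(attempt, "failed:", exc.strerror)
```

With the default settings, every `bring_up()` after an empty first enumeration keeps failing with ENXIO, which matches the repeated "No such device or address" seen when retrying `ifconfig eth0 up`; with `reprobe_if_empty=True`, the second attempt re-enumerates and succeeds in this model.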