jayanta525 / openwrt-nanopi-r2s

OpenWrt support for FriendlyElec NanoPi R2S RK3328 SoC board with 2x1000Mbps ports. This repository is not a fork of friendlywrt but a fork of upstream/openwrt.
https://openwrt.org
GNU General Public License v2.0

Intermittent ethernet disconnections #5

Closed carloscm closed 4 years ago

carloscm commented 4 years ago

I've been testing this build for a week now: https://github.com/jayanta525/openwrt-nanopi-r2s/releases/download/v1.2/openwrt-rockchip-armv8-friendlyelec_nanopi-r2-rev00-ext4-sysupgrade.img.gz

I'm finding that the ethernet ports occasionally (1-2 times/day) disconnect, and only a reset of the device fixes the problem. There is no load or traffic on the device, just an SSH session running htop. Nothing is printed in the syslog; the ports just stop accepting new connections or pings. The device keeps running and is not hung, which I've verified by making the system log persist across reboots.

For eth0 (SoC port) I've changed the cable and it appears to be working now. In any case I would like to point to this patch from FriendlyWrt: https://github.com/friendlyarm/friendlywrt/commit/b02f11d1db4f5c5a8fa424a09738bec71bdff32a Apparently some offloading features in the SoC are buggy and they are disabling them.

For eth1 (USB port) I've also found reports that the RTL8153 (the USB Ethernet chip) can disconnect in response to some USB power management commands. I will try some workarounds and monitor the situation, but the problem is impossible to trigger on purpose; it requires days of passive testing.

I'm opening this issue in the hope that other users are also testing the R2S with this openwrt build, so we can share information if they have the same issues. If it's not appropriate I can delete it, thanks.

carloscm commented 4 years ago

I'm going to run a startup script that does this:

    echo "Disable offloading for built-in ethernet"
    /usr/sbin/ethtool -K eth0 rx off tx off
    echo "Disable USB autosuspend"
    echo -1 >/sys/module/usbcore/parameters/autosuspend

I will comment again in a few days if the problems stop. So far they have, but it's too early to tell.
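
A note on the autosuspend line above: per the kernel's USB power-management documentation, the `usbcore` `autosuspend` parameter only sets the default for devices enumerated afterwards, so a device that is already attached when the script runs may keep its previous setting. Below is a minimal sketch of also switching existing devices to always-on through their `power/control` files. The function name and its optional directory argument are hypothetical, added only so the loop can be tried against a scratch directory, and whether this helps with the disconnections described here is untested.

```shell
#!/bin/sh
# Sketch: force every already-enumerated USB device to stay powered.
# The optional first argument is a hypothetical override of the sysfs
# directory, intended for dry runs; on a real system leave it unset.
disable_usb_autosuspend() {
    root="${1:-/sys/bus/usb/devices}"
    for ctl in "$root"/*/power/control; do
        [ -w "$ctl" ] || continue   # skip entries without a writable control file
        echo on > "$ctl"            # "on" keeps the device from autosuspending
    done
}
```

Calling `disable_usb_autosuspend` with no argument from the startup script would apply it to `/sys/bus/usb/devices`.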

carloscm commented 4 years ago

Link disconnection happened again on eth1, even with USB autosuspend disabled. Here's the log, obtained after a reboot:

Wed Jun 17 07:07:08 2020 authpriv.info dropbear[3578]: Child connection from 192.168.2.220:58508
Wed Jun 17 07:07:08 2020 authpriv.notice dropbear[3578]: Auth succeeded with blank password for 'root' from 192.168.2.220:58508
Wed Jun 17 10:14:14 2020 authpriv.info dropbear[3578]: Exit (root) from <192.168.2.220:58508>: Keepalive timeout
Wed Jun 17 10:33:05 2020 daemon.notice netifd: Network device 'eth1' link is down
Wed Jun 17 10:33:05 2020 kern.info kernel: [13481.818355] br-lan: port 1(eth1) entered disabled state
Wed Jun 17 10:33:05 2020 kern.info kernel: [13481.821280] r8152 5-1:1.0 eth1: carrier off
Wed Jun 17 10:33:06 2020 daemon.notice netifd: bridge 'br-lan' link is down
Wed Jun 17 10:33:06 2020 daemon.notice netifd: Interface 'lan' has link connectivity loss
Wed Jun 17 10:33:14 2020 kern.notice kernel: [13490.394943] r8152 5-1:1.0 eth1: Promiscuous mode enabled
Wed Jun 17 10:33:14 2020 kern.info kernel: [13490.395729] r8152 5-1:1.0 eth1: carrier on
Wed Jun 17 10:33:14 2020 daemon.notice netifd: Network device 'eth1' link is up
Wed Jun 17 10:33:14 2020 daemon.notice netifd: bridge 'br-lan' link is up
Wed Jun 17 10:33:14 2020 daemon.notice netifd: Interface 'lan' has link connectivity
Wed Jun 17 10:33:14 2020 kern.info kernel: [13490.397083] br-lan: port 1(eth1) entered blocking state
Wed Jun 17 10:33:14 2020 kern.info kernel: [13490.397591] br-lan: port 1(eth1) entered forwarding state

There was nothing going on, just an idle SSH session with no traffic or load. At "Wed Jun 17 10:14:14 2020" the SSH session times out, and 19m later eth1 is marked as down. It comes back up about 10s later, but despite what the log says, it was still impossible to connect to it or even ping its IP. Windows detected a network link but reported it as "unidentified", which means it could not get a DHCP reply. This could only be fixed by power cycling the R2S. I will continue trying other things.
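
To pull just the link transitions out of a longer log like the one above, a small filter can help. This is a sketch that assumes the netifd message format shown in this log; `link_events` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read syslog text on stdin and print only the netifd
# "Network device '<name>' link is up/down" lines.
link_events() {
    grep -E "netifd: Network device '[^']+' link is (up|down)"
}
```

On the device this could be fed from the system log, for example `logread | link_events`.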

jayanta525 commented 4 years ago

I received the NanoPi R2S board just yesterday. I am a bit busy with my semester exams; I will look into this issue ASAP.

Wed Jun 17 10:33:05 2020 kern.info kernel: [13481.818355] br-lan: port 1(eth1) entered disabled state
Wed Jun 17 10:33:05 2020 kern.info kernel: [13481.821280] r8152 5-1:1.0 eth1: carrier off

Seems like the kernel disabled eth1.

Could you check if this issue exists with the friendlywrt build?

carloscm commented 4 years ago

Testing FriendlyWrt is in my plans, but I'm taking it slow for now (running stock openwrt is my ultimate goal). In the meantime I'm already adapting the fixes I can find in their repo to my own setup script, for example this is their ethernet offloading bugfix for eth0: https://github.com/friendlyarm/friendlywrt/commit/b02f11d1db4f5c5a8fa424a09738bec71bdff32a

carloscm commented 4 years ago

For the first time since I received the R2S a week ago, it has managed to remain connected for more than 24h on both ports.

For eth0, changing a cable and applying the fix mentioned in the previous message fixed it for me a few days ago.

For eth1, I tried to look at the FriendlyWrt repository and also in another fork I found (https://github.com/klever1988/nanopi-openwrt), but I couldn't find any references for patches or fixes about eth1 and the RTL8152/3. I'm not familiar with openwrt anyway, so I probably missed them, if they exist.

I am now trying this advice: https://www.raspberrypi.org/forums/viewtopic.php?t=242964

By running this command on startup:

    ethtool -s eth1 speed 1000 duplex full

I understand this is not a general fix, since it hardcodes a link speed, making it unusable for 100Mbps devices. It is also possible the USB Ethernet chip is fine and this is a problem in my local network. In any case, as I mentioned, this is the first time eth1 has stayed up for more than 24h for me.
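
After forcing link settings like this, it may be worth confirming what the interface actually reports. The sketch below reads the `speed` and `duplex` attributes that the kernel exposes under `/sys/class/net`. `link_status` and its optional second argument are hypothetical names; the second argument exists only so the function can be exercised against a scratch directory.

```shell
#!/bin/sh
# Sketch: print the speed and duplex the kernel reports for an interface.
# Usage: link_status <iface> [sysfs-net-dir]
link_status() {
    iface="$1"
    root="${2:-/sys/class/net}"
    # Reading these files can fail, for example if the interface does not
    # exist; fall back to "unknown" instead of aborting.
    speed=$(cat "$root/$iface/speed" 2>/dev/null) || speed=unknown
    duplex=$(cat "$root/$iface/duplex" 2>/dev/null) || duplex=unknown
    echo "$iface: speed=$speed duplex=$duplex"
}
```

Running `link_status eth1` before and after the `ethtool -s` command would show whether the setting changed anything.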

I will now start running long duration tests with actual traffic, instead of just idle connections.

jayanta525 commented 4 years ago

I am unable to reproduce this issue with my setup.

Both eth0 and eth1 are connected to a TP-Link TL-SG108 switch in my case. Flow control is off on both interfaces.

I have force-pushed some changes to the branch recently. The work is on hold now, and the next commit should include major fixes and a rebase onto the latest openwrt. This might fix the issue you're having.

kzipkzip commented 4 years ago

I'm running the current release 1.2 with no issues here. Uptime is 24 hours with no disconnection problems. I had a small issue with dnsmasq not issuing DHCP addresses on first boot; restarting the service fixed it. May have been a time sync problem? NTP is not installed by default, so the device thought it was 2016, and DHCP leases are time-sensitive, so maybe? I installed NTP and it has been fine since anyway.

carloscm commented 4 years ago

> I have force-pushed some changes to the branch recently. The work is on hold now, and the next commit should include major fixes and a rebase onto the latest openwrt. This might fix the issue you're having.

I haven't had any issues since my latest test, but as I said, I now suspect it was just my local setup. I will keep installing new snapshots and testing them. Thanks for your work!