Open trunneml opened 7 months ago
There is a comment that some corrupted Ethernet frames could freeze RX DMA on that chip: https://github.com/openwrt/openwrt/issues/13198#issuecomment-1777269938
In the same discussion, another suspected trigger is a different STA disconnecting, after which the iPhone(s) freeze.
No other STA disconnected or joined, when that error occurred the last two times.
Short feedback: my Banana Pi R3 has been running the 5 GHz radio in AC mode for about 4 days now with no disconnections of my iPhones. I think the bug is somewhere in AX mode. Can someone tell me how to debug that stuff?
Are you able to reproduce this issue on OpenWrt main SNAPSHOT?
I have the same behavior with 2 iPhones and a GL-MT6000 (MT7986AV) about every 3-4 days (then sometimes also 4-5x per day). I cannot ping anything on the network from the phones and cannot ping the phones either. But they are still connected to the wifi. Disable/Enable wifi on the phones solves the issue temporarily.
I have configured the router to use AX@80MHz and I am using the latest snapshot with kernel 6.6 but had the issue for months now even with older snapshots using kernel 5.
Unfortunately I cannot yet reliably reproduce it. Does anyone have an idea what I can do?
As my iPhones are the only AX devices I own, I just turned off AX in OpenWrt, so the 5 GHz channel is running in AC mode. Since then, no more problems.
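For anyone who wants to try the same workaround from the command line, here is a minimal sketch of switching one radio from AX (HE) to AC (VHT) with uci. The radio section name (`radio1`) and the 80 MHz width are assumptions, not something stated above; check `uci show wireless` to find the section for your 5 GHz radio.

```shell
# Switch one radio from 802.11ax (HE) to 802.11ac (VHT) at the same width.
# Assumption: $1 is the uci section of the 5 GHz radio (often radio1).
set_ac_mode() {
    radio="$1"
    width="${2:-80}"                  # channel width in MHz, 80 assumed
    uci set "wireless.${radio}.htmode=VHT${width}" || return 1
    uci commit wireless || return 1
    wifi reload                       # re-apply the wireless configuration
}

# Usage on the router: set_ac_mode radio1 80
```

Setting `htmode` back to `HE80` the same way reverts the change.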
GL-MT6000 (MT7986AV) about every 3-4 days (then sometimes also 4-5x per day). I cannot ping anything on the network from the phones and cannot ping the phones either. But they are still connected to the wifi. Disable/Enable wifi on the phones solves the issue temporarily.
I've noticed this behavior as well with 2 out of my three GL-MT6000s. This, despite my otherwise optimistic sounding post about stability here.
It is a very strange issue in that it only affects an STA or two at any time. Further, it seems to be only one MT6000 + a "problem" STA. If I force the stalled STA off the borked MT6000 to where the STA associates to a neighboring MT6000, the STA will pick up and begin communication again. But if it roams back to the borked MT6000, the STA just goes "cold"--stays associated but gives no sign of two-way data communication.
Given the challenge of trying to isolate the what and when around this issue, I've had a hard time articulating it enough to even consider opening an issue. But I'm glad to see this issue has been opened here and will be following intently now.
If I can provide any other supporting info, or test any patches, I'm very willing to help. Thanks!
What mt76 firmware versions are you all running on your devices?
At the present time for me:
# dmesg | grep Firmware
[ 11.707186] platform 15010000.wed: MTK WED WO Firmware Version: DEV_000000, Build Time: 20240507160523
[ 12.355218] mt798x-wmac 18000000.wifi: WM Firmware Version: ____000000, Build Time: 20240507160318
[ 12.439591] mt798x-wmac 18000000.wifi: WA Firmware Version: DEV_000000, Build Time: 20240507160509
I'm on 2023.05-Snapshot:
root@OpenWrt:~# dmesg | grep Firmware
[ 18.233083] mt798x-wmac 18000000.wifi: WM Firmware Version: ____000000, Build Time: 20221012174805
[ 18.480859] mt798x-wmac 18000000.wifi: WA Firmware Version: DEV_000000, Build Time: 2022101217493
I am on snapshot r26399-17ca4cccc6
# dmesg | grep Firmware
[ 12.149544] mt798x-wmac 18000000.wifi: WM Firmware Version: ____000000, Build Time: 20221012174725
[ 12.231740] mt798x-wmac 18000000.wifi: WA Firmware Version: DEV_000000, Build Time: 20221012174937
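To make these reports easier to compare, the firmware lines can be reduced to a component name and a build timestamp. This is only a sketch that assumes the dmesg line format shown in the outputs above; the function name `fw_versions` is made up for illustration.

```shell
# Print "<component> <build time>" for each firmware line read from stdin.
# Assumes lines of the form "... WM Firmware Version: ..., Build Time: <stamp>".
fw_versions() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i == "Firmware" && $(i + 1) == "Version:")
                print $(i - 1), $NF   # word before "Firmware", last field
    }'
}

# Usage on the router: dmesg | fw_versions
```

The build time reads as YYYYMMDDhhmmss, so the outputs above show firmware from October 2022 on two devices and from May 2024 on the other.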
I'm having something similar with a pair of TUF-AX4200 connected via mesh, and I have some clients that suffer from the same issue while others seem to be working fine.
Possibly the same issue appears occasionally between routers: the mesh is up but no traffic goes through, and tcpdump shows that ping packets from one of them are received by the second one, which replies, but the first one doesn't receive the reply. Luckily, I noticed that the mesh breaks immediately if I do iw scan, but I simply haven't figured out what's actually broken, let alone how to fix it. I even tried custom builds from https://github.com/pesa1234/mt76/ with different firmwares/patches adapted/taken from https://git01.mediatek.com/plugins/gitiles/openwrt/feeds/mtk-openwrt-feeds/ but the issue persists, to the point that I reverted back to a TP-Link C2600 so I can test the issue without causing internet outages in my home.
Can anyone with a mesh setup try iw scan on the mesh interface and see if it breaks for them? I suppose it's the same issue, just one we can trigger at will, so it should be easier to figure out what's going on.
@filippz I think I'm seeing something similar on my Cudy WR3000V1 devices (MT7981). Mesh is configured and runs fine for about 10 minutes before mesh peers become unreachable, right after I get "nl80211: wpa_driver_nl80211_event_receive->nl_recvmsgs failed: -5" on the mesh peers. The device with the ethernet-to-mesh bridge seems to keep working, and I can see DHCP requests from mesh peers on my DHCP server, but the replies do not seem to reach them. A "wifi up radio1" lets it work again for a few minutes.
wireless.wmesh5=wifi-iface
wireless.wmesh5.device='radio1'
wireless.wmesh5.mode='mesh'
wireless.wmesh5.mesh_id='M5-MESH'
wireless.wmesh5.encryption='sae'
wireless.wmesh5.key='meshmeshmesh'
wireless.wmesh5.network='mesh5'
wireless.wmesh5.mesh_fwding='0'
wireless.wmesh5.mesh_rssi_threshold='0'
Mesh forwarding is off because BATMAN-ADV will handle routing for me. The same system config works fine when I run it on a TP-Link EAP225 Outdoor (ath10k instead of mt76).
I wonder if whatever bug @PolynomialDivision was experiencing here is also related: https://forum.openwrt.org/t/mt76-wireless-driver-debugging/154514/51?u=ddk
@filippz I think I'm seeing something similar on my Cudy WR3000V1 devices (MT7981). Mesh is configured and runs fine for about 10 minutes before mesh peers become unreachable, right after I get "nl80211: wpa_driver_nl80211_event_receive->nl_recvmsgs failed: -5" on the mesh peers.
Does it stop working immediately if you do iw scan on either side? For me the mesh would work at least for some hours and sometimes days. I tried generating more traffic with iperf3 but I couldn't trigger the issue. I still believe that it breaks sooner the more traffic it gets, but not that quickly.
The device with the ethernet-to-mesh bridge seems to keep working, and I can see DHCP requests from mesh peers on my DHCP server, but the replies do not seem to reach them.
I tried using arping and noticed that I can get a reply to a broadcast (most of the time, not always) but not to unicast (https://github.com/openwrt/openwrt/issues/13880#issuecomment-2028800436). You can try arping / arping -b to see how it behaves for you.
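The arping comparison can be run back to back with a small helper. This is a sketch under assumptions: the function name, the interface (for example br-lan) and the probe count are illustrative, and it relies only on the `-I`, `-c` and `-b` options, which both the iputils and BusyBox versions of arping provide.

```shell
# Run arping twice against the same peer and print both result summaries.
# Assumptions: $1 is the interface facing the mesh (e.g. br-lan), $2 the peer IP.
arp_compare() {
    iface="$1"
    target="$2"
    echo "== default: first request is broadcast, later ones unicast =="
    arping -c 5 -I "$iface" "$target"
    echo "== -b: every request is sent as broadcast =="
    arping -b -c 5 -I "$iface" "$target"
}

# Usage on the router: arp_compare br-lan <peer IP>
```

If only unicast delivery is broken, the first run should show fewer replies than the second, because without `-b` arping switches to unicast after the first reply.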
Hi @filippz, you are correct: I have a daemon that runs a scan every 10 minutes ( ubus call iwinfo scan {\"device\":\"$phynum\" } ) as part of a periodic survey on the system.
Disabling it and running my tests again now.
...a scan every 10 minutes ( ubus call iwinfo scan {"device":"$phynum" } ) ...
I tried building master with the MediaTek feed, and scan also breaks the mesh. I still can't say whether losing the network after some time is actually the same bug that we just trigger differently (with a scan on the mesh) or another bug altogether.
After removing my "survey daemon" I have 7 Cudy WR3000V1 (MT7981) devices that have been up for the entire weekend without issue (mesh on 2.4 GHz and 5 GHz, with an AP on 2.4 GHz as a VIF), so it seems that scanning breaks something.
@DanielRIOT - is your mesh now stable over longer periods?
Hi @filippz, yes it is. The mesh (802.11s / mesh point) is stable as long as I do not do any wifi scans (iw scan, ubus calls to iwinfo scan, or the like). I updated my build to master last Thursday and the behavior is the same as before.
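The scan trigger reported above can be checked in a repeatable way with a short script. This is a sketch, not a confirmed reproducer: the function name, the interface name (for example mesh0), the peer address and the 5 second settle time are all assumptions.

```shell
# Check whether a scan on the mesh interface breaks traffic to a mesh peer.
# Assumptions: $1 is the mesh interface (e.g. mesh0), $2 is a peer IP address.
scan_repro() {
    iface="$1"
    peer="$2"
    if ! ping -c 3 -W 2 "$peer" >/dev/null 2>&1; then
        echo "peer unreachable before scan"
        return 2
    fi
    iw dev "$iface" scan >/dev/null 2>&1   # the suspected trigger
    sleep 5                                # settle time, value assumed
    if ping -c 3 -W 2 "$peer" >/dev/null 2>&1; then
        echo "peer still reachable after scan"
    else
        echo "peer lost after scan"
        return 1
    fi
}

# Usage on a mesh node: scan_repro mesh0 <peer IP>
```

A non-zero return makes it easy to loop the check or call it from a test harness when comparing builds or patches.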
Maybe you could try adding these patches, compile, and check if it helps?
https://github.com/lukasz1992/openwrt/blob/v23.05.4-lukasz1992/package/kernel/mt76/patches/104-mt7915-add-missing-flush.patch
https://github.com/lukasz1992/openwrt/blob/v23.05.4-lukasz1992/package/kernel/mt76/patches/131-partially-move-channel-change-code-to-core.patch
https://github.com/lukasz1992/openwrt/blob/v23.05.4-lukasz1992/package/kernel/mt76/patches/132-add-separate-tx-scheduling-queue-for-off-channel-tx.patch
@lukasz1992 Thanks for taking a look - sadly the issue persists.
I used the official OpenWrt repo at the v23.05.4 tag and added your patches to package/kernel/mt76/patches. While building I saw "Applying .../package/kernel/mt76/patches/104-mt7915-add-missing-flush.patch using plaintext" lines, so I guess the patches were applied successfully. For me a broken scan is not a deal breaker, but the mesh fails with the same symptoms after a while, so I'm assuming that scan is just another way of triggering the issue.
And this? cmonroe/feed-wifi-master@5fdb411
I tried to build with it, but it fails to apply. From manually looking at the code, I'd say that in the v23.05.4 kernel+backports the || local->scanning condition was not present, so this patch would not help. In any case, it seems to me that the changed code is not mt76 specific, so we would see the same issue on other drivers as well.
It seems like I found a workaround that works for me. I installed the custom build by pesa1234 (based on r27280-aa1c1b6e29) on my MT6000, and for a bit more than a week I have had no issues. Before this I had the issue with 2 iPhones 3-5x per day. https://github.com/pesa1234/MT6000_cust_build https://forum.openwrt.org/t/mt6000-custom-build-with-luci-and-some-optimization-kernel-6-6-x/185241
I am not sure what exactly in this build solves the issue and I would be happy to have it solved in a clean snapshot build as well. Maybe someone with more knowledge can figure this out?
Hi @filippz, yes it is. The mesh (802.11s / mesh point) is stable as long as I do not do any wifi scans (iw scan, ubus calls to iwinfo scan, or the like). I updated my build to master last Thursday and the behavior is the same as before.
This commit fixes WiFi scan on mesh.
Describe the bug
On my Banana Pi R3 running 23.05.2, my two iPhones are losing network access after some time. This occurs multiple times a day, when the phone is in standby. They are the only AX clients in my wifi. All other clients/STAs are AC and working fine.
The wifi connection is still shown on the router and on the phone, and the signal is still good. But the phone can't access any website or LuCI (100% packet loss) and is not pingable. Turning wifi off and on helps, and it works again for some hours.
I tried
But it didn't help. Syslog doesn't show anything special. See: https://github.com/openwrt/openwrt/issues/14824#issuecomment-2003037670
When that issue occurs, the iPhone with the connection problem no longer has HE-MCS and HE-NSS attributes in LuCI.
Running tcpdump when that error occurs shows that the AP still receives packets from the iPhone and answers them (for example a DNS request), but it seems that the iPhone doesn't receive the answers.
Maybe this bug is related to: https://github.com/openwrt/mt76/issues/792
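A capture like the one described can be limited to the affected phone with a MAC filter. This is a sketch only: the function name is made up, and the interface (for example br-lan) and the client MAC address are placeholders to fill in for your own setup.

```shell
# Capture all frames to and from one client on the router.
# Assumptions: $1 is the interface the client is bridged on (e.g. br-lan),
# $2 is the client's MAC address.
capture_client() {
    iface="$1"
    mac="$2"
    # -n: no name resolution, -e: print link-level (MAC) headers
    tcpdump -n -e -i "$iface" ether host "$mac"
}

# Usage on the router: capture_client br-lan <client MAC>
```

Seeing requests from the phone followed by replies from the router in this capture, while the phone still reports no connectivity, matches the behavior described above.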
OpenWrt version
r23630-842932a63d and r23806-03a3a729ec
OpenWrt release
23.05.2 and 23.05-SNAPSHOT
OpenWrt target/subtarget
mediatek/filogic
Device
Bananapi BPI-R3 (MT7981/MT7975P)