qiuzi closed 8 years ago
I can confirm that, not only for 5 GHz. It looks definitely strange - nothing in log, connection is up (client is connected with AP), but transmission between the router and client just dies - can't even ping both sides. After some time or reconnect, everything is working again.
Confirmed on SAP-G3200U3.
Is there a way to debug this driver?
Cheers, Piotr
I found that the transmission between my router (with MT7612) and iPhones breaks intermittently. However, the connection between my router and my laptop (Atheros WNIC) is quite stable. What is the model of the wireless NIC of your clients?
When the transmission between my router and my iPhone dies, I captured 802.11 frames transmitted over the air using a third device. It showed that both ends could send and receive 802.11 frames properly during that period, and I can see that frames sent by both sides were ACK'ed by the other side. It looks as if the iPhone is dropping all the packets at layer 3 during the outage. I am wondering whether this is a bug in the driver or in the client devices.
my nic is intel 6300AGN.
lenovo newifi(MT7620 inside)
same as pepe2k.
Absolutely, it is a bug in the MT76 driver.
@sodz, @jiyifeng
Tested/confirmed on:
@pepe2k @jiyifeng I have just run some tests on Lenovo newifi (mt7612e, latest mt76 driver) and two clients: a macbook with bcm43xx, and a tablet with marvell avastar WNIC. On both clients, iperf and ping tests were carried out simultaneously for about 1 hour. It turned out the packet loss rate was negligible, and there were no noticeable outages.
@sodz
I will make deeper tests later, at this moment I'm using drivers from MTK, without any problems.
2.4 GHz has no problem, same as @sodz, but 5 GHz has the same problem as @pepe2k.
@pepe2k OK, now I can confirm that with real-world traffic, the transmission dies randomly. No idea why it did not happen when I tested using iperf.
I think I have traced this issue to A-MPDU. After disabling it (by commenting out line 795 of init.c, "/* ieee80211_hw_set(hw, AMPDU_AGGREGATION); */"), the communication becomes stable, although at a performance loss. @pepe2k could you please try this workaround and see if you can confirm that this is related to AMPDU?
Here are some packets I captured, just when the transmission died: https://www.cloudshark.org/captures/93482862e765 Apparently there are anomalies, but I am not familiar with 802.11 MAC and I don't know what actually went wrong.
please try the latest version
@nbd168 just tried. but it didn't fix this issue.
@sodz Problem solved?
Thanks for your suggestion, @sodz; I think it's indeed related to frame aggregation. At least, after commenting out https://github.com/openwrt/mt76/blob/17c5b83ee789639605f8a38c38efe8530ccf6b30/init.c#L813, clients don't have intermittent timeouts when connecting to WAN hosts. The odd part is that LAN connections seemed fine, and all WAN connections were OK on the router, so for a long time I assumed it was some sort of bridging problem. Then again, the 2.4 GHz link, using a different driver/radio, was always fine…
Another observation is that these timeouts get much worse when increasing channel width (VHT20->40->80), and are more correlated to throughput and amount of data transferred. Maybe these increase the probability of some sort of buggy frame aggregation event. Either way, things are OK after disabling this feature, and performance doesn't seem to have taken a significant hit.
Update: performance does take a major hit with 11ac/mobile clients (maybe because all 11ac frames are supposed to be MPDUs?). I used to be able to saturate a 30 Mbps connection, versus 11-12 now… @LorenzoBianconi, does this happen to you as well?
By the way, @nbd168, I noticed that MAX-A-MPDU-LEN-EXP is always forced to zero. I'm probably reading the iw phy output wrong, but it seems like MAX-A-MPDU-LEN-EXP3 should be supported. I'm mentioning this in case there's a conflict between mac80211.sh and the way mt76 is reporting capabilities.
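For context, the exponent in the 802.11 HT/VHT capability fields maps to a byte limit of 2^(13 + exp) - 1, so a value forced to 0 would cap an A-MPDU at 8191 bytes, while 3 would allow 65535. A rough sketch of that arithmetic follows; the function name max_ampdu_len is made up for illustration and is not part of mt76, mac80211, or iw:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from mt76 or mac80211): maximum A-MPDU
 * length in bytes for a given MAX-A-MPDU-LEN-EXP value, using the
 * 2^(13 + exp) - 1 rule. HT allows exponents 0..3, VHT allows 0..7. */
uint32_t max_ampdu_len(unsigned int exp)
{
    assert(exp <= 7);
    return (UINT32_C(1) << (13 + exp)) - 1;
}
```

Whether the smaller limit has anything to do with the timeouts is a separate question; this only shows what the reported capability would mean in bytes.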
Please test if the current version still has this issue
Seems better now (0a47c463d2375f372a73fc0ce5b7538fed8fa2bd); no timeouts for roughly fifteen minutes, and bandwidth was back to normal on my Nexus 5, but then it happened again… I uploaded the package here, in case anyone else wants to test. Thanks, @LorenzoBianconi and @nbd168, for working on this!
Update: the timeouts are a lot more random now, and happen more rarely, but still very frustrating. No obvious errors on either router, or clients… I haven't done a proper git bisection, but I went as far back as July 6th (d1a6945d777d667185cb2dcddcf15fd334f80fa0), and timeouts still happen, plus much reduced bandwidth.
Update2: Same observations with 659530a511d8576a156d5338e8f3f4e201344264 (updated package here). To reiterate, local connectivity (existing links, ping, etc) is maintained, but WAN stops working for a few minutes, with no obvious pattern. All goes back to normal when ieee80211_hw_set(hw, AMPDU_AGGREGATION) is disabled. I wonder whether compat-wireless-2015-07-21 has anything to do with it… Also, pings are very consistent on the router, but very erratic over Wi-Fi. Testing compat-wireless-2015-08-03 now.
Problem still not solved
Hey @nbd168, just a crazy idea… Since NAT/TCP seems to be involved somehow, do you think GRO or the generic segmentation offloading might cause this issue with large MPDUs? I'll try to play with ethtool, since nothing else worked so far :-(
I don't think this has anything to do with GRO or similar things, because this is all abstracted away by the network stack. Please try the latest version (committed in OpenWrt trunk r47063), I found some more aggregation related bugs
Thanks again for your tireless efforts, @nbd168. Unfortunately, I'm the bearer of bad news yet again. Things have actually gotten worse after 9e972d5; now, even moderate network loads trigger timeouts. They happen more quickly, and recovery takes longer, or doesn't happen at all. I have a few HT clients, and one VHT (Nexus 5). As before, disabling A-MPDU will fix the issue, but then everything slows down a lot (5-6 Mbps). Here are a few things I noticed so far:
htmode (HT40, VHT40, or VHT80).
I even worried about segmentation/offloading, MTUs (MSS clamping), etc, but it can't be those as you pointed out (the mt7602 radio never has this issue, after all). I wish I knew more about mac80211 and the network stack, but I'm glued to any development here ;-)
Can you use another device to capture all packets in monitor mode before and during the hangs? If so, please make the AP run in HT20 mode to ensure that the monitor mode capture is as reliable as possible.
On VHT20, it takes slightly longer for the connection to break (probably, because it's slower), but here's the raw capture after things go haywire. I also uploaded the file on CloudShark here, if it helps.
The LG STA is a Nexus 5 (supposedly, single stream 11ac). I don't know much about this, but lots of fragmentation errors, malformed packets, etc happening. All in all, not good things…
What kind of device did you use to capture? Also, can you please do another capture in HT20 (not VHT20) mode? That should make capturing data packets (which I need) more reliable.
Data were collected with the builtin card in my Macbook Air (BCM43xx in sniffing/monitor mode). I don't think I have a better setup readily available. At any rate, I switched to HT20 (channel 44), and uploaded more PCAPs in that Box folder.
good file captures the short period when things seem to work (also here on CloudShark).
bad1 file was captured after the link broke following a speed test (also here). Apparently, on HT20, the timeouts seem to recover pretty quickly, and towards the end of the capture, simple browsing started working, albeit not very well.
bad2 file was captured after the link recovered, and I decided to stress it with another speed test, when it breaks again (also here).
I was a bit hasty with my previous comment; those damaged packets were neighborhood noise, and for the purpose of these tests, my only active STA on that channel is the LG Nexus 5 (BCM4339). As you can see, I'm very keen on fixing this ;-) and much appreciate your work.
Progress, I think ;-) I was trying to make sense of the information in those captured packets, and based on my limited understanding, it sounded like power saving might be involved. No luck with UAPSD and WMM as possible culprits, but I noticed the not-so-benign changes in bca9b7c. Since my issues got worse as more BARs were sent, I reverted the relevant commits, and things seem OK now.
Please try the latest git version without the BAR related reverts. It's good to know that the BAR frames are triggering the issue, but I still need to understand why.
It would be really great to understand why, especially since other bugs may lurk, either in this driver, or in the SoC stuff. I always revert everything with your newest commits. Testing https://github.com/openwrt/mt76/commit/d4900fc37f44cd52c4bf1474f0b48a0336ec4e22 on top of https://github.com/openwrt-mirror/openwrt/commit/73edad2df6a018e92420ef4872e1bbd7b9d9d4bb yielded the same timeouts, but I just noticed a couple of interesting RX buffer fixes (e.g., https://github.com/openwrt-mirror/openwrt/commit/73edad2df6a018e92420ef4872e1bbd7b9d9d4bb, https://github.com/openwrt-mirror/openwrt/commit/966bec6badc13cd30bc40298f0e27c68cbd1adb0). I'm currently testing your latest changes on top of https://github.com/openwrt-mirror/openwrt/commit/a6900bd38ef62d8fbf974a8b4be6b11c8b7a73cb.
What still baffles me most is why these timeouts are so much worse when mt7620 does normal routing, versus simple bridging behind another WAN router… Fix https://github.com/openwrt-mirror/openwrt/commit/118b7111459bbe81a9947e82a03cbd47fe1396a9 for mt7621 is intriguing; do you think we have similar issues with mt7620? Also, would @blogic be able to chime in on this very stubborn issue?
P.S. On latest everything, my one 11ac client seems to behave better, but the other 11n clients still suffer periodic timeouts.
Found another bug that would mess up BAR transmissions. Please try trunk r47142 with latest mt76
Seems, dare I say, fixed ;-) I think your latest mac80211 patch (https://github.com/openwrt-mirror/openwrt/commit/bdeb1661d7d4780a7d8a7a9747827beca42b0069) was the keystone to all this craziness, maybe? Either way, do you think it'd be a good idea to consolidate some of the aggregation-related flags? For example, a valid mtxq->agg_ssn implies mtxq->aggr true, or maybe I'm misunderstanding some of these.
Otherwise, just curious, are we doing a lot more in software than other mac80211 drivers, so we need to be more careful about tracking and sending BARs when aggregating frames?
I don't see a good way to consolidate the flags. agg_ssn just tracks the last used sequence number during an aggregation session (we could store it outside of aggregation as well, but it's not needed then). We can't tell from the value whether it is valid or not (0 is a valid value as well), so we can't easily get rid of mtxq->aggr.
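To illustrate why 0 cannot double as a "no session" marker, here is a small sketch. It relies on 802.11 sequence numbers being 12 bits wide, so they wrap from 4095 back to 0. The struct and function names are simplified stand-ins invented for this example, not the real mt76 definitions; only the field names aggr and agg_ssn come from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SEQ_MASK 0x0fff  /* 802.11 sequence numbers are 12 bits wide */

/* Simplified stand-in for the per-queue state, not the real mt76 struct. */
struct txq_state {
    bool aggr;        /* is an aggregation session active? */
    uint16_t agg_ssn; /* last sequence number used in the session */
};

/* Hypothetical helper: begin a session at the given sequence number. */
void txq_start_session(struct txq_state *q, uint16_t ssn)
{
    q->aggr = true;
    q->agg_ssn = ssn & SEQ_MASK;
}

/* Hypothetical helper: account for one more frame; wraps 4095 -> 0. */
void txq_advance(struct txq_state *q)
{
    assert(q->aggr);
    q->agg_ssn = (q->agg_ssn + 1) & SEQ_MASK;
}
```

Starting a session at 4095 and advancing once leaves agg_ssn at 0 with the session still active, so every value in the field is a legitimate sequence number and the separate aggr flag has to carry the "active" information.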
In terms of doing things in software vs hardware, there are two main classes of devices: those having aggregation handling in software and those having it in hardware. With ath9k, the software controls everything related to aggregation: sequence number assignment, forming aggregates, selecting rate retry table for each full aggregate. With ath10k, iwlwifi, etc. the firmware handles all these things, the software only does the protocol handshake.
mt76 is somewhere in between. Sequence number assignment is handled in software, aggregating frames together into A-MPDUs is handled in hardware. This hardware design is actually not very pleasant to deal with, because it makes it necessary for software to deal with all kinds of stuff, yet it does not give the software enough control to do it well. The driver does not get a reliable tx status or aggregation feedback, so it cannot know which frames exactly a client received and what its receiver aggregation window looks like. Because of this, the driver needs to do stupid things like send BAR frames on station PS wakeups.
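As a rough illustration of why the receiver's window matters, the following toy model tracks only the start of a receiver reorder window in the 12-bit sequence space. It is a deliberately simplified sketch with invented names (rx_window, rx_accepts, rx_handle_bar), not code from mt76 or mac80211, and it ignores the window size and buffering entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SEQ_MOD  4096u          /* 12-bit sequence number space */
#define SEQ_HALF (SEQ_MOD / 2)

/* Toy receiver state: only the start of the reorder window. */
struct rx_window {
    uint16_t start;
};

/* True if seq is at or ahead of the window start (modulo 4096);
 * frames behind the start are treated as old and discarded. */
bool rx_accepts(const struct rx_window *w, uint16_t seq)
{
    assert(seq < SEQ_MOD);
    return ((seq - w->start) & (SEQ_MOD - 1)) < SEQ_HALF;
}

/* A BAR carrying starting sequence number ssn moves the window
 * forward; a BAR pointing behind the window is ignored here. */
void rx_handle_bar(struct rx_window *w, uint16_t ssn)
{
    if (rx_accepts(w, ssn))
        w->start = ssn;
}
```

In this model, once a BAR has moved the window start to 200, a frame numbered 150 is discarded by the receiver even if it arrives intact. That shows how a sender that loses track of the receiver's window could end up with its data being dropped; it is an illustration only, not a diagnosis of the bug discussed in this thread.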
In terms of driver complexity, mt76 is a lot simpler than pretty much any comparable driver that allows for software aggregation control. This is made possible by two things:
Either way, I'll mark this ticket as fixed now. Thanks a lot for testing! Feel free to reopen if issues re-occur.
Thanks so much, @nbd168, not only for all the good work, but also for taking the time to explain how things work. This is great information!
Hi @nbd168, thanks for your great work. However, I find mt76 is still not stable for some client devices (i.e., iPhone 6 Plus).
I am using a Lenovo Y1. I have 3 client devices that support 802.11ac. mt76 works very well on my PC (with Intel 7260 AC) and tablet (Nexus 9). However, it does not work well on the iPhone 6 Plus (with iOS 9.0.2). Sometimes all transmissions suddenly time out, while the Wi-Fi is still shown as connected on both the router and the iPhone. Manually turning Wi-Fi off and on again on the iPhone makes it work again. The link can also recover automatically after a few minutes. Please tell me if you need any additional information.
FYI, I am using Openwrt Chaos Calmer 15.05 stable, with the latest version of mt76 0169cab. To make mt76 work in Chaos Calmer, I've reverted d1a6945 and removed the IEEE80211_HW_SUPPORT_FAST_XMIT flag in init.c.
@zb87, the latest fix that I made was in mac80211, not mt76 directly. I have already pushed the relevant fix into the Chaos Calmer Branch and updated mt76 to the latest version there. I have also pushed a hostapd fix that might help with stability on iOS devices. Please try the latest version of the branch as-is to see if it's more stable for you.
I've tried the new Chaos Calmer branch, everything is working well. Thanks for your nice work @nbd168.
When using mt76, the 5 GHz signal breaks down automatically every 5 minutes.