Broken 5G signal connections

qiuzi commented 9 years ago

With mt76, the 5G signal breaks down automatically every 5 minutes.

pepe2k commented 9 years ago

I can confirm that, and not only for 5 GHz. It definitely looks strange: nothing in the log, the connection is up (the client is connected to the AP), but transmission between the router and the client just dies, and ping fails in both directions. After some time, or after a reconnect, everything works again.

Confirmed on SAP-G3200U3.

Is there a way to debug this driver?

Cheers, Piotr

sodz commented 9 years ago

I found that the transmission between my router (with MT7612) and iPhones breaks intermittently. However, the connection between my router and my laptop (Atheros WNIC) is quite stable. What is the model of the wireless NIC in your clients?

When the transmission between my router and my iPhone died, I captured the 802.11 frames transmitted over the air using a third device. The capture showed that both ends could send and receive 802.11 frames properly during that period, and that frames sent by each side were ACK'ed by the other. It looks as if the iPhone is dropping all packets at layer 3 during the outage. I am wondering whether this is a bug in the driver or in the client devices.

jiyifeng commented 9 years ago

My NIC is an Intel 6300AGN. Lenovo newifi (MT7620 inside), same as pepe2k.
Absolutely, it is a bug in the mt76 driver.

pepe2k commented 9 years ago

@sodz, @jiyifeng

Tested/confirmed on:

Galaxy S5
Broadcom BCM94360CD
Intel Dual Band Wireless-AC 7260
TP-Link TL-WDN4800

sodz commented 9 years ago

@pepe2k @jiyifeng I have just run some tests on a Lenovo newifi (mt7612e, latest mt76 driver) with two clients: a MacBook with bcm43xx, and a tablet with a Marvell Avastar WNIC. On both clients, iperf and ping tests were carried out simultaneously for about 1 hour. The packet loss rate turned out to be negligible, and there were no noticeable outages.
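
For anyone repeating this kind of test: an average loss rate and the longest continuous outage are separate measurements, and a short stall can hide inside a low average. The sketch below assumes the ping results have already been collected as a list of booleans, one per probe. The function name and the one-second probe interval are illustrative assumptions, not something used in this thread.

```python
from typing import List, Tuple


def summarize_pings(replies: List[bool], interval_s: float = 1.0) -> Tuple[float, float]:
    """Return (loss_rate, longest_outage_seconds) for a sequence of ping results.

    replies[i] is True if the i-th echo request was answered.
    interval_s is the assumed spacing between probes.
    """
    if not replies:
        return 0.0, 0.0

    lost = replies.count(False)
    longest = current = 0
    for ok in replies:
        # Count consecutive losses; any reply resets the run.
        current = 0 if ok else current + 1
        longest = max(longest, current)

    return lost / len(replies), longest * interval_s


if __name__ == "__main__":
    # Ten probes: one run of three lost probes plus one isolated loss.
    sample = [True, True, False, False, False, True, True, False, True, True]
    loss, outage = summarize_pings(sample)
    print(f"loss={loss:.0%} longest_outage={outage:.0f}s")
```

Reporting both numbers makes it easier to compare a synthetic iperf/ping run with what users see during real-world traffic.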

pepe2k commented 9 years ago

@sodz

I will do deeper tests later; at the moment I'm using the drivers from MTK, without any problems.

qiuzi commented 9 years ago

2.4g no problem same as @sodz but 5g same as @pepe2k

sodz commented 9 years ago

@pepe2k ok, now I can confirm that with real-world traffic, the transmission dies randomly. no idea why it did not happen when i tested using iperf.

sodz commented 9 years ago

I think I have traced this issue to A-MPDU. After disabling it (by commenting out line 795 of init.c, "/* ieee80211_hw_set(hw, AMPDU_AGGREGATION); */"), the communication becomes stable, although at a performance loss. @pepe2k could you please try this workaround and see whether you can confirm that this is related to A-MPDU?

sodz commented 9 years ago

Here are some packets I captured, just when the transmission died: https://www.cloudshark.org/captures/93482862e765 Apparently there are anomalies, but I am not familiar with 802.11 MAC and I don't know what actually went wrong.

nbd168 commented 9 years ago

please try the latest version

sodz commented 9 years ago

@nbd168 just tried. but it didn't fix this issue.

qiuzi commented 9 years ago

@sodz Problem solved?

airend commented 9 years ago

Thanks for your suggestion, @sodz; I think it's indeed related to frame aggregation. At least, after commenting https://github.com/openwrt/mt76/blob/17c5b83ee789639605f8a38c38efe8530ccf6b30/init.c#L813, clients don't have intermittent timeouts when connecting to WAN hosts. The odd part is that LAN connections seemed fine, and all WAN connections were OK on the router, so for a long time I assumed it's some sort of bridging problem. Then again, the 2.4 GHz link, using a different driver/radio was always fine…

Another observation is that these timeouts get much worse when increasing channel width (VHT20->40->80), and are more correlated to throughput and amount of data transferred. Maybe these increase the probability of some sort of buggy frame aggregation event. Either way, things are OK after disabling this feature, and performance doesn't seem to have taken a significant hit.

Update: performance does take a major hit with 11ac/mobile clients (maybe because all 11ac frames are supposed to be A-MPDUs?). I used to be able to saturate a 30 Mbps connection, versus 11-12 now… @LorenzoBianconi, does this happen to you as well?

By the way, @nbd168, I noticed that MAX-A-MPDU-LEN-EXP is always forced to zero. I'm probably reading the iw phy output wrong, but it seems like MAX-A-MPDU-LEN-EXP3 should be supported. I'm mentioning this in case there's a conflict between mac80211.sh and the way mt76 is reporting capabilities.
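
For context on what the exponent means: in 802.11n/ac, the Maximum A-MPDU Length Exponent advertises the largest A-MPDU a station can receive as 2^(13 + exponent) − 1 octets, with exponents 0 to 3 defined for HT and 0 to 7 for VHT. The small helper below just evaluates that formula; the function name is made up for illustration and is not part of iw or mt76.

```python
def max_ampdu_len(exponent: int, vht: bool = True) -> int:
    """Maximum A-MPDU length in octets for a given length exponent.

    HT capabilities allow exponents 0..3, VHT capabilities allow 0..7.
    """
    limit = 7 if vht else 3
    if not 0 <= exponent <= limit:
        raise ValueError(f"exponent must be in 0..{limit}")
    return 2 ** (13 + exponent) - 1


if __name__ == "__main__":
    for exp in (0, 3, 7):
        print(f"exponent {exp}: {max_ampdu_len(exp)} octets")
```

So an exponent forced to zero caps aggregates at 8191 octets, while exponent 3 would allow 65535.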

nbd168 commented 9 years ago

Please test if the current version still has this issue

airend commented 9 years ago

Seems better now (0a47c463d2375f372a73fc0ce5b7538fed8fa2bd); no timeouts for roughly fifteen minutes, and bandwidth was back to normal on my Nexus 5, but then it happened again… I uploaded the package here, in case anyone else wants to test. Thanks, @LorenzoBianconi and @nbd168, for working on this!

Update: the timeouts are a lot more random now, and happen more rarely, but still very frustrating. No obvious errors on either router, or clients… I haven't done a proper git bisection, but I went as far back as July 6th (d1a6945d777d667185cb2dcddcf15fd334f80fa0), and timeouts still happen, plus much reduced bandwidth.

Update2: Same observations with 659530a511d8576a156d5338e8f3f4e201344264 (updated package here). To reiterate, local connectivity (existing links, ping, etc) is maintained, but WAN stops working for a few minutes, with no obvious pattern. All goes back to normal when ieee80211_hw_set(hw, AMPDU_AGGREGATION) is disabled. I wonder whether compat-wireless-2015-07-21 has anything to do with it… Also, pings are very consistent on the router, but very erratic over Wi-Fi. Testing compat-wireless-2015-08-03 now.

qiuzi commented 9 years ago

Problem still not solved

airend commented 9 years ago

Hey @nbd168, just a crazy idea… Since NAT/TCP seems to be involved somehow, do you think GRO or the generic segmentation offloading might cause this issue with large MPDUs? I'll try to play with ethtool, since nothing else worked so far :-(

nbd168 commented 8 years ago

I don't think this has anything to do with GRO or similar things, because this is all abstracted away by the network stack. Please try the latest version (committed in OpenWrt trunk r47063), I found some more aggregation related bugs

airend commented 8 years ago

Thanks again for your tireless efforts, @nbd168. Unfortunately, I'm the bearer of bad news yet again. Things have actually gotten worse after 9e972d5; now, even moderate network loads trigger timeouts. They happen more quickly, and recovery takes longer, or doesn't happen at all. I have a few HT clients, and one VHT (Nexus 5). As before, disabling A-MPDU will fix the issue, but then everything slows down a lot (5-6 Mbps). Here are a few things I noticed so far:

Only WAN connections time out, which is the weirdest part; the SSH link to the router is always OK, and there are no logged errors whatsoever.
Every combination of software behaves more or less the same: compat-wireless 09-16, hostapd 2.5 just released, etc (ditto latest stable/Chaos Calmer).
When disabling A-MPDU, things have gotten worse after 08-28 (b6de6a0), although that probably doesn't matter since aggregation is a core feature.
Things are much better with a dumb AP setup, behind a Linksys E1500 router. An HT40 client (Ralink) works quite well now, while the Nexus 5 still times out. This happens regardless of htmode (HT40, VHT40, or VHT80).

I even worried about segmentation/offloading, MTUs (MSS clamping), etc, but it can't be those as you pointed out (the mt7602 radio never has this issue, after all). I wish I knew more about mac80211 and the network stack, but I'm glued to any development here ;-)

nbd168 commented 8 years ago

Can you use another device to capture all packets in monitor mode before and during the hangs? If so, please make the AP run in HT20 mode to ensure that the monitor mode capture is as reliable as possible.

airend commented 8 years ago

On VHT20, it takes slightly longer for the connection to break (probably, because it's slower), but here's the raw capture after things go haywire. I also uploaded the file on CloudShark here, if it helps.

The LG STA is a Nexus 5 (supposedly, single stream 11ac). I don't know much about this, but there are lots of fragmentation errors, malformed packets, etc. All in all, not good things…

nbd168 commented 8 years ago

What kind of device did you use to capture? Also, can you please do another capture in HT20 (not VHT20) mode? That should make capturing data packets (which I need) more reliable.

airend commented 8 years ago

Data were collected with the builtin card in my Macbook Air (BCM43xx in sniffing/monitor mode). I don't think I have a better setup readily available. At any rate, I switched to HT20 (channel 44), and uploaded more PCAPs in that Box folder.

The good file captures the short period when things seem to work (also here on CloudShark).
The bad1 file was captured after the link broke following a speed test (also here). Apparently, on HT20, the timeouts seem to recover pretty quickly, and towards the end of the capture, simple browsing started working, albeit not very well.
The bad2 file was captured after the link recovered, and I decided to stress it with another speed test, when it breaks again (also here).

I was a bit hasty with my previous comment; those damaged packets were neighborhood noise, and for the purpose of these tests, my only active STA on that channel is the LG Nexus 5 (BCM4339). As you can see, I'm very keen on fixing this ;-) and much appreciate your work.

airend commented 8 years ago

Progress, I think ;-) I was trying to make sense of the information in those captured packets, and based on my limited understanding, it sounded like power saving might be involved. No luck with UAPSD and WMM as possible culprits, but I noticed the not-so-benign changes in bca9b7c. Since my issues got worse as more BARs were sent, I reverted the relevant commits, and things seem OK now.

nbd168 commented 8 years ago

Please try the latest git version without the BAR related reverts. It's good to know that the BAR frames are triggering the issue, but I still need to understand why.

airend commented 8 years ago

It would be really great to understand why, especially since other bugs may lurk, either in this driver or in the SoC stuff. I always revert everything with your newest commits. Testing https://github.com/openwrt/mt76/commit/d4900fc37f44cd52c4bf1474f0b48a0336ec4e22 on top of https://github.com/openwrt-mirror/openwrt/commit/73edad2df6a018e92420ef4872e1bbd7b9d9d4bb yielded the same timeouts, but I just noticed a couple of interesting RX buffer fixes (e.g., https://github.com/openwrt-mirror/openwrt/commit/73edad2df6a018e92420ef4872e1bbd7b9d9d4bb, https://github.com/openwrt-mirror/openwrt/commit/966bec6badc13cd30bc40298f0e27c68cbd1adb0). I'm currently testing your latest changes on top of https://github.com/openwrt-mirror/openwrt/commit/a6900bd38ef62d8fbf974a8b4be6b11c8b7a73cb.

What still baffles me most is why these timeouts are so much worse when mt7620 does normal routing, versus simple bridging behind another WAN router… Fix https://github.com/openwrt-mirror/openwrt/commit/118b7111459bbe81a9947e82a03cbd47fe1396a9 for mt7621 is intriguing; do you think we have similar issues with mt7620? Also, would @blogic be able to chime in on this very stubborn issue?

P.S. On latest everything, my one 11ac client seems to behave better, but the other 11n clients still suffer periodic timeouts.

nbd168 commented 8 years ago

Found another bug that would mess up BAR transmissions. Please try trunk r47142 with latest mt76

airend commented 8 years ago

Seems, dare I say, fixed ;-) I think your latest mac80211 patch (https://github.com/openwrt-mirror/openwrt/commit/bdeb1661d7d4780a7d8a7a9747827beca42b0069) was the keystone to all this craziness, maybe? Either way, do you think it'd be a good idea to consolidate some of the aggregation-related flags? For example, a valid mtxq->agg_ssn implies mtxq->aggr true, or maybe I'm misunderstanding some of these.

Otherwise, just curious, are we doing a lot more in software than other mac80211 drivers, so we need to be more careful about tracking and sending BARs when aggregating frames?

nbd168 commented 8 years ago

I don't see a good way to consolidate the flags. agg_ssn just tracks the last used sequence number during an aggregation session (we could store it outside of aggregation as well, but it's not needed then). We can't tell from the value whether it is valid or not (0 is a valid value as well), so we can't easily get rid of mtxq->aggr.
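
The point about 0 being a valid value can be shown with a small model. This is a simplified Python sketch, not the mt76 code: the field names mirror mtxq->aggr and mtxq->agg_ssn from the discussion, while the class and method names are hypothetical. The only protocol fact it relies on is that 802.11 sequence numbers are 12 bits wide and wrap at 4096.

```python
from dataclasses import dataclass

SEQ_MODULO = 4096  # 802.11 sequence numbers are 12 bits wide


@dataclass
class TxQueueState:
    aggr: bool = False  # is an aggregation session active?
    agg_ssn: int = 0    # last sequence number used during the session

    def start_session(self, ssn: int) -> None:
        self.aggr = True
        self.agg_ssn = ssn % SEQ_MODULO

    def stop_session(self) -> None:
        self.aggr = False

    def next_seq(self) -> int:
        # Sequence numbers wrap around, so 0 legitimately follows 4095.
        self.agg_ssn = (self.agg_ssn + 1) % SEQ_MODULO
        return self.agg_ssn


if __name__ == "__main__":
    idle = TxQueueState()
    active = TxQueueState()
    active.start_session(0)
    # Same agg_ssn, different meaning: only the flag tells them apart.
    print(idle.agg_ssn == active.agg_ssn, idle.aggr, active.aggr)

    wrapping = TxQueueState()
    wrapping.start_session(4095)
    print(wrapping.next_seq())
```

Because every value in 0..4095 can occur during a live session, there is no spare value to act as an "invalid" marker, which is why the separate flag stays.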

In terms of doing things in software vs hardware, there are two main classes of devices: those having aggregation handling in software and those having it in hardware. With ath9k, the software controls everything related to aggregation: sequence number assignment, forming aggregates, selecting rate retry table for each full aggregate. With ath10k, iwlwifi, etc. the firmware handles all these things, the software only does the protocol handshake.

mt76 is somewhere in between. Sequence number assignment is handled in software, aggregating frames together into A-MPDUs is handled in hardware. This hardware design is actually not very pleasant to deal with, because it makes it necessary for software to deal with all kinds of stuff, yet it does not give the software enough control to do it well. The driver does not get a reliable tx status or aggregation feedback, so it cannot know which frames exactly a client received and what its receiver aggregation window looks like. Because of this, the driver needs to do stupid things like send BAR frames on station PS wakeups.
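
To illustrate why the receiver's window matters, here is a toy model of a receive-side reorder buffer. It is a deliberately simplified sketch, not the mac80211 or mt76 implementation: the class name, the window size of 64, and the absence of any reorder timeout are all assumptions made for the example. The receiver delivers frames in sequence order, holds anything that arrives after a gap, and only moves on when the gap is filled or a BAR (Block Ack Request) tells it to advance its window start.

```python
SEQ_MODULO = 4096  # 12-bit 802.11 sequence number space


class ReorderBuffer:
    def __init__(self, start: int = 0, size: int = 64):
        self.start = start   # next sequence number expected in order
        self.size = size     # assumed reorder window size
        self.held = {}       # seq -> payload waiting behind a gap
        self.delivered = []  # payloads passed up the stack, in order

    def _flush_in_order(self) -> None:
        while self.start in self.held:
            self.delivered.append(self.held.pop(self.start))
            self.start = (self.start + 1) % SEQ_MODULO

    def _advance_to(self, new_start: int) -> None:
        # Give up on missing frames below new_start, deliver what is held.
        while self.start != new_start:
            if self.start in self.held:
                self.delivered.append(self.held.pop(self.start))
            self.start = (self.start + 1) % SEQ_MODULO
        self._flush_in_order()

    def receive(self, seq: int, payload: str) -> None:
        offset = (seq - self.start) % SEQ_MODULO
        if offset >= SEQ_MODULO // 2:
            return  # behind the window: treated as old and dropped
        if offset >= self.size:
            # Past the window end: slide so seq becomes the last slot.
            self._advance_to((seq - self.size + 1) % SEQ_MODULO)
        self.held[seq] = payload
        self._flush_in_order()

    def bar(self, ssn: int) -> None:
        offset = (ssn - self.start) % SEQ_MODULO
        if 0 < offset < SEQ_MODULO // 2:
            self._advance_to(ssn)


if __name__ == "__main__":
    rb = ReorderBuffer()
    for seq in (0, 2, 3):          # frame 1 was lost and is never resent
        rb.receive(seq, f"f{seq}")
    print(rb.delivered, sorted(rb.held))
    rb.bar(2)                      # sender gives up on frame 1
    print(rb.delivered, sorted(rb.held))
```

In this model, frames 2 and 3 sit in the buffer until the BAR arrives. A sender without reliable feedback about what the client received has to guess when such a BAR is needed and which starting sequence number to put in it.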

In terms of driver complexity, mt76 is a lot simpler than pretty much any comparable driver that allows for software aggregation control. This is made possible by two things:

I added a layer of abstraction that allows mac80211 to control per-station per-TID queues that the driver can pull from, reducing driver complexity, and massively reducing bufferbloat in the driver.
I wrote mt76 completely from scratch, free from all the insanities of typical vendor written code :)
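
The first point, queues that the driver pulls from, can be pictured with a small model. This is only an illustrative Python sketch with hypothetical names and a plain round-robin policy; it is not the mac80211 API or its actual scheduling. Frames wait in per-(station, TID) queues owned by the upper layer, and the driver takes only as many as the hardware has room for.

```python
from collections import OrderedDict, deque


class TxqScheduler:
    def __init__(self):
        # (station, tid) -> deque of frames, kept in service order
        self.queues = OrderedDict()

    def enqueue(self, station: str, tid: int, frame: str) -> None:
        self.queues.setdefault((station, tid), deque()).append(frame)

    def pull(self, budget: int):
        """Driver side: take up to `budget` frames, round-robin across queues."""
        out = []
        while len(out) < budget and self.queues:
            key, queue = next(iter(self.queues.items()))
            out.append((key, queue.popleft()))
            del self.queues[key]
            if queue:
                self.queues[key] = queue  # non-empty queue goes to the back
        return out


if __name__ == "__main__":
    sched = TxqScheduler()
    for frame in ("a1", "a2", "a3"):
        sched.enqueue("sta_a", 0, frame)
    sched.enqueue("sta_b", 0, "b1")
    # Hardware has room for three frames right now.
    print(sched.pull(3))
    print(sched.pull(3))
```

Since frames stay in the upper-layer queues until they are pulled, the driver itself never holds a deep backlog, and one busy station cannot starve the others.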

Either way, I'll mark this ticket as fixed now. Thanks a lot for testing! Feel free to reopen if issues re-occur.

airend commented 8 years ago

Thanks so much, @nbd168, not only for all the good work, but also for taking the time to explain how things work. This is great information!

zb87 commented 8 years ago

Hi @nbd168, thanks for your great work. However, I find mt76 is still not stable for some client devices (i.e., iPhone 6 Plus).

I am using a Lenovo Y1. I have 3 client devices that support 802.11ac. mt76 works very well with my PC (Intel 7260 AC) and my tablet (Nexus 9). However, it does not work well with the iPhone 6 Plus (iOS 9.0.2). Sometimes all transmissions suddenly time out, while the Wi-Fi is still shown as connected on both the router and the iPhone. Manually turning Wi-Fi off and on again on the iPhone makes it work again. The link can also recover automatically after a few minutes. Please tell me if you need any additional information.

FYI, I am using OpenWrt Chaos Calmer 15.05 stable, with the latest version of mt76 (0169cab). To make mt76 work in Chaos Calmer, I've reverted d1a6945 and removed the IEEE80211_HW_SUPPORT_FAST_XMIT flag in init.c.

nbd168 commented 8 years ago

@zb87, the latest fix that I made was in mac80211, not mt76 directly. I have already pushed the relevant fix into the Chaos Calmer Branch and updated mt76 to the latest version there. I have also pushed a hostapd fix that might help with stability on iOS devices. Please try the latest version of the branch as-is to see if it's more stable for you.

zb87 commented 8 years ago

I've tried the new Chaos Calmer branch, everything is working well. Thanks for your nice work @nbd168.

openwrt / mt76

Broken 5G signal connections #10