openwrt / mt76

mac80211 driver for MediaTek MT76x0e, MT76x2e, MT7603, MT7615, MT7628 and MT7688

mt7915e - iphone client connection freeze/hang, possibly roaming related. #773

Closed yogo1212 closed 1 year ago

yogo1212 commented 1 year ago

Setup

1x iphone 11 (ios 16.1.2) client, 2x cudy m1800 access points connected via cable or mesh (same behaviour).

Both access points are running kernel version 5.15.106 and c32d6d849c43792abd8007e13e468b12d6d6e0b7 but the issue was present with previous versions of openwrt and mt76 as well. The config is identical for both devices (details further down). The SSID is in the lan network (which optionally has a batman mesh interface). The lan interface is a bridge with all ethernet ports in it (no DHCP server enabled). The gateway is on another router.

That setup is a reproduction. The error was originally reported for an iphone 12 pro max (ios 16.4.1).

Reproduction and Observations

After moving around between the two access points with the iphone, there will be 4 seemingly-active associations. That's difficult for testing, so I've disabled 2.4 GHz on both APs and the rest of the issue is about just two associations. Both associations' 'inactive times' stay mostly under 200ms. The iphone appears to pick one of these associations to send traffic through. So far, that's a little weird but it should not be a problem per se. The tx-queues are properly emptied on both routers. The output of iw looks decent (examples further down).

Now, the problem is: While this is going on, the iphone has little to no connectivity. The ping success rate from the device to the lan network correlates strongly with that of pings in the other direction (i.e. they both work, or both fail, at roughly the same time). Many pings are lost (ranging from 20% to 100%). At times, there is enough connectivity to open a web page.
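To put a number on "correlates strongly", the per-ping outcomes in both directions can be compared interval by interval. This is only a sketch: the two result lists are made-up sample data (True = reply received), not measurements from this setup, and the function names are hypothetical.

```python
def loss_rate(results):
    """Fraction of pings that got no reply."""
    if not results:
        raise ValueError("no ping results")
    return 1 - sum(results) / len(results)


def agreement(a, b):
    """Fraction of intervals in which both directions had the same outcome."""
    if not a or len(a) != len(b):
        raise ValueError("need two non-empty series of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sample: one ping per second in each direction.
from_phone = [True, False, False, True, False, False, False, True]
to_phone = [True, False, False, True, False, True, False, True]

print(loss_rate(from_phone))            # 0.625
print(agreement(from_phone, to_phone))  # 0.875
```

An agreement close to 1.0 together with a high loss rate would match the observation that both directions fail at roughly the same time.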

Considerations

Interestingly, the iphone was able to receive notifications while there were no pings. Or at least it appears that it can receive exactly one notification during each connection freeze. This leads me to assume that only the sending direction (from the iphone) is affected and packets towards the iphone reach it regardless of which access point sends them. (Rx-queues are always ready).

There's also the idea that the iphone switches between the associations so fast that the switch refuses to update its MAC address (forwarding) table and/or drops the packets.
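One way to check the flapping idea would be to poll the bridge forwarding database and count how often the client's MAC moves between ports. The sketch below assumes the snapshots come from `bridge fdb show` (iproute2) and that each entry starts with `<mac> dev <port>`; the port name `lan1` and the helper names are hypothetical.

```python
import re

# Assumed entry layout: "<mac> dev <port> ..." at the start of each line.
FDB_RE = re.compile(r"^(?P<mac>[0-9a-f:]{17}) dev (?P<port>\S+)",
                    re.MULTILINE | re.IGNORECASE)


def learned_port(fdb_output, mac):
    """Return the first port the MAC is listed on, or None if it is absent."""
    for m in FDB_RE.finditer(fdb_output):
        if m["mac"].lower() == mac.lower():
            return m["port"]
    return None


def count_port_changes(snapshots, mac):
    """Count how often the MAC moves between ports across successive snapshots."""
    ports = [learned_port(s, mac) for s in snapshots]
    ports = [p for p in ports if p is not None]
    return sum(1 for a, b in zip(ports, ports[1:]) if a != b)


# Hypothetical snapshots taken a moment apart on one AP.
snapshots = [
    "92:2d:5e:d9:23:f5 dev w5_0-13d9 master br-lan",
    "92:2d:5e:d9:23:f5 dev lan1 master br-lan",
    "92:2d:5e:d9:23:f5 dev w5_0-13d9 master br-lan",
]
print(count_port_changes(snapshots, "92:2d:5e:d9:23:f5"))  # 2
```

A count that keeps climbing while the client sits still would point towards the MAC bouncing between the wifi interface and the wired/mesh side.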

There are no multiple associations with a 2014 lenovo laptop or with an older samsung phone and there's no error there either. That's how the iphone got its position in the issue title.

Out of desperation, I tried @ptpt52's patch from https://github.com/openwrt/mt76/issues/518 just in case it was a queueing error. But it didn't help.

I can't tell whether this would happen with one access point (and two radios) because it's hard to tell the iphone to switch to the other association (so I can't tell whether packets reach the physical network better than the software bridge).

Possibly related to: https://github.com/openwrt/mt76/issues/672

Logs + Config

Here's iw showing both associations active on the two access points:

Station 92:2d:5e:d9:23:f5 (on w5_0-13d9)
        inactive time:  60 ms

at the same time, on the other router: (calling iw .. station get in a loop)

Station 92:2d:5e:d9:23:f5 (on w5_0-13d9)
        inactive time:  50 ms

inactive time stays low on both routers. No idea if this is the new intended roaming behaviour or whether the iphone isn't happy enough with the new connection and lingers while it's making up its mind.
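The "iw in a loop" check can be scripted. A minimal sketch, assuming the text comes from `iw dev w5_0-13d9 station get 92:2d:5e:d9:23:f5` run on each AP; the 200 ms threshold is the figure mentioned above, and the helper names are hypothetical.

```python
import re
from typing import Optional

# Matches lines such as "        inactive time:  60 ms".
INACTIVE_RE = re.compile(r"^\s*inactive time:\s*(\d+)\s*ms\s*$", re.MULTILINE)


def inactive_ms(iw_output: str) -> Optional[int]:
    """Return the 'inactive time' in milliseconds, or None if the field is missing."""
    match = INACTIVE_RE.search(iw_output)
    return int(match.group(1)) if match else None


def both_active(output_a: str, output_b: str, threshold_ms: int = 200) -> bool:
    """True if the station looks active (below the threshold) on both APs."""
    times = (inactive_ms(output_a), inactive_ms(output_b))
    return all(t is not None and t < threshold_ms for t in times)


# The two trimmed outputs shown above.
ap1 = "Station 92:2d:5e:d9:23:f5 (on w5_0-13d9)\n        inactive time:  60 ms\n"
ap2 = "Station 92:2d:5e:d9:23:f5 (on w5_0-13d9)\n        inactive time:  50 ms\n"
print(inactive_ms(ap1), inactive_ms(ap2), both_active(ap1, ap2))  # 60 50 True
```

Logging `both_active` once a second alongside the ping results would show whether the freezes line up with the double association.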

Here is the untrimmed output of iw for completeness:

Station 92:2d:5e:d9:23:f5 (on w5_0-13d9)
        inactive time:  60 ms
        rx bytes:       108046
        rx packets:     1977
        tx bytes:       157614
        tx packets:     540
        tx retries:     0
        tx failed:      5
        rx drop misc:   19
        signal:         -75 [-77, -79] dBm
        signal avg:     -74 [-76, -78] dBm
        tx bitrate:     306.2 MBit/s 80MHz HE-MCS 6 HE-NSS 1 HE-GI 1 HE-DCM 0
        tx duration:    9588287 us
        rx bitrate:     24.0 MBit/s
        rx duration:    129473 us
        last ack signal:-75 dBm
        avg ack signal: -75 dBm
        airtime weight: 256
        authorized:     yes
        authenticated:  yes
        associated:     yes
        preamble:       long
        WMM/WME:        yes
        MFP:            yes
        TDLS peer:      no
        DTIM period:    2
        beacon interval:100
        short slot time:yes
        connected time: 154 seconds
        associated at [boottime]:       19315.997s
        associated at:  1682181990934 ms
        current time:   1682182144926 ms

Station 92:2d:5e:d9:23:f5 (on w5_0-13d9)
        inactive time:  50 ms
        rx bytes:       495005
        rx packets:     11869
        tx bytes:       439062
        tx packets:     3936
        tx retries:     0
        tx failed:      14
        rx drop misc:   196
        signal:         -64 [-73, -64] dBm
        signal avg:     -64 [-73, -64] dBm
        tx bitrate:     6.0 MBit/s
        tx duration:    3172443 us
        rx bitrate:     24.0 MBit/s 80MHz
        rx duration:    1269901 us
        last ack signal:-67 dBm
        avg ack signal: -66 dBm
        airtime weight: 256
        authorized:     yes
        authenticated:  yes
        associated:     yes
        preamble:       long
        WMM/WME:        yes
        MFP:            yes
        TDLS peer:      no
        DTIM period:    2
        beacon interval:100
        short slot time:yes
        connected time: 391 seconds
        associated at [boottime]:       25422.450s
        associated at:  1682181754354 ms
        current time:   1682182145790 ms

This is what hostapd has to say:

Sat Apr 22 17:00:02 2023 daemon.notice hostapd: w5_0-13d9: Prune association for 92:2d:5e:d9:23:f5
Sat Apr 22 17:00:02 2023 daemon.notice hostapd: w5_0-13d9: AP-STA-DISCONNECTED 92:2d:5e:d9:23:f5
Sat Apr 22 17:00:32 2023 daemon.info hostapd: w5_0-13d9: STA 92:2d:5e:d9:23:f5 IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)
Sat Apr 22 17:00:34 2023 daemon.info hostapd: w5_0-13d9: STA 92:2d:5e:d9:23:f5 IEEE 802.11: associated (aid 1)
Sat Apr 22 17:00:34 2023 daemon.notice hostapd: w5_0-13d9: AP-STA-CONNECTED 92:2d:5e:d9:23:f5 auth_alg=ft
Sat Apr 22 17:00:40 2023 daemon.notice hostapd: w5_0-13d9: Prune association for 92:2d:5e:d9:23:f5
Sat Apr 22 17:00:40 2023 daemon.notice hostapd: w5_0-13d9: AP-STA-DISCONNECTED 92:2d:5e:d9:23:f5
Sat Apr 22 17:01:10 2023 daemon.info hostapd: w5_0-13d9: STA 92:2d:5e:d9:23:f5 IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)
Sat Apr 22 17:01:42 2023 daemon.info hostapd: w5_0-13d9: STA 92:2d:5e:d9:23:f5 IEEE 802.11: associated (aid 1)
Sat Apr 22 17:01:42 2023 daemon.notice hostapd: w5_0-13d9: AP-STA-CONNECTED 92:2d:5e:d9:23:f5 auth_alg=ft
Sat Apr 22 17:01:47 2023 daemon.info hostapd: w5_0-13d9: STA 92:2d:5e:d9:23:f5 IEEE 802.11: authenticated
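To see how long the station was gone from this AP's point of view, the AP-STA-DISCONNECTED / AP-STA-CONNECTED pairs can be turned into gaps. This is a sketch that assumes the log line layout shown above and English day/month names; `offline_gaps` is a hypothetical helper.

```python
import re
from datetime import datetime

EVENT_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}) .*hostapd: "
    r"(?P<iface>\S+): AP-STA-(?P<kind>CONNECTED|DISCONNECTED) (?P<sta>[0-9a-f:]{17})"
)


def offline_gaps(log_text):
    """Seconds between each AP-STA-DISCONNECTED and the next AP-STA-CONNECTED
    for the same (interface, station) pair."""
    last_disconnect = {}
    gaps = []
    for line in log_text.splitlines():
        m = EVENT_RE.match(line.strip())
        if not m:
            continue  # not a connect/disconnect event
        when = datetime.strptime(m["ts"], "%a %b %d %H:%M:%S %Y")
        key = (m["iface"], m["sta"])
        if m["kind"] == "DISCONNECTED":
            last_disconnect[key] = when
        elif key in last_disconnect:
            gaps.append((when - last_disconnect.pop(key)).total_seconds())
    return gaps


log = """\
Sat Apr 22 17:00:02 2023 daemon.notice hostapd: w5_0-13d9: AP-STA-DISCONNECTED 92:2d:5e:d9:23:f5
Sat Apr 22 17:00:34 2023 daemon.notice hostapd: w5_0-13d9: AP-STA-CONNECTED 92:2d:5e:d9:23:f5 auth_alg=ft
Sat Apr 22 17:00:40 2023 daemon.notice hostapd: w5_0-13d9: AP-STA-DISCONNECTED 92:2d:5e:d9:23:f5
Sat Apr 22 17:01:42 2023 daemon.notice hostapd: w5_0-13d9: AP-STA-CONNECTED 92:2d:5e:d9:23:f5 auth_alg=ft
"""
print(offline_gaps(log))  # [32.0, 62.0]
```

For the excerpt above this gives gaps of 32 and 62 seconds, even though iw on this AP kept reporting a low inactive time.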

wireless config:

config wifi-device 'radio5_0'
        option type 'mac80211'
        option path '1e140000.pcie/pci0000:00/0000:00:01.0/0000:02:00.0+1'
        option channel '36'
        option band '5g'
        option htmode 'HE80'
        option country 'DE'

config wifi-iface 'w5_0_13d9'
        option mode 'ap'
        option ifname 'w5_0-13d9'
        option device 'radio5_0'
        option network 'lan'
        option encryption 'psk2'
        option isolate '0'
        option hidden '0'
        option key 'chocolate'
        option ieee80211r '1'
        option ieee80211w '1'
        option ssid 'test ssid'

if anyone has any idea, i'm willing to test it!

Brain2000 commented 1 year ago

I have noticed as well that lots of iphone station adds/removes end up causing a crash much more readily than android. My gut reaction tells me that it has something to do with multicast, mDNS (Bonjour) in particular, since I can crash openwrt in seconds with a multicast DoS attack.

I've been trying to track down all the crashing bugs in openwrt for the last couple of months. Through the help of others we've already gotten several fixes into the code, and I feel there are still at least 2 or 3 to go.

The answer may lie in the fact that the net/bridge code (which deals with adding/removing stations/vlans/etc...) is an older version of the code and probably needs to be updated to a newer version. The functions have changed as compared to what is currently in OpenWRT.

yogo1212 commented 1 year ago

Hi @Brain2000

> The answer may lie in the fact that the net/bridge code (which deals with adding/removing stations/vlans/etc...) is an older version of the code and probably needs to be updated to a newer version. The functions have changed as compared to what is currently in OpenWRT.

The stuff in https://github.com/openwrt/openwrt/tree/master/target/linux/generic/files/drivers doesn't apply here. The phy+switch are DSA. Other than that, I don't know what you could mean. Could you point to the old code you've mentioned?

I'm far from having ruled out other sources of problems like batman or 802.11ax. But multicasts are unlikely to be the root cause here, imho. I assume they would be more peripheral. It could be something to do with the rate at which multicasts are being sent (going by your idea of a 'multicast DoS attack'). An iphone 7 roams happily, though. DHCP tends to go through fine as well (same gateway, same client address) while roaming.

So to reduce multicast traffic between the access points, I created an additional SSID in a new network where each access point is a DHCP server + gateway (with the same IP). This worked a lot better! But multicasts aren't the only thing that's changed.

Hmm..

yogo1212 commented 1 year ago

Following the line-of-thought from the previous comment based on @Brain2000 's input and with some more testing, I've reached the conclusion that the error only occurs if the SSIDs on both routers are connected via layer 2. Yes, everything's pretty smooth without HE/802.11ax but that's likely more a trigger than a cause. Atm, there's no good reason to assume the wifi connection itself isn't alright.

No idea where this issue belongs now. Openwrt, Linux, Batman?

Regardless: closing this.

Brain2000 commented 1 year ago

> Hi @Brain2000
>
> > The answer may lie in the fact that the net/bridge code (which deals with adding/removing stations/vlans/etc...) is an older version of the code and probably needs to be updated to a newer version. The functions have changed as compared to what is currently in OpenWRT.
>
> The stuff in https://github.com/openwrt/openwrt/tree/master/target/linux/generic/files/drivers doesn't apply here. The phy+switch are DSA. Other than that, I don't know what you could mean. Could you point to the old code you've mentioned?
>
> I'm far from having ruled out other sources of problems like batman or 802.11ax. But multicasts are unlikely to be the root cause here, imho. I assume they would be more peripheral. It could be something to do with the rate at which multicasts are being sent (going by your idea of a 'multicast DoS attack'). An iphone 7 roams happily, though. DHCP tends to go through fine as well (same gateway, same client address) while roaming.
>
> So to reduce multicast traffic between the access points, I created an additional SSID in a new network where each access point is a DHCP server + gateway (with the same IP). This worked a lot better! But multicasts aren't the only thing that's changed.
>
> Hmm..

There's an important element to the multicast DoS which links these together. The DoS only happens if, while multicast packets are being sent, airplane mode is turned on or the station roams. I've observed crashes in 22.x.x for months, and it happens almost exclusively when there are iphones roaming between wifi access points.

As for old code, I'm referring to versions that go back before DSA, such as 15.05 Chaos Calmer. That version can run for months on end without so much as a hint of a crash, whereas version 22 sometimes crashes in as little as 30 minutes.

Brain2000 commented 1 year ago

> Following the line-of-thought from the previous comment based on @Brain2000 's input and with some more testing, I've reached the conclusion that the error only occurs if the SSIDs on both routers are connected via layer 2. Yes, everything's pretty smooth without HE/802.11ax but that's likely more a trigger than a cause. Atm, there's no good reason to assume the wifi connection itself isn't alright.
>
> No idea where this issue belongs now. Openwrt, Linux, Batman?
>
> Regardless: closing this.

After poring through crash dumps and examining code, I believe the crash is happening in the net/bridge code. I've found that some pointers suddenly become null when a station disconnects, while places in mac80211 are still using them.

Unfortunately it looks like openwrt is using a version of net/bridge before it was refactored. So simply getting that package updated might be enough to fix it, I don't know.

trunneml commented 6 months ago

My iphone connection also freezes/hangs after some idle time. I only have one OpenWRT access point and the SSID is 5 GHz only, so there is no roaming between access points or between 2.4 GHz and 5 GHz. Hardware is a Banana Pi R3 with the latest 23.05 stable release.