Closed yogo1212 closed 1 year ago
I have noticed as well that lots of iphone station adds/removes end up causing a crash much more readily than android. My gut reaction is that it has something to do with multicast, mDNS (Bonjour) in particular, since I can crash openwrt in seconds with a multicast DoS attack.
I've been trying to track down all the crashing bugs in openwrt for the last couple of months. Through the help of others we've already gotten several fixes into the code, and I feel there are still at least 2 or 3 to go.
The answer may lie in the fact that the net/bridge code (which deals with adding/removing stations/vlans/etc...) is an older version of the code and probably needs to be updated to a newer version. The functions have changed as compared to what is currently in OpenWRT.
Hi @Brain2000
The answer may lie in the fact that the net/bridge code (which deals with adding/removing stations/vlans/etc...) is an older version of the code and probably needs to be updated to a newer version. The functions have changed as compared to what is currently in OpenWRT.
The stuff in https://github.com/openwrt/openwrt/tree/master/target/linux/generic/files/drivers doesn't apply here. The phy+switch are DSA. Other than that, I don't know what you could mean. Could you point to the old code you've mentioned?
I'm far from having ruled out other sources of problems like batman or 802.11ax. But multicasts are unlikely to be the root cause here, imho. I assume they would be more peripheral. It could be something to do with the rate at which multicasts are being sent (going by your idea of a 'multicast DoS attack'). An iphone 7 roams happily, though. DHCP tends to go through fine as well (same gateway, same client address) while roaming.
So to reduce multicast traffic between the access points, I created an additional SSID in a new network where each access point is a DHCP server + gateway (with the same IP). This worked a lot better! But multicasts aren't the only thing that's changed.
Hmm..
Following the line-of-thought from the previous comment based on @Brain2000 's input and with some more testing, I've reached the conclusion that the error only occurs if the SSIDs on both routers are connected via layer 2. Yes, everything's pretty smooth without HE/802.11ax but that's likely more a trigger than a cause. Atm, there's no good reason to assume the wifi connection itself isn't alright.
No idea where this issue belongs now. Openwrt, Linux, Batman?
Regardless: closing this.
There's an important element to the multicast DoS which links these together. The DoS only happens if, while multicast packets are being sent, airplane mode is turned on or the station roams. I've observed crashes in 22.x.x for months, and they happen almost exclusively when there are iphones roaming between wifi access points.
As for old code, I'm referring to versions that go back before DSA, such as 15.05 Chaos Calmer. That version can run for months on end without so much as a hint of a crash, whereas version 22 sometimes crashes in as little as 30 minutes.
After poring over crash dumps and examining code, I believe the crash is happening in the net/bridge code. I've found that some pointers suddenly become null when a station disconnects, while places in mac80211 are still using them.
Unfortunately it looks like openwrt is using a version of net/bridge before it was refactored. So simply getting that package updated might be enough to fix it, I don't know.
My iphone connection also freezes/hangs after some idle time. I only have one OpenWRT access point and the SSID is 5 GHz only, so there is no roaming between access points nor between 2.4 GHz and 5 GHz. Hardware is a Banana PI R3 with the latest 23.05 stable release.
Setup
1x iphone 11 (ios 16.1.2) client, 2x cudy m1800 access points connected via cable or mesh (same behaviour).
Both access points are running kernel version 5.15.106 and c32d6d849c43792abd8007e13e468b12d6d6e0b7 but the issue was present with previous versions of openwrt and mt76 as well. The config is identical for both devices (details further down). The SSID is in the lan network (which optionally has a batman mesh interface). The lan interface is a bridge with all ethernet ports in it (no DHCP server enabled). The gateway is on another router.
That setup is a reproduction. The error was originally reported for an iphone 12 pro max (ios 16.4.1).
Reproduction and Observations
After moving around between the two access points with the iphone, there will be 4 seemingly-active associations. That's difficult for testing, so I've disabled 2.4 GHz on both APs and the rest of the issue is about just two associations. Both associations' 'inactive times' stay mostly under 200ms. The iphone appears to pick one of these associations to send traffic through. So far, that's a little weird but it should not be a problem per se. The tx-queues are properly emptied on both routers. The output of iw looks decent (examples further down).
Now, the problem is: While this is going on, the iphone has little to no connectivity. The ping success rate from the device to the lan network correlates strongly with that of pings in the other direction (i.e. they both work or both fail at roughly the same time). Many pings are lost (ranging from 20% to 100%). At times, there is enough connectivity to open a web page.
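To put numbers on "correlates strongly", the per-direction loss rate and the share of time slots in which both directions behave the same can be computed from two ping series. This is only a minimal sketch: the function names and the sample series are made up for illustration and are not measurements from this setup.

```python
from typing import Sequence


def loss_rate(replies: Sequence[bool]) -> float:
    """Fraction of pings that got no reply (0.0 = none lost, 1.0 = all lost)."""
    if not replies:
        raise ValueError("no ping samples")
    return 1.0 - sum(replies) / len(replies)


def agreement(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Fraction of time slots in which both directions had the same outcome."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty series of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sample: one entry per second, True = reply received.
phone_to_lan = [True, False, False, True, False, False, False, True]
lan_to_phone = [True, False, False, True, False, True, False, True]

if __name__ == "__main__":
    print("loss phone->lan:", loss_rate(phone_to_lan))
    print("loss lan->phone:", loss_rate(lan_to_phone))
    print("agreement:", agreement(phone_to_lan, lan_to_phone))
```

An agreement value close to 1.0 while both loss rates are high would match the observation that both directions work and fail together.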
Considerations
Interestingly, the iphone was able to receive notifications while there were no pings. Or at least it appears that it can receive exactly one notification during each connection freeze. This leads me to assume that only the sending direction (from the iphone) is affected and that packets towards the iphone reach it regardless of which access point sends them. (Rx-queues are always ready.)
There's also the idea that the iphone switches between the associations so fast that the switch refuses to update the ARP table and/or drops the packets.
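To make that idea concrete, here is a toy model of a MAC-learning bridge that remembers the last port a source address was seen on. It is only an illustration of the hypothesis: the class, the port names and the MAC address are invented, and it does not claim to reflect how the Linux bridge or the DSA switch hardware actually handles a flapping address.

```python
class LearningBridge:
    """Toy MAC-learning bridge: maps each source MAC to the port it was last seen on."""

    def __init__(self):
        self.fdb = {}    # MAC address -> port name
        self.moves = 0   # how often a known MAC showed up on a different port

    def learn(self, src_mac, port):
        """Record the port a frame from src_mac arrived on."""
        old = self.fdb.get(src_mac)
        if old is not None and old != port:
            self.moves += 1
        self.fdb[src_mac] = port

    def lookup(self, dst_mac):
        """Port to forward to, or None if unknown (a real bridge would flood)."""
        return self.fdb.get(dst_mac)


# Hypothetical scenario: the phone's frames arrive alternately via both APs.
PHONE = "02:00:00:00:00:01"
bridge = LearningBridge()
for port in ["ap1", "ap2", "ap1", "ap2", "ap2"]:
    bridge.learn(PHONE, port)

if __name__ == "__main__":
    print("port moves:", bridge.moves)
    print("traffic for the phone currently goes to:", bridge.lookup(PHONE))
```

In this model, traffic towards the phone always follows the most recent sighting, so an address that keeps hopping between ports sends return traffic to whichever AP was seen last. Whether the real switch rate-limits or refuses such moves is exactly the open question.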
There are no multiple associations with a 2014 lenovo laptop or with an older samsung phone and there's no error there either. That's how the iphone got its position in the issue title.
Out of desperation, I tried @ptpt52's patch from https://github.com/openwrt/mt76/issues/518 just in case it was a queueing error. But it didn't help.
I can't tell whether this would happen with one access point (and two radios) because it's hard to tell the iphone to switch to the other association (so I can't tell whether packets reach the physical network better than the software bridge).
Possibly related to: https://github.com/openwrt/mt76/issues/672
Logs + Config
Here's what it looks like with both associations active on two access points:
at the same time, on the other router: (calling iw .. station get in a loop)
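For anyone who wants to repeat the polling, the sketch below pulls the 'inactive time' value out of the text printed by `iw dev <iface> station get <mac>`. The sample text only mimics the usual layout of that output with invented values, and the interface name, MAC address and helper names are hypothetical. Python is normally not installed on the APs, so the idea would be to run this on a machine that collects the output, or directly on a Linux box that has `iw`.

```python
import re
import subprocess
import time

# "inactive time:" followed by a number of milliseconds, as printed by iw.
INACTIVE_RE = re.compile(r"^\s*inactive time:\s*(\d+)\s*ms\s*$", re.MULTILINE)


def inactive_ms(iw_output):
    """Return the inactive time in ms from iw station output, or None if absent."""
    match = INACTIVE_RE.search(iw_output)
    return int(match.group(1)) if match else None


def poll(iface, mac, interval=0.5, count=10):
    """Call `iw dev <iface> station get <mac>` repeatedly and print the inactive time."""
    for _ in range(count):
        result = subprocess.run(
            ["iw", "dev", iface, "station", "get", mac],
            capture_output=True, text=True,
        )
        print(inactive_ms(result.stdout))
        time.sleep(interval)


# Invented sample in the general shape of iw's station output (not a real log).
SAMPLE = (
    "Station 02:00:00:00:00:01 (on wlan0)\n"
    "\tinactive time:\t120 ms\n"
    "\trx bytes:\t4096\n"
)

if __name__ == "__main__":
    print(inactive_ms(SAMPLE))
```

Running `poll` on both routers at the same time would show whether both associations keep reporting low values in parallel.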
inactive time stays low on both routers. No idea if this is the new intended roaming behaviour or whether the iphone isn't happy enough with the new connection and lingers while it's making up its mind.
Here is the untrimmed output of iw for completeness:
This is what hostapd has to say:
wireless config:
If anyone has any ideas, I'm willing to test them!