Closed yuhhaurlin closed 6 years ago
It looks like this problem happens with hand held devices. I will try to reproduce it with one phone. Your test includes too many devices. You can wait for the fix and help to verify it.
I only see one error in my log:
Nov 21 22:47:39 MCDEBIAN kernel: ieee80211 phy0: create ba result error 1 Nov 21 22:47:39 MCDEBIAN kernel: ieee80211 phy0: ampdu operation error code: -22
@yuhhaurlin In my case, I don't use wireless for laptops and such, so my results don't necessarily indicate it's specifically handhelds.
Please keep in mind that, based on my experience, the issue seems to manifest itself more reliably if two or more devices leave wifi range for an hour or so, then return at the same time. Issues with a single device leaving and returning seem to be much less consistent.
In the meantime, I will focus on trying to reproduce the issue reliably with a single device.
Thank you for your attention to this issue. I really like this router with LEDE; without your efforts on this, none of this would be possible.
I've had this issue occur several times in the past few days on my WRT3200ACM (davidc502 r5389) and I've noticed that the "MEMAddrAccess" issue seems to start when a dual frequency (2.4 and 5GHz) device comes back into network range after several hours (touching both SSIDs). I have not observed a single frequency device (configured for only one SSID) cause the issue.
In my most recent recreate, I can see the device hitting both phy0 and phy1, which is then immediately followed by the errors:
Tue Nov 21 18:01:36 2017 kern.info kernel: [61719.825738] ieee80211 phy1: staid 3 deleted
Tue Nov 21 18:01:36 2017 kern.info kernel: [61719.957752] ieee80211 phy0: staid 1 deleted
Tue Nov 21 18:01:36 2017 daemon.info hostapd: wlan0: STA f0:d7:aa:c6:2e:b4 IEEE 802.11: associated (aid 1)
Tue Nov 21 18:01:45 2017 kern.err kernel: [61729.082449] ieee80211 phy0: cmd 0x9111=SetNewStation timed out
Here's a little larger chunk of the same logs, for more context:
Tue Nov 21 17:52:30 2017 daemon.notice hostapd: wlan1: AP-STA-POLL-OK f0:d7:aa:c6:2e:b4
Tue Nov 21 17:52:36 2017 daemon.notice hostapd: wlan0: AP-STA-POLL-OK f0:d7:aa:c6:2b:96
Tue Nov 21 17:52:38 2017 daemon.notice hostapd: wlan0: AP-STA-POLL-OK f0:d7:aa:c6:2e:b4
Tue Nov 21 17:57:33 2017 daemon.notice hostapd: wlan1: AP-STA-POLL-OK f0:d7:aa:c6:2e:b4
Tue Nov 21 17:57:39 2017 daemon.notice hostapd: wlan0: AP-STA-POLL-OK f0:d7:aa:c6:2b:96
Tue Nov 21 17:57:41 2017 daemon.notice hostapd: wlan0: AP-STA-POLL-OK f0:d7:aa:c6:2e:b4
Tue Nov 21 17:58:02 2017 kern.info kernel: [61505.750462] ieee80211 phy1: staid 3 deleted
Tue Nov 21 17:58:02 2017 daemon.info hostapd: wlan1: STA f0:d7:aa:c6:2e:b4 IEEE 802.11: associated (aid 3)
Tue Nov 21 17:58:03 2017 daemon.notice hostapd: wlan1: AP-STA-CONNECTED f0:d7:aa:c6:2e:b4
Tue Nov 21 17:58:03 2017 daemon.info hostapd: wlan1: STA f0:d7:aa:c6:2e:b4 WPA: pairwise key handshake completed (RSN)
Tue Nov 21 17:58:03 2017 daemon.info dnsmasq-dhcp[3162]: DHCPDISCOVER(br-lan) f0:d7:aa:c6:2e:b4
Tue Nov 21 17:58:03 2017 daemon.info dnsmasq-dhcp[3162]: DHCPOFFER(br-lan) 192.168.1.10 f0:d7:aa:c6:2e:b4
Tue Nov 21 17:58:05 2017 daemon.info dnsmasq-dhcp[3162]: DHCPDISCOVER(br-lan) f0:d7:aa:c6:2e:b4
Tue Nov 21 17:58:05 2017 daemon.info dnsmasq-dhcp[3162]: DHCPOFFER(br-lan) 192.168.1.10 f0:d7:aa:c6:2e:b4
Tue Nov 21 17:58:08 2017 daemon.info dnsmasq-dhcp[3162]: DHCPDISCOVER(br-lan) f0:d7:aa:c6:2e:b4
Tue Nov 21 17:58:08 2017 daemon.info dnsmasq-dhcp[3162]: DHCPOFFER(br-lan) 192.168.1.10 f0:d7:aa:c6:2e:b4
Tue Nov 21 17:58:08 2017 daemon.info dnsmasq-dhcp[3162]: DHCPREQUEST(br-lan) 192.168.1.10 f0:d7:aa:c6:2e:b4
Tue Nov 21 17:58:08 2017 daemon.info dnsmasq-dhcp[3162]: DHCPACK(br-lan) 192.168.1.10 f0:d7:aa:c6:2e:b4 Phone
Tue Nov 21 17:58:11 2017 daemon.info hostapd: wlan1: STA f0:d7:aa:c6:2e:b4 IEEE 802.11: authenticated
Tue Nov 21 17:58:17 2017 daemon.notice hostapd: wlan0: AP-STA-POLL-OK b4:52:7e:3a:a4:04
Tue Nov 21 18:01:36 2017 daemon.notice hostapd: wlan1: AP-STA-DISCONNECTED f0:d7:aa:c6:2e:b4
Tue Nov 21 18:01:36 2017 kern.info kernel: [61719.825738] ieee80211 phy1: staid 3 deleted
Tue Nov 21 18:01:36 2017 kern.info kernel: [61719.957752] ieee80211 phy0: staid 1 deleted
Tue Nov 21 18:01:36 2017 daemon.info hostapd: wlan0: STA f0:d7:aa:c6:2e:b4 IEEE 802.11: associated (aid 1)
Tue Nov 21 18:01:45 2017 kern.err kernel: [61729.082449] ieee80211 phy0: cmd 0x9111=SetNewStation timed out
Tue Nov 21 18:01:45 2017 kern.err kernel: [61729.088310] ieee80211 phy0: return code: 0x1111
Tue Nov 21 18:01:45 2017 kern.err kernel: [61729.092866] ieee80211 phy0: timeout: 0x1111
Tue Nov 21 18:01:45 2017 daemon.info hostapd: wlan0: STA f0:d7:aa:c6:2e:b4 IEEE 802.11: deauthenticated due to local deauth request
Tue Nov 21 18:01:53 2017 kern.err kernel: [61737.110195] ieee80211 phy0: cmd 0x801d=MEMAddrAccess timed out
Tue Nov 21 18:01:53 2017 kern.err kernel: [61737.116080] ieee80211 phy0: return code: 0x001d
Tue Nov 21 18:01:53 2017 kern.err kernel: [61737.120635] ieee80211 phy0: timeout: 0x001d
The last three lines above repeat every 4-ish seconds, until:
Tue Nov 21 18:02:16 2017 daemon.info hostapd: wlan1: STA f0:d7:aa:c6:2e:b4 IEEE 802.11: associated (aid 3)
Tue Nov 21 18:02:16 2017 daemon.info hostapd: wlan1: STA f0:d7:aa:c6:2e:b4 IEEE 802.11: authenticated
Tue Nov 21 18:02:17 2017 daemon.notice hostapd: wlan1: AP-STA-CONNECTED f0:d7:aa:c6:2e:b4
Tue Nov 21 18:02:17 2017 daemon.info hostapd: wlan1: STA f0:d7:aa:c6:2e:b4 WPA: pairwise key handshake completed (RSN)
Tue Nov 21 18:02:17 2017 daemon.info dnsmasq-dhcp[3162]: DHCPDISCOVER(br-lan) f0:d7:aa:c6:2e:b4
Tue Nov 21 18:02:17 2017 daemon.info dnsmasq-dhcp[3162]: DHCPOFFER(br-lan) 192.168.1.10 f0:d7:aa:c6:2e:b4
Tue Nov 21 18:02:17 2017 daemon.info dnsmasq-dhcp[3162]: DHCPREQUEST(br-lan) 192.168.1.10 f0:d7:aa:c6:2e:b4
Tue Nov 21 18:02:17 2017 daemon.info dnsmasq-dhcp[3162]: DHCPACK(br-lan) 192.168.1.10 f0:d7:aa:c6:2e:b4 Phone
Tue Nov 21 18:02:23 2017 kern.err kernel: [61767.111803] ieee80211 phy0: cmd 0x801d=MEMAddrAccess timed out
Tue Nov 21 18:02:23 2017 kern.err kernel: [61767.117667] ieee80211 phy0: return code: 0x001d
Tue Nov 21 18:02:23 2017 kern.err kernel: [61767.122224] ieee80211 phy0: timeout: 0x001d
Tue Nov 21 18:02:27 2017 kern.err kernel: [61771.125738] ieee80211 phy0: cmd 0x801d=MEMAddrAccess timed out
Tue Nov 21 18:02:27 2017 kern.err kernel: [61771.131604] ieee80211 phy0: return code: 0x001d
Tue Nov 21 18:02:27 2017 kern.err kernel: [61771.136151] ieee80211 phy0: timeout: 0x001d
The last three lines above continue indefinitely.
EDIT Here's another example:
Wed Nov 22 02:57:15 2017 daemon.info hostapd: wlan1: STA dc:a9:71:e4:34:46 IEEE 802.11: associated (aid 1)
Wed Nov 22 02:57:15 2017 daemon.notice hostapd: wlan1: AP-STA-CONNECTED dc:a9:71:e4:34:46
Wed Nov 22 02:57:15 2017 daemon.info hostapd: wlan1: STA dc:a9:71:e4:34:46 WPA: pairwise key handshake completed (RSN)
Wed Nov 22 02:57:15 2017 daemon.info dnsmasq-dhcp[8019]: DHCPREQUEST(br-lan) 192.168.1.142 dc:a9:71:e4:34:46
Wed Nov 22 02:57:15 2017 daemon.info dnsmasq-dhcp[8019]: DHCPACK(br-lan) 192.168.1.142 dc:a9:71:e4:34:46
Wed Nov 22 02:57:16 2017 daemon.info hostapd: wlan1: STA dc:a9:71:e4:34:46 IEEE 802.11: authenticated
Wed Nov 22 02:59:18 2017 kern.info kernel: [27942.912423] ieee80211 phy0: staid 3 deleted
Wed Nov 22 02:59:18 2017 daemon.info hostapd: wlan0: STA f0:d7:aa:c6:2b:96 IEEE 802.11: authenticated
Wed Nov 22 02:59:18 2017 daemon.info hostapd: wlan0: STA f0:d7:aa:c6:2b:96 IEEE 802.11: associated (aid 3)
Wed Nov 22 02:59:28 2017 kern.err kernel: [27952.937095] ieee80211 phy0: cmd 0x9111=SetNewStation timed out
Wed Nov 22 02:59:28 2017 kern.err kernel: [27952.942956] ieee80211 phy0: return code: 0x1111
Wed Nov 22 02:59:28 2017 kern.err kernel: [27952.947512] ieee80211 phy0: timeout: 0x1111
Wed Nov 22 02:59:28 2017 daemon.info hostapd: wlan0: STA f0:d7:aa:c6:2b:96 IEEE 802.11: deauthenticated due to local deauth request
Wed Nov 22 02:59:32 2017 kern.err kernel: [27956.957021] ieee80211 phy0: cmd 0x801d=MEMAddrAccess timed out
Wed Nov 22 02:59:32 2017 kern.err kernel: [27956.962882] ieee80211 phy0: return code: 0x001d
Wed Nov 22 02:59:32 2017 kern.err kernel: [27956.967445] ieee80211 phy0: timeout: 0x001d
@michaelmcmaster You bring up a good point. Generally, our devices will be on the 5GHz band when they leave coverage, but come back in on the 2.4GHz band (as it has better range). They will then switch to the 5GHz band at some point after being used.
I'm using 2.4G only and still the same issue. It happens ~3days of usage.
@wongsyrone Did you have hand held devices connected to your router?
About 4 devices:
Yes. It looks the problem is related to hand held devices. Thanks.
@wongsyrone Interesting, I also have an Intel ac7260 card (ultrabook) on my network as well and experience the same "MEMAddrAccess" issue approx. every 2-3 days. Also have two iPhones and two iPads.
"MEMAddrAccess" occurs on LEDE Stable and trunk. Trunk eventually crashes, while stable keeps running with high load but does not crash.
It seems @yuhhaurlin can use Intel ac7260 to make things easier.
Can you just use ac7260 to see if you will encounter problem?
I'm using exclusively 5ghz, only when I troubleshoot a foreign device that does not support 5Ghz I'm using 2,4. I have an Realtek AC USB device, I could try unplug it several times and reconnect it after specified amount of time
Hi, some more observations. i left from my flat a hour ago and , iphone 7 is in my flat, and ipad Air 2 is with me 3 km east. All other phoines, laptops are aslo with me(android and intel wifi cards), ipad is still connected ?!?!
root@3200acm:~# iw wlan0 station dump Station ipadAIR (on wlan0) inactive time: 4302830 ms rx bytes: 1084687 rx packets: 8686 tx bytes: 2236225 tx packets: 7364 tx retries: 0 tx failed: 0 rx drop misc: 0 signal: -60 dBm signal avg: -38 dBm tx bitrate: 585.1 MBit/s VHT-MCS 6 80MHz short GI VHT-NSS 2 rx bitrate: 866.7 MBit/s VHT-MCS 9 80MHz short GI VHT-NSS 2 authorized: yes authenticated: yes associated: yes preamble: long WMM/WME: yes MFP: no TDLS peer: no DTIM period: 2 beacon interval:100 short slot time:yes connected time: 40026 seconds Station iphone 7 (on wlan0) inactive time: 49300 ms rx bytes: 9162385 rx packets: 44328 tx bytes: 264692682 tx packets: 181411 tx retries: 0 tx failed: 0 rx drop misc: 0 signal: -55 dBm signal avg: -57 dBm tx bitrate: 866.7 MBit/s VHT-MCS 9 80MHz short GI VHT-NSS 2 rx bitrate: 585.0 MBit/s VHT-MCS 7 80MHz VHT-NSS 2 authorized: yes authenticated: yes associated: yes preamble: long WMM/WME: yes MFP: no TDLS peer: no DTIM period: 2 beacon interval:100 short slot time:yes connected time: 6622 seconds
is a known bug, if you don't disable the wireless on the phone/tablet before going out of range, the stations will still be shown as connected
I have been trying to reproduce the issue for approx. "2d 12h 15m" now with only the Intel ac7260 chipset. I have kept all of the Apple mobile devices off the network during this time. However, I have not yet been able to reproduce the problem with only the Intel ac7260 chipset. I will keep testing and see what happens.
it's normal that rmmod mwlwifi
to end in crash ?
I tried it after issuing wifi down
after the driver hanged and after a fresh restart, in every case, the kernel panicked
Current uptime is approx. "3d 11h 32m" now with only the Intel chipset on the network. I am starting to believe that the issue is more to do with the mobile devices (which I have kept off the network during this test). In my case, specifically two iPads and two iPhones.
@WildByDesign Generally speaking, when only one iDevice leaves my home then returns at least an hour later, it won't crash. We have to have two leave and return at the same time. It doesn't, however, appear to be restricted to iDevices, though. @michaelmcmaster's dumps he recently provided had Sony, Intel and Motorola MAC addresses.
It seems odd that it would be restricted to mobile devices only, though. Unless, of course, it's some sort of race condition that requires that two previously-registered devices return and reregister at least an hour later, close to the same time. That would be sort of unusual for two laptops.
for me only one device is enough, current uptime before crash 22h, but for sure the test firmware lasted longer
I'm not seeing any crashes on the McDebian beta using the latest mwlwifi commit. Maybe the issue is specific to LEDE.
It's not just LEDE, DD-WRT has this issue as well still.
@lucize The device is not hand held device?
These crashes will not happen if you are not going in and out of range with certain mobile devices, I know for sure Samsung S7 with Qualcomm chipset will crash it. If your not using these mobile devices and not going in and out of range then this will last forever with no issues.
Thanks.
@yuhhaurlin it is, either one of the phones, I tried with the usb stick many times and after various hours and couldn't make it crash, if I unplug the device from usb, the router still thinks that is connected , if I click disconnect from the wifi interface, then it will be cleanly removed from the status of the router. Unfortunately I don't know how to disable wireless power sawing on the android phones that I use daily, none of them are rooted. I have a motorola droid 4 and a photon q that can be rooted and I know a bit about those and made some commits on the kernel and userspace, I'll try wit them there are times when it crashes and no notebook where connected to wireless, only phones
I don't have a specific setting for power saving for wireless, but there is a (general) power saving mode that does change the way wireless works as well as a few other things such as display timeout etc.. so I have tried both modes and both seem to trigger this issue in about the same amount of loss/reacquisition scenarios.
These newer phones are using Android Nougat.
Edit: None of the phones I'm using are Rooted yet.
or disable the WMM Mode
from the AP?
I can put the S7 in a large stainless pot with lid and walk about 300 yards ~ 300 meters and lose signal then take off lid and require. Do this and wait for about one half hour to 45 minutes to reacquire and the router will crash.
@lucize and @aaron1959 how recent are your builds? It appears mine has been more stable since the most recent change by @yuhhaurlin--at least I've not seen an instance of a "single device leaves/returns" crash. All of the crashes I've seen have required two devices.
almost after every crash I update the HEAD
Thanks for all of your information. Now we can make sure only hand held device will make this problem happened.
A single samsung S7 will crash it every time without fail. @lucize I just disabled WMM on the atho and the connection dropped and when it came back speed was 54 Mbs and said Legacy, I swithced wmm on and speed went back to 866 Mbs VHT80SGI.
Please don't disable WMM.
I will try to unplug antenna of my device to see if I can reproduce this problem easily.
It has to completely drop connection and then wait about an hour when it tries to reconnect the router will start having issues first all 5 Ghz traffic stops, then web UI becomes unresponsive at times. sometimes if the UI is responding turning off ath0 and turning back on will restart ath0 and connections will be fine for awhile. this has happened twice to me.
ieee80211 phy0: staid 1 deleted [17607.339225] ieee80211 phy0: cmd 0x9122=UpdateEncryption timed out [17607.345348] ieee80211 phy0: return code: 0x1122 [17607.349908] ieee80211 phy0: timeout: 0x1122 [17607.354109] wlan0: failed to set key (0, a8:5c:2c:44:6f:66) to hardware (-5) [17622.319290] ieee80211 phy1: staid 2 deleted [17687.647667] ieee80211 phy1: cmd 0x8050=broadcast_ssid_enable timed out [17687.654227] ieee80211 phy1: return code: 0x0550 [17687.658788] ieee80211 phy1: timeout: 0x0050 [17691.653923] ieee80211 phy1: cmd 0x8127=SetInformationElements timed out [17691.660577] ieee80211 phy1: return code: 0x0127 [17691.665126] ieee80211 phy1: timeout: 0x0127 [17695.662361] ieee80211 phy1: cmd 0x801c=80211RadioControl timed out [17695.668579] ieee80211 phy1: return code: 0x001c [17695.673127] ieee80211 phy1: timeout: 0x001c [17699.665677] ieee80211 phy1: cmd 0x8126=SetFixedRate timed out [17699.671461] ieee80211 phy1: return code: 0x0126 [17699.676009] ieee80211 phy1: timeout: 0x0126 [17874.066182] ieee80211 phy1: cmd 0x9143=GetSeqno timed out [17874.071609] ieee80211 phy1: return code: 0x1143 [17874.076169] ieee80211 phy1: timeout: 0x1143 [17907.359392] ieee80211 phy0: cmd 0x9143=GetSeqno timed out [17907.364825] ieee80211 phy0: return code: 0x1143 [17907.369373] ieee80211 phy0: timeout: 0x1143 [17927.363873] ieee80211 phy1: cmd 0x9143=GetSeqno timed out [17927.369304] ieee80211 phy1: return code: 0x1143 [17927.373852] ieee80211 phy1: timeout: 0x1143 [17931.423828] ieee80211 phy1: cmd 0x9143=GetSeqno timed out [17931.429253] ieee80211 phy1: return code: 0x1143 [17931.433801] ieee80211 phy1: timeout: 0x1143 root@3200acm:~#
@gsustek please post your /var/run/hostapd-phy*.conf files so I can try replicate this issue as too.
hostapd-phy0.conf.txt you can test with this
Thanks I will test it and report back results.
To all the people having these issues, have you tried re-downloading the source and rebuild the toolchain?
Always compiling from source and using GCC 7.2
@lucize I can't test the hostapd config lines below because they are strictly OpenWRT customization patches for mwlwifi or ieee80211:
Line 13: unknown configuration item 'ht_coex'
Line 17: unknown configuration item 'wpa_disable_eapol_key_retries'
those options shouldn't matter
Updating toolchain and rebuilding from source using gcc 6.3 every time there's a crash.
maybe try GCC 7.2, that's what I've been using on Fedora, Ubuntu 16.04 and 17.10. I remember 6.X giving me bugs on debian which is why i moved to ubuntu 16.04
that doesnt makes a difference. dd-wrt uses 7.2 for instance and the same issue is present
And I am experiencing this issue with GCC 7.2 dd-wrt.
The Issue only occurs under very specific circumstances, all listed above. Something needs to be fixed that only one of us has control over. These other ideas are great but to me it's like putting new tires on a car with a blown engine and constantly changing the tires hoping the car suddenly starts running.
@BrainSlayer noted
@aaron1959 I believe the issue is that the problem isn't easy to reproduce universally. On my build, one device leaving for 1.5 hours and coming back will not crash the router anymore (it hasn't for the last couple of weeks). Only if two devices leave and return simultaneously does the router crash. Case-in-point, my router has been up since Tuesday morning with no issues, as only individual devices have left and returned. Both of us will be leaving this evening, so I expect the router will, in fact, crash.
Please use this one to check this issue.