kaloz / mwlwifi

mac80211 driver for the Marvell 88W8864 802.11ac chip
394 stars 119 forks source link

Host command timeout #232

Closed yuhhaurlin closed 6 years ago

yuhhaurlin commented 6 years ago

Please use this one to check this issue.

yuhhaurlin commented 6 years ago

It looks like this problem happens with hand held devices. I will try to reproduce it with one phone. Your test includes too many devices. You can wait for the fix and help to verify it.

Chadster766 commented 6 years ago

I only see one error in my log:

Nov 21 22:47:39 MCDEBIAN kernel: ieee80211 phy0: create ba result error 1 Nov 21 22:47:39 MCDEBIAN kernel: ieee80211 phy0: ampdu operation error code: -22

ratsputin commented 6 years ago

@yuhhaurlin In my case, I don't use wireless for laptops and such, so my results don't necessarily indicate it's specifically handhelds.

Please keep in mind that, based on my experience, the issue seems to manifest itself more reliably if two or more devices leave wifi range for an hour or so, then return at the same time. Issues with a single device leaving and returning seem to be much less consistent.

In the meantime, I will focus on trying to reproduce the issue reliably with a single device.

Thank you for your attention to this issue. I really like this router with LEDE; without your efforts on this, none of this would be possible.

michaelmcmaster commented 6 years ago

I've had this issue occur several times in the past few days on my WRT3200ACM (davidc502 r5389) and I've noticed that the "MEMAddrAccess" issue seems to start when a dual frequency (2.4 and 5GHz) device comes back into network range after several hours (touching both SSIDs). I have not observed a single frequency device (configured for only one SSID) cause the issue.

In my most recent recreate, I can see the device hitting both phy0 and phy1, which is then immediately followed by the errors:

Tue Nov 21 18:01:36 2017 kern.info kernel: [61719.825738] ieee80211 phy1: staid 3 deleted
Tue Nov 21 18:01:36 2017 kern.info kernel: [61719.957752] ieee80211 phy0: staid 1 deleted
Tue Nov 21 18:01:36 2017 daemon.info hostapd: wlan0: STA f0:d7:aa:c6:2e:b4 IEEE 802.11: associated (aid 1)
Tue Nov 21 18:01:45 2017 kern.err kernel: [61729.082449] ieee80211 phy0: cmd 0x9111=SetNewStation timed out

Here's a little larger chunk of the same logs, for more context:

Tue Nov 21 17:52:30 2017 daemon.notice hostapd: wlan1: AP-STA-POLL-OK f0:d7:aa:c6:2e:b4
Tue Nov 21 17:52:36 2017 daemon.notice hostapd: wlan0: AP-STA-POLL-OK f0:d7:aa:c6:2b:96
Tue Nov 21 17:52:38 2017 daemon.notice hostapd: wlan0: AP-STA-POLL-OK f0:d7:aa:c6:2e:b4
Tue Nov 21 17:57:33 2017 daemon.notice hostapd: wlan1: AP-STA-POLL-OK f0:d7:aa:c6:2e:b4
Tue Nov 21 17:57:39 2017 daemon.notice hostapd: wlan0: AP-STA-POLL-OK f0:d7:aa:c6:2b:96
Tue Nov 21 17:57:41 2017 daemon.notice hostapd: wlan0: AP-STA-POLL-OK f0:d7:aa:c6:2e:b4
Tue Nov 21 17:58:02 2017 kern.info kernel: [61505.750462] ieee80211 phy1: staid 3 deleted
Tue Nov 21 17:58:02 2017 daemon.info hostapd: wlan1: STA f0:d7:aa:c6:2e:b4 IEEE 802.11: associated (aid 3)
Tue Nov 21 17:58:03 2017 daemon.notice hostapd: wlan1: AP-STA-CONNECTED f0:d7:aa:c6:2e:b4
Tue Nov 21 17:58:03 2017 daemon.info hostapd: wlan1: STA f0:d7:aa:c6:2e:b4 WPA: pairwise key handshake completed (RSN)
Tue Nov 21 17:58:03 2017 daemon.info dnsmasq-dhcp[3162]: DHCPDISCOVER(br-lan) f0:d7:aa:c6:2e:b4
Tue Nov 21 17:58:03 2017 daemon.info dnsmasq-dhcp[3162]: DHCPOFFER(br-lan) 192.168.1.10 f0:d7:aa:c6:2e:b4
Tue Nov 21 17:58:05 2017 daemon.info dnsmasq-dhcp[3162]: DHCPDISCOVER(br-lan) f0:d7:aa:c6:2e:b4
Tue Nov 21 17:58:05 2017 daemon.info dnsmasq-dhcp[3162]: DHCPOFFER(br-lan) 192.168.1.10 f0:d7:aa:c6:2e:b4
Tue Nov 21 17:58:08 2017 daemon.info dnsmasq-dhcp[3162]: DHCPDISCOVER(br-lan) f0:d7:aa:c6:2e:b4
Tue Nov 21 17:58:08 2017 daemon.info dnsmasq-dhcp[3162]: DHCPOFFER(br-lan) 192.168.1.10 f0:d7:aa:c6:2e:b4
Tue Nov 21 17:58:08 2017 daemon.info dnsmasq-dhcp[3162]: DHCPREQUEST(br-lan) 192.168.1.10 f0:d7:aa:c6:2e:b4
Tue Nov 21 17:58:08 2017 daemon.info dnsmasq-dhcp[3162]: DHCPACK(br-lan) 192.168.1.10 f0:d7:aa:c6:2e:b4 Phone
Tue Nov 21 17:58:11 2017 daemon.info hostapd: wlan1: STA f0:d7:aa:c6:2e:b4 IEEE 802.11: authenticated
Tue Nov 21 17:58:17 2017 daemon.notice hostapd: wlan0: AP-STA-POLL-OK b4:52:7e:3a:a4:04
Tue Nov 21 18:01:36 2017 daemon.notice hostapd: wlan1: AP-STA-DISCONNECTED f0:d7:aa:c6:2e:b4
Tue Nov 21 18:01:36 2017 kern.info kernel: [61719.825738] ieee80211 phy1: staid 3 deleted
Tue Nov 21 18:01:36 2017 kern.info kernel: [61719.957752] ieee80211 phy0: staid 1 deleted
Tue Nov 21 18:01:36 2017 daemon.info hostapd: wlan0: STA f0:d7:aa:c6:2e:b4 IEEE 802.11: associated (aid 1)
Tue Nov 21 18:01:45 2017 kern.err kernel: [61729.082449] ieee80211 phy0: cmd 0x9111=SetNewStation timed out
Tue Nov 21 18:01:45 2017 kern.err kernel: [61729.088310] ieee80211 phy0: return code: 0x1111
Tue Nov 21 18:01:45 2017 kern.err kernel: [61729.092866] ieee80211 phy0: timeout: 0x1111
Tue Nov 21 18:01:45 2017 daemon.info hostapd: wlan0: STA f0:d7:aa:c6:2e:b4 IEEE 802.11: deauthenticated due to local deauth request
Tue Nov 21 18:01:53 2017 kern.err kernel: [61737.110195] ieee80211 phy0: cmd 0x801d=MEMAddrAccess timed out
Tue Nov 21 18:01:53 2017 kern.err kernel: [61737.116080] ieee80211 phy0: return code: 0x001d
Tue Nov 21 18:01:53 2017 kern.err kernel: [61737.120635] ieee80211 phy0: timeout: 0x001d

The last three lines above repeat every 4-ish seconds, until:

Tue Nov 21 18:02:16 2017 daemon.info hostapd: wlan1: STA f0:d7:aa:c6:2e:b4 IEEE 802.11: associated (aid 3)
Tue Nov 21 18:02:16 2017 daemon.info hostapd: wlan1: STA f0:d7:aa:c6:2e:b4 IEEE 802.11: authenticated
Tue Nov 21 18:02:17 2017 daemon.notice hostapd: wlan1: AP-STA-CONNECTED f0:d7:aa:c6:2e:b4
Tue Nov 21 18:02:17 2017 daemon.info hostapd: wlan1: STA f0:d7:aa:c6:2e:b4 WPA: pairwise key handshake completed (RSN)
Tue Nov 21 18:02:17 2017 daemon.info dnsmasq-dhcp[3162]: DHCPDISCOVER(br-lan) f0:d7:aa:c6:2e:b4
Tue Nov 21 18:02:17 2017 daemon.info dnsmasq-dhcp[3162]: DHCPOFFER(br-lan) 192.168.1.10 f0:d7:aa:c6:2e:b4
Tue Nov 21 18:02:17 2017 daemon.info dnsmasq-dhcp[3162]: DHCPREQUEST(br-lan) 192.168.1.10 f0:d7:aa:c6:2e:b4
Tue Nov 21 18:02:17 2017 daemon.info dnsmasq-dhcp[3162]: DHCPACK(br-lan) 192.168.1.10 f0:d7:aa:c6:2e:b4 Phone
Tue Nov 21 18:02:23 2017 kern.err kernel: [61767.111803] ieee80211 phy0: cmd 0x801d=MEMAddrAccess timed out
Tue Nov 21 18:02:23 2017 kern.err kernel: [61767.117667] ieee80211 phy0: return code: 0x001d
Tue Nov 21 18:02:23 2017 kern.err kernel: [61767.122224] ieee80211 phy0: timeout: 0x001d
Tue Nov 21 18:02:27 2017 kern.err kernel: [61771.125738] ieee80211 phy0: cmd 0x801d=MEMAddrAccess timed out
Tue Nov 21 18:02:27 2017 kern.err kernel: [61771.131604] ieee80211 phy0: return code: 0x001d
Tue Nov 21 18:02:27 2017 kern.err kernel: [61771.136151] ieee80211 phy0: timeout: 0x001d

The last three lines above continue indefinitely.

EDIT Here's another example:

Wed Nov 22 02:57:15 2017 daemon.info hostapd: wlan1: STA dc:a9:71:e4:34:46 IEEE 802.11: associated (aid 1)
Wed Nov 22 02:57:15 2017 daemon.notice hostapd: wlan1: AP-STA-CONNECTED dc:a9:71:e4:34:46
Wed Nov 22 02:57:15 2017 daemon.info hostapd: wlan1: STA dc:a9:71:e4:34:46 WPA: pairwise key handshake completed (RSN)
Wed Nov 22 02:57:15 2017 daemon.info dnsmasq-dhcp[8019]: DHCPREQUEST(br-lan) 192.168.1.142 dc:a9:71:e4:34:46
Wed Nov 22 02:57:15 2017 daemon.info dnsmasq-dhcp[8019]: DHCPACK(br-lan) 192.168.1.142 dc:a9:71:e4:34:46
Wed Nov 22 02:57:16 2017 daemon.info hostapd: wlan1: STA dc:a9:71:e4:34:46 IEEE 802.11: authenticated
Wed Nov 22 02:59:18 2017 kern.info kernel: [27942.912423] ieee80211 phy0: staid 3 deleted
Wed Nov 22 02:59:18 2017 daemon.info hostapd: wlan0: STA f0:d7:aa:c6:2b:96 IEEE 802.11: authenticated
Wed Nov 22 02:59:18 2017 daemon.info hostapd: wlan0: STA f0:d7:aa:c6:2b:96 IEEE 802.11: associated (aid 3)
Wed Nov 22 02:59:28 2017 kern.err kernel: [27952.937095] ieee80211 phy0: cmd 0x9111=SetNewStation timed out
Wed Nov 22 02:59:28 2017 kern.err kernel: [27952.942956] ieee80211 phy0: return code: 0x1111
Wed Nov 22 02:59:28 2017 kern.err kernel: [27952.947512] ieee80211 phy0: timeout: 0x1111
Wed Nov 22 02:59:28 2017 daemon.info hostapd: wlan0: STA f0:d7:aa:c6:2b:96 IEEE 802.11: deauthenticated due to local deauth request
Wed Nov 22 02:59:32 2017 kern.err kernel: [27956.957021] ieee80211 phy0: cmd 0x801d=MEMAddrAccess timed out
Wed Nov 22 02:59:32 2017 kern.err kernel: [27956.962882] ieee80211 phy0: return code: 0x001d
Wed Nov 22 02:59:32 2017 kern.err kernel: [27956.967445] ieee80211 phy0: timeout: 0x001d
ratsputin commented 6 years ago

@michaelmcmaster You bring up a good point. Generally, our devices will be on the 5GHz band when they leave coverage, but come back in on the 2.4GHz band (as it has better range). They will then switch to the 5GHz band at some point after being used.

wongsyrone commented 6 years ago

I'm using 2.4G only and still the same issue. It happens ~3days of usage.

yuhhaurlin commented 6 years ago

@wongsyrone Did you have hand held devices connected to your router?

wongsyrone commented 6 years ago

About 4 devices:

  1. Asus N53SN laptop with Gigabyte GP-WB867MD-I 802.11ac mini-pcie adapter(aka Intel ac7260)
  2. Motorola Nexus 6
  3. Huawei Nexus 6P
  4. Huawei Honor 6Plus
yuhhaurlin commented 6 years ago

Yes. It looks the problem is related to hand held devices. Thanks.

WildByDesign commented 6 years ago

@wongsyrone Interesting, I also have an Intel ac7260 card (ultrabook) on my network as well and experience the same "MEMAddrAccess" issue approx. every 2-3 days. Also have two iPhones and two iPads.

"MEMAddrAccess" occurs on LEDE Stable and trunk. Trunk eventually crashes, while stable keeps running with high load but does not crash.

wongsyrone commented 6 years ago

It seems @yuhhaurlin can use Intel ac7260 to make things easier.

yuhhaurlin commented 6 years ago

Can you just use ac7260 to see if you will encounter problem?

lucize commented 6 years ago

I'm using exclusively 5ghz, only when I troubleshoot a foreign device that does not support 5Ghz I'm using 2,4. I have an Realtek AC USB device, I could try unplug it several times and reconnect it after specified amount of time

gsustek commented 6 years ago

Hi, some more observations. i left from my flat a hour ago and , iphone 7 is in my flat, and ipad Air 2 is with me 3 km east. All other phoines, laptops are aslo with me(android and intel wifi cards), ipad is still connected ?!?!

root@3200acm:~# iw wlan0 station dump Station ipadAIR (on wlan0) inactive time: 4302830 ms rx bytes: 1084687 rx packets: 8686 tx bytes: 2236225 tx packets: 7364 tx retries: 0 tx failed: 0 rx drop misc: 0 signal: -60 dBm signal avg: -38 dBm tx bitrate: 585.1 MBit/s VHT-MCS 6 80MHz short GI VHT-NSS 2 rx bitrate: 866.7 MBit/s VHT-MCS 9 80MHz short GI VHT-NSS 2 authorized: yes authenticated: yes associated: yes preamble: long WMM/WME: yes MFP: no TDLS peer: no DTIM period: 2 beacon interval:100 short slot time:yes connected time: 40026 seconds Station iphone 7 (on wlan0) inactive time: 49300 ms rx bytes: 9162385 rx packets: 44328 tx bytes: 264692682 tx packets: 181411 tx retries: 0 tx failed: 0 rx drop misc: 0 signal: -55 dBm signal avg: -57 dBm tx bitrate: 866.7 MBit/s VHT-MCS 9 80MHz short GI VHT-NSS 2 rx bitrate: 585.0 MBit/s VHT-MCS 7 80MHz VHT-NSS 2 authorized: yes authenticated: yes associated: yes preamble: long WMM/WME: yes MFP: no TDLS peer: no DTIM period: 2 beacon interval:100 short slot time:yes connected time: 6622 seconds

lucize commented 6 years ago

is a known bug, if you don't disable the wireless on the phone/tablet before going out of range, the stations will still be shown as connected

WildByDesign commented 6 years ago

I have been trying to reproduce the issue for approx. "2d 12h 15m" now with only the Intel ac7260 chipset. I have kept all of the Apple mobile devices off the network during this time. However, I have not yet been able to reproduce the problem with only the Intel ac7260 chipset. I will keep testing and see what happens.

lucize commented 6 years ago

it's normal that rmmod mwlwifi to end in crash ? I tried it after issuing wifi down after the driver hanged and after a fresh restart, in every case, the kernel panicked

WildByDesign commented 6 years ago

Current uptime is approx. "3d 11h 32m" now with only the Intel chipset on the network. I am starting to believe that the issue is more to do with the mobile devices (which I have kept off the network during this test). In my case, specifically two iPads and two iPhones.

ratsputin commented 6 years ago

@WildByDesign Generally speaking, when only one iDevice leaves my home then returns at least an hour later, it won't crash. We have to have two leave and return at the same time. It doesn't, however, appear to be restricted to iDevices, though. @michaelmcmaster's dumps he recently provided had Sony, Intel and Motorola MAC addresses.

It seems odd that it would be restricted to mobile devices only, though. Unless, of course, it's some sort of race condition that requires that two previously-registered devices return and reregister at least an hour later, close to the same time. That would be sort of unusual for two laptops.

lucize commented 6 years ago

for me only one device is enough, current uptime before crash 22h, but for sure the test firmware lasted longer

Chadster766 commented 6 years ago

I'm not seeing any crashes on the McDebian beta using the latest mwlwifi commit. Maybe the issue is specific to LEDE.

https://github.com/Chadster766/McDebian/issues

aaron1959 commented 6 years ago

It's not just LEDE, DD-WRT has this issue as well still.

yuhhaurlin commented 6 years ago

@lucize The device is not hand held device?

aaron1959 commented 6 years ago

These crashes will not happen if you are not going in and out of range with certain mobile devices, I know for sure Samsung S7 with Qualcomm chipset will crash it. If your not using these mobile devices and not going in and out of range then this will last forever with no issues.

yuhhaurlin commented 6 years ago

Thanks.

lucize commented 6 years ago

@yuhhaurlin it is, either one of the phones, I tried with the usb stick many times and after various hours and couldn't make it crash, if I unplug the device from usb, the router still thinks that is connected , if I click disconnect from the wifi interface, then it will be cleanly removed from the status of the router. Unfortunately I don't know how to disable wireless power sawing on the android phones that I use daily, none of them are rooted. I have a motorola droid 4 and a photon q that can be rooted and I know a bit about those and made some commits on the kernel and userspace, I'll try wit them there are times when it crashes and no notebook where connected to wireless, only phones

aaron1959 commented 6 years ago

I don't have a specific setting for power saving for wireless, but there is a (general) power saving mode that does change the way wireless works as well as a few other things such as display timeout etc.. so I have tried both modes and both seem to trigger this issue in about the same amount of loss/reacquisition scenarios.
These newer phones are using Android Nougat.

Edit: None of the phones I'm using are Rooted yet.

lucize commented 6 years ago

or disable the WMM Mode from the AP?

aaron1959 commented 6 years ago

I can put the S7 in a large stainless pot with lid and walk about 300 yards ~ 300 meters and lose signal then take off lid and require. Do this and wait for about one half hour to 45 minutes to reacquire and the router will crash.

ratsputin commented 6 years ago

@lucize and @aaron1959 how recent are your builds? It appears mine has been more stable since the most recent change by @yuhhaurlin--at least I've not seen an instance of a "single device leaves/returns" crash. All of the crashes I've seen have required two devices.

lucize commented 6 years ago

almost after every crash I update the HEAD

yuhhaurlin commented 6 years ago

Thanks for all of your information. Now we can make sure only hand held device will make this problem happened.

aaron1959 commented 6 years ago

A single samsung S7 will crash it every time without fail. @lucize I just disabled WMM on the atho and the connection dropped and when it came back speed was 54 Mbs and said Legacy, I swithced wmm on and speed went back to 866 Mbs VHT80SGI.

yuhhaurlin commented 6 years ago

Please don't disable WMM.

yuhhaurlin commented 6 years ago

I will try to unplug antenna of my device to see if I can reproduce this problem easily.

aaron1959 commented 6 years ago

It has to completely drop connection and then wait about an hour when it tries to reconnect the router will start having issues first all 5 Ghz traffic stops, then web UI becomes unresponsive at times. sometimes if the UI is responding turning off ath0 and turning back on will restart ath0 and connections will be fine for awhile. this has happened twice to me.

gsustek commented 6 years ago

ieee80211 phy0: staid 1 deleted [17607.339225] ieee80211 phy0: cmd 0x9122=UpdateEncryption timed out [17607.345348] ieee80211 phy0: return code: 0x1122 [17607.349908] ieee80211 phy0: timeout: 0x1122 [17607.354109] wlan0: failed to set key (0, a8:5c:2c:44:6f:66) to hardware (-5) [17622.319290] ieee80211 phy1: staid 2 deleted [17687.647667] ieee80211 phy1: cmd 0x8050=broadcast_ssid_enable timed out [17687.654227] ieee80211 phy1: return code: 0x0550 [17687.658788] ieee80211 phy1: timeout: 0x0050 [17691.653923] ieee80211 phy1: cmd 0x8127=SetInformationElements timed out [17691.660577] ieee80211 phy1: return code: 0x0127 [17691.665126] ieee80211 phy1: timeout: 0x0127 [17695.662361] ieee80211 phy1: cmd 0x801c=80211RadioControl timed out [17695.668579] ieee80211 phy1: return code: 0x001c [17695.673127] ieee80211 phy1: timeout: 0x001c [17699.665677] ieee80211 phy1: cmd 0x8126=SetFixedRate timed out [17699.671461] ieee80211 phy1: return code: 0x0126 [17699.676009] ieee80211 phy1: timeout: 0x0126 [17874.066182] ieee80211 phy1: cmd 0x9143=GetSeqno timed out [17874.071609] ieee80211 phy1: return code: 0x1143 [17874.076169] ieee80211 phy1: timeout: 0x1143 [17907.359392] ieee80211 phy0: cmd 0x9143=GetSeqno timed out [17907.364825] ieee80211 phy0: return code: 0x1143 [17907.369373] ieee80211 phy0: timeout: 0x1143 [17927.363873] ieee80211 phy1: cmd 0x9143=GetSeqno timed out [17927.369304] ieee80211 phy1: return code: 0x1143 [17927.373852] ieee80211 phy1: timeout: 0x1143 [17931.423828] ieee80211 phy1: cmd 0x9143=GetSeqno timed out [17931.429253] ieee80211 phy1: return code: 0x1143 [17931.433801] ieee80211 phy1: timeout: 0x1143 root@3200acm:~#

Chadster766 commented 6 years ago

@gsustek please post your /var/run/hostapd-phy*.conf files so I can try replicate this issue as too.

lucize commented 6 years ago

hostapd-phy0.conf.txt you can test with this

Chadster766 commented 6 years ago

Thanks I will test it and report back results.

thagabe commented 6 years ago

To all the people having these issues, have you tried re-downloading the source and rebuild the toolchain?

wongsyrone commented 6 years ago

Always compiling from source and using GCC 7.2

Chadster766 commented 6 years ago

@lucize I can't test the hostapd config lines below because they are strictly OpenWRT customization patches for mwlwifi or ieee80211:

Line 13: unknown configuration item 'ht_coex'
Line 17: unknown configuration item 'wpa_disable_eapol_key_retries'
lucize commented 6 years ago

those options shouldn't matter

ratsputin commented 6 years ago

Updating toolchain and rebuilding from source using gcc 6.3 every time there's a crash.

thagabe commented 6 years ago

maybe try GCC 7.2, that's what I've been using on Fedora, Ubuntu 16.04 and 17.10. I remember 6.X giving me bugs on debian which is why i moved to ubuntu 16.04

BrainSlayer commented 6 years ago

that doesnt makes a difference. dd-wrt uses 7.2 for instance and the same issue is present

aaron1959 commented 6 years ago

And I am experiencing this issue with GCC 7.2 dd-wrt.

The Issue only occurs under very specific circumstances, all listed above. Something needs to be fixed that only one of us has control over. These other ideas are great but to me it's like putting new tires on a car with a blown engine and constantly changing the tires hoping the car suddenly starts running.

thagabe commented 6 years ago

@BrainSlayer noted

ratsputin commented 6 years ago

@aaron1959 I believe the issue is that the problem isn't easy to reproduce universally. On my build, one device leaving for 1.5 hours and coming back will not crash the router anymore (it hasn't for the last couple of weeks). Only if two devices leave and return simultaneously does the router crash. Case-in-point, my router has been up since Tuesday morning with no issues, as only individual devices have left and returned. Both of us will be leaving this evening, so I expect the router will, in fact, crash.