kaloz / mwlwifi

mac80211 driver for the Marvell 88W8864 802.11ac chip

5GHz WiFi drops with error cmd 0x9125=BAStream timed out #133

Closed gsustek closed 7 years ago

gsustek commented 7 years ago

Hi, on LEDE I got this error during normal browsing. Only an iPad and an iPhone are connected to the 5GHz WiFi.

http://pastebin.com/nEzDquTe

thagabe commented 7 years ago

Could you provide:

gsustek commented 7 years ago

root@1900acs:~# cat /sys/kernel/debug/ieee80211/phy0/mwlwifi/info

driver name: mwlwifi
chip type: 88W8864
hw version: 7
driver version: 10.3.2.0-20161222
firmware version: 0x0702091a
power table loaded from dts: yes
firmware region code: 0x0
mac address: 00:25:9c:13:b4:b0
2g: disable
5g: enable
antenna: 4 4
irq number: 104
iobase0: e0f80000
iobase1: e1100000
tx limit: 768
rx limit: 64
ap macid support: 0000ffff
sta macid support: 00010000
macid used: 00000001
qe trigger number: 17331543

root@1900acs:~# uname -a
Linux 1900acs 4.4.39 #0 SMP Wed Dec 28 09:35:02 2016 armv7l GNU/Linux
root@1900acs:~# cat /etc/openwrt_release
DISTRIB_ID='Lede-1900acs'
DISTRIB_RELEASE='SNAPSHOT'
DISTRIB_REVISION='r2709-b7677f0'
DISTRIB_CODENAME='reboot'
DISTRIB_TARGET='mvebu/generic'
DISTRIB_DESCRIPTION='Lede-1900acs Reboot SNAPSHOT r2709-b7677f0'
DISTRIB_TAINTS='no-all busybox'
root@1900acs:~#
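For anyone comparing several of these info dumps, a minimal Python sketch like the following could turn the output into a dictionary. It assumes the debugfs file prints one `key: value` field per line; the function name `parse_mwlwifi_info` and the shortened sample are illustrative only and not part of the driver.

```python
def parse_mwlwifi_info(text):
    """Parse 'key: value' lines (one field per line, an assumption) into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as MAC addresses stay intact.
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that contain no colon at all
            info[key.strip()] = value.strip()
    return info


# Shortened sample taken from the output posted above.
SAMPLE = """\
driver name: mwlwifi
chip type: 88W8864
driver version: 10.3.2.0-20161222
firmware version: 0x0702091a
mac address: 00:25:9c:13:b4:b0
"""

info = parse_mwlwifi_info(SAMPLE)
print(info["driver version"], info["firmware version"])
```

On a router, the text would come from reading `/sys/kernel/debug/ieee80211/phy0/mwlwifi/info` instead of the inline sample.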

yuhhaurlin commented 7 years ago

This kind of problem has been reported from time to time. Can you run a pre-built image from the community? Or can you let me know how you built the image you tested? I will try to build the image and verify it.

yuhhaurlin commented 7 years ago

Has anyone else encountered the same problem? If so, please let me know which OpenWrt or LEDE commit you used to build your image and the test environment you used to reproduce the problem. Thanks.

ghost commented 7 years ago

I can confirm the problem, but I do not know what triggered it:

Jan 10 16:08:36 OpenWrt hostapd: wlan0: STA bc:76:XX IEEE 802.11: authenticated
Jan 10 16:08:37 OpenWrt hostapd: wlan0: STA bc:76:XX IEEE 802.11: associated (aid 1)
Jan 10 16:08:37 OpenWrt hostapd: wlan0: STA bc:76::XX RADIUS: starting accounting session D5AE3AEB4AB7958E
Jan 10 16:08:37 OpenWrt hostapd: wlan0: STA bc:76:XX WPA: pairwise key handshake completed (RSN)
Jan 10 16:09:10 OpenWrt kernel: [28779.526410] ieee80211 phy0: cmd 0x9125=BAStream timed out
Jan 10 16:09:10 OpenWrt kernel: [28779.531847] ieee80211 phy0: return code: 0x1125
Jan 10 16:09:10 OpenWrt kernel: [28779.536399] ieee80211 phy0: timeout: 0x1125
Jan 10 16:09:10 OpenWrt kernel: [28779.540613] ieee80211 phy0: destroy ba failed execution
Jan 10 16:10:00 OpenWrt crond[1521]: USER root pid 3309 cmd /sbin/fan_ctrl.sh
Jan 10 16:10:03 OpenWrt kernel: [28832.231404] ieee80211 phy0: cmd 0x9122=UpdateEncryption timed out
Jan 10 16:10:03 OpenWrt kernel: [28832.237533] ieee80211 phy0: return code: 0x1122
Jan 10 16:10:03 OpenWrt kernel: [28832.242081] ieee80211 phy0: timeout: 0x1122
Jan 10 16:10:03 OpenWrt kernel: [28832.246279] ieee80211 phy0: failed execution
Jan 10 16:10:03 OpenWrt kernel: [28832.250571] wlan0: failed to remove key (1, ff:ff:ff:ff:ff:ff) from hardware (-5)
Jan 10 16:10:07 OpenWrt kernel: [28836.257323] ieee80211 phy0: cmd 0x9122=UpdateEncryption timed out
Jan 10 16:10:07 OpenWrt kernel: [28836.263449] ieee80211 phy0: return code: 0x1122
Jan 10 16:10:07 OpenWrt kernel: [28836.268010] ieee80211 phy0: timeout: 0x1122
Jan 10 16:10:07 OpenWrt kernel: [28836.272213] ieee80211 phy0: failed execution

At this point, the Wifi network can no longer be detected by clients.
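To spot this failure pattern in a longer syslog, a small Python sketch along these lines could count the firmware command timeouts per radio. The regular expression is derived only from the log lines quoted above, so other message formats would be missed; `count_timeouts` is a hypothetical helper name, not something the driver or LEDE provides.

```python
import re
from collections import Counter

# Matches driver lines such as:
#   ieee80211 phy0: cmd 0x9125=BAStream timed out
TIMEOUT_RE = re.compile(r"ieee80211 (phy\d+): cmd (0x[0-9a-fA-F]+)=(\w+) timed out")


def count_timeouts(log_text):
    """Count 'timed out' firmware commands, keyed by (phy, command name)."""
    counts = Counter()
    for match in TIMEOUT_RE.finditer(log_text):
        phy, _code, name = match.groups()
        counts[(phy, name)] += 1
    return counts


# A few lines copied from the log posted above.
SAMPLE = """\
Jan 10 16:09:10 OpenWrt kernel: [28779.526410] ieee80211 phy0: cmd 0x9125=BAStream timed out
Jan 10 16:09:10 OpenWrt kernel: [28779.531847] ieee80211 phy0: return code: 0x1125
Jan 10 16:10:03 OpenWrt kernel: [28832.231404] ieee80211 phy0: cmd 0x9122=UpdateEncryption timed out
Jan 10 16:10:07 OpenWrt kernel: [28836.257323] ieee80211 phy0: cmd 0x9122=UpdateEncryption timed out
"""

timeouts = count_timeouts(SAMPLE)
for (phy, name), n in sorted(timeouts.items()):
    print(f"{phy} {name}: {n}")
```

In this excerpt the first timeout is the BAStream command and the later ones are UpdateEncryption timeouts.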

ghost commented 7 years ago

oh, and I forgot: Lede Version: Goliath IV (CURRENT, r2446-9791fb2)

root@OpenWrt:~# cat /sys/kernel/debug/ieee80211/phy0/mwlwifi/info

driver name: mwlwifi
chip type: 88W8964
hw version: 7
driver version: 10.3.2.0-20161124
firmware version: 0x07080004
power table loaded from dts: no
firmware region code: 0x10
mac address: 60:38:XX
2g: disable
5g: enable
antenna: 4 4
irq number: 105
iobase0: e1000000
iobase1: e1280000
tx limit: 768
rx limit: 64
ap macid support: 0000ffff
sta macid support: 00010000
macid used: 00000001
qe trigger number: 20973

root@OpenWrt:~# uname -a
Linux OpenWrt 4.4.36 #0 SMP Fri Dec 9 16:16:43 2016 armv7l GNU/Linux

yuhhaurlin commented 7 years ago

Can you let me know how you built the code and how you encountered this problem?

yuhhaurlin commented 7 years ago

I think I will build an image from the latest code of the LEDE main trunk (https://git.lede-project.org/source.git) and run some stress tests to see if I can reproduce this problem. If anyone knows exactly how to reproduce this problem, please let me know. Thanks.

yuhhaurlin commented 7 years ago

I just built an image from the LEDE main trunk (https://git.lede-project.org/source.git, commit 96a9403 "tools: libressl: always build as PIC"). Please use this image to test on your WRT1900ACS. If you need an image for other devices, please let me know.

WRT1900ACS: https://drive.google.com/open?id=0B3qLWtcWB9EdazNGNi1nZG5HOGM

root@lede:/# cat /sys/kernel/debug/ieee80211/phy0/mwlwifi/info

driver name: mwlwifi
chip type: 88W8864
hw version: 7
driver version: 10.3.2.0-20161222
firmware version: 0x0702091a
power table loaded from dts: no
firmware region code: 0x10
mac address: 00:50:43:21:bd:e9
2g: disable
5g: enable
antenna: 4 4
irq number: 105
iobase0: f0e00000
iobase1: f0f80000
tx limit: 768
rx limit: 64
ap macid support: 0000ffff
sta macid support: 00010000
macid used: 00000000
qe trigger number: 0

root@lede:/# uname -a
Linux lede 4.4.40 #0 SMP Tue Jan 10 21:15:37 2017 armv7l GNU/Linux
root@lede:/#

gsustek commented 7 years ago

For me it happened while scrolling through a YouTube video on an iPad Air 2. I use openssl, not libressl. Do you want my diffconfig?

yuhhaurlin commented 7 years ago

Can you try the image I just posted? I hope we can test on the same version of the code. It would be easier for me to check the problem. Thanks.

gsustek commented 7 years ago

I have the same LEDE commit, but a really heavy .config. I could test it, but I need to have my .config during the build.

yuhhaurlin commented 7 years ago

Can you try the image first? I think other upper layer configurations should not affect the function of WiFi.

gsustek commented 7 years ago

I will try. Please explain: how does your build differ from a vanilla build of commit 96a9403?

yuhhaurlin commented 7 years ago

I got the latest source code from https://git.lede-project.org/source.git and this is the latest commit. My previous LEDE build is from around October 2016. I hope we can start from the latest commit and check whether this problem happens on it.

yuhhaurlin commented 7 years ago

Another way: you can tell me how you built your image. I can build the image the same way and use your settings to reproduce this problem.

gsustek commented 7 years ago

Here is my .mydiffconfig, compiled on Ubuntu 16.10 with kernel 4.9.2.

http://s000.tinyupload.com/index.php?file_id=58464231352945797284

yuhhaurlin commented 7 years ago

I have a problem downloading the file.

gsustek commented 7 years ago

Try this one: http://www.filedropper.com/mydiffconfigprintingtar

ghost commented 7 years ago

I am using davidc's Lede builds from https://davidc502sis.dynamic-dns.net/index.shtml for the 3200acm.

yuhhaurlin commented 7 years ago

@bkobi The WRT3200ACM has a problem.

yuhhaurlin commented 7 years ago

@gsustek I got your diff for the configuration. I will try to configure my LEDE like yours and run tests. BTW, can you confirm whether you encounter this problem with the image I just posted? Thanks.

kberanek commented 7 years ago

I'm seeing what looks like the same issue with my own builds on a 1900ACS. I sent an email to the lede-dev mailing list with logs that you can find here: http://lists.infradead.org/pipermail/lede-dev/2017-January/005249.html

I'd be glad to provide any additional information that might be helpful. I'm currently in the process of trying a new build using LEDE commit 1ad30be982e953a36e4677d3022248a962f039b9.

yuhhaurlin commented 7 years ago

@kberanek

  1. Can you try the image I sent out first?
  2. Do you remember the previous version of the mwlwifi driver that ran on your stable LEDE version? If so, can you build that version of the driver and run it on your current unstable version of LEDE to see whether you encounter the same problem?

Thanks.

kberanek commented 7 years ago

@yuhhaurlin Thanks for the quick response.

I don't know for sure, but based on my recollection of having built the working image in early-to-mid December I assume the working build used 10.3.2.0-20161013 based on the LEDE commit history.

I just flashed an image using 10.3.2.0-20170110 and I'll let that run for a bit to see how it does. If it doesn't seem to fix things then, yes, I will try reverting back to 10.3.2.0-20161013 and see how that goes. I've discovered that streaming on my FireTV stick seems to be the main thing that triggers this issue, so hopefully I should know relatively quickly if the new build fixes things. I will, however, be travelling the next two days so I probably won't be able to provide any feedback until this weekend, but I will definitely provide an update this weekend.

Re #1: I'm sorry, but I'm not willing to run arbitrary images. Commit 96a9403 looks like it would contain the same version of mwlwifi that I have definitely had issues with (this is the version the logs I linked to earlier came from), but if you'd like me to try that commit specifically I can definitely build an image from that commit id.

yuhhaurlin commented 7 years ago

@kberanek Thanks.

yuhhaurlin commented 7 years ago

@gsustek I configured my LEDE based on your diff configuration file. Please check whether this configuration file includes all your packages. If so, I will build an image so you can reproduce your problem on it.

https://drive.google.com/open?id=0B3qLWtcWB9EdYmN2Uk8zc3VLVTQ

thagabe commented 7 years ago

@yuhhaurlin I vaguely recall that building on Ubuntu releases other than LTS would sometimes introduce bugs or errors into the build. As a rule of thumb, LEDE, and especially OpenWrt, should be built on Ubuntu 14.04 or 16.04 or on Debian Jessie (Wheezy is kind of old).

kberanek commented 7 years ago

The image with 10.3.2.0-20170110 worked fine for a couple days and I actually even found the original good build from earlier. It turns out it was using 10.3.2.0-20161011. I then tried flashing a variety of images including the previous known-good version and every one of them exhibits this new strange behavior where it works fine for a short period of time and then wifi clients gradually see reduced throughput and eventually lose all internet connectivity while ethernet-attached clients seem to work fine. I can't find any interesting logs either.

There's obviously something else going on that's not related to the version of mwlwifi or LEDE, but I haven't been able to figure out what that might be.

yuhhaurlin commented 7 years ago

@kberanek Do you mean you have a version of LEDE on which you encounter this problem no matter which version of mwlwifi is running? If so, can you let me know how you built your LEDE image (source tree and which commit you used)? BTW, please also let me know which device you use. I want to build the LEDE image the same way and let you reproduce this problem on your device. Then I can try to reproduce the problem on my DB board and investigate it. Thanks.

gsustek commented 7 years ago

@yuhhaurlin Here is my full config: http://www.filedropper.com/configmwlwifitar. Just use it.

yuhhaurlin commented 7 years ago

@gsustek It is not the same as the previous diff configuration file you sent me. The previous diff configuration file uses CONFIG_BINUTILS_USE_VERSION_2_27=y and CONFIG_GCC_USE_VERSION_6=y, but this configuration file uses older versions. I have almost completed the build and basic tests for the image I built based on your previous diff configuration file. I will send it out for you to reproduce this problem, and I will ignore this configuration file. Thanks.
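A mismatch like this between two configuration files can be checked mechanically. The following Python sketch compares two `.config`-style texts and lists the symbols whose values differ. The two symbol names come from this thread, but the sample file contents are made up for illustration and are not the files that were actually exchanged; `parse_config` and `config_diff` are hypothetical helper names.

```python
import re

SET_RE = re.compile(r"^(CONFIG_[^=\s]+)=(.*)$")        # e.g. CONFIG_FOO=y
UNSET_RE = re.compile(r"^# (CONFIG_\S+) is not set$")  # e.g. # CONFIG_FOO is not set


def parse_config(text):
    """Map each CONFIG_ symbol to its value, or to None when marked 'is not set'."""
    symbols = {}
    for line in text.splitlines():
        line = line.strip()
        match = SET_RE.match(line)
        if match:
            symbols[match.group(1)] = match.group(2)
            continue
        match = UNSET_RE.match(line)
        if match:
            symbols[match.group(1)] = None
    return symbols


def config_diff(a, b):
    """Return {symbol: (value_in_a, value_in_b)} for symbols that differ.

    A symbol that is absent and one that is 'not set' are treated as equal.
    """
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Illustrative inputs only (assumed contents, not the real files).
DIFFCONFIG = """\
CONFIG_TARGET_mvebu=y
CONFIG_BINUTILS_USE_VERSION_2_27=y
CONFIG_GCC_USE_VERSION_6=y
"""
FULLCONFIG = """\
CONFIG_TARGET_mvebu=y
# CONFIG_BINUTILS_USE_VERSION_2_27 is not set
# CONFIG_GCC_USE_VERSION_6 is not set
"""

diff = config_diff(parse_config(DIFFCONFIG), parse_config(FULLCONFIG))
for symbol, (left, right) in diff.items():
    print(symbol, left, right)
```

Only the toolchain symbols show up in the result here, because the target symbol is identical in both inputs.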

yuhhaurlin commented 7 years ago

@gsustek Please try to reproduce the problem on the following image and let me know your settings and how you reproduced it. Thanks.

WRT1900ACS: https://drive.google.com/open?id=0B3qLWtcWB9EdazNGNi1nZG5HOGM

root@lede:/# cat /sys/kernel/debug/ieee80211/phy0/mwlwifi/info

driver name: mwlwifi
chip type: 88W8864
hw version: 7
driver version: 10.3.2.0-20161222
firmware version: 0x0702091a
power table loaded from dts: no
firmware region code: 0x10
mac address: 00:50:43:21:bd:e9
2g: disable
5g: enable
antenna: 4 4
irq number: 105
iobase0: f0e80000
iobase1: f1000000
tx limit: 768
rx limit: 64
ap macid support: 0000ffff
sta macid support: 00010000
macid used: 00000001
qe trigger number: 4507

root@lede:/# uname -a
Linux lede 4.4.40 #0 SMP Tue Jan 10 21:15:37 2017 armv7l GNU/Linux

gsustek commented 7 years ago

@yuhhaurlin I always use the latest binutils and the latest GCC that LEDE provides. It seems that the command "run scripts/diffconfig.sh > mydiffconfig" may not have picked up these items.

What did you change in this firmware that I need to test? Some additional debug output?

yuhhaurlin commented 7 years ago

@gsustek No. I just made sure my configuration file is the same as yours (based on the previous diff configuration file, not the latest full configuration file). If you can reproduce the problem on this image, I will use your settings to try to reproduce it on my DB board. With the DB board, I can check from the firmware side to see if I can fix this problem.

gsustek commented 7 years ago

@yuhhaurlin OK, I will test that firmware, but I must warn you that it could take some time for the error to reappear.

yuhhaurlin commented 7 years ago

@gsustek All right. Just try to confirm that the problem happens, so I can use your settings to try to reproduce it on my DB board. I don't want to keep seeing this kind of problem reported from time to time. Thanks for your help.

kberanek commented 7 years ago

@yuhhaurlin I have a WRT1900ACS.

I've included a table of the various configurations that I've run along with results below. I seem to have found a stable build (LEDE b9a408c2b49ccfa0e906bda00ef77f4002e401fd, mwlwifi 10.3.2.0-20170110) with the exception that it doesn't seem to work reliably with multiple ssids on the 5GHz radio (not sure about 2.4GHz).

The really strange thing to me is the behavior of tests 4/5 and the ones missing from the list that I didn't record details for. Even the same image that worked in test 1 didn't work in test 5. I'm not sure how to explain that. This weirdness is what I was trying to describe in my post yesterday.

All of the tests before test 6 were using multiple ssids on the 5GHz radio and a fairly large number of packages. Test 6-7 just had a single ssid and worked fine, test 8 used multiple ssids and had issues with clients being able to see either ssid.

Based on these experiments, it seems like there's probably something besides just the mwlwifi version that's affecting things. Also, I see a fair number of recent mac80211 changes, so maybe those are related to the recent builds being relatively stable (minus the multiple ssid issues). It would also seem that multiple ssids on the same radio has regressed because it definitely used to work without issues, but now it's really hit or miss for most clients and doesn't work at all for one client. I've included the /etc/config/wireless for tests 6/7/8 below as well.

For some reason I haven't been able to build another image from fd718c50256536ed082852d735dc7690d7604418, but that's the only version I've seen this error message from. Maybe someone else could repro this issue with that commit.

test #  LEDE commit                                 mwlwifi version         notes
-----------------------------------------------------------------------------------------
1       93715427835e747f0e0b348c8a3ce91dd68ef4f9    10.3.2.0-20161011       worked without issues for ~4 weeks until upgraded
2       fd718c50256536ed082852d735dc7690d7604418    10.3.2.0-20161222       only build that produced the 'cmd 0x9125=BAStream timed out' error
3       1ad30be982e953a36e4677d3022248a962f039b9    10.3.2.0-20170110       worked without issues for ~3 days until upgraded
4       1ad30be982e953a36e4677d3022248a962f039b9    10.3.2.0-20161013       wifi performance slowly degrades
5       93715427835e747f0e0b348c8a3ce91dd68ef4f9    10.3.2.0-20161011       wifi performance slowly degrades (same image as test 1)
    ... flashed a few other combinations - they all exhibited the slow degradation of wifi issue ...
6       b9a408c2b49ccfa0e906bda00ef77f4002e401fd    10.3.2.0-20170110       very minimal config - stable
7       b9a408c2b49ccfa0e906bda00ef77f4002e401fd    10.3.2.0-20170110       same as test 5 plus dnscrypt-proxy
8       b9a408c2b49ccfa0e906bda00ef77f4002e401fd    10.3.2.0-20170110       same image as test 6 with multiple ssids on 5GHz radio - clients had trouble seeing either ssid

------------------------
diffconfig for test 6:
------------------------
CONFIG_TARGET_mvebu=y
CONFIG_TARGET_mvebu_DEVICE_linksys-wrt1900acs=y
CONFIG_TARGET_BOARD="mvebu"

------------------------
diffconfig for test 7/8:
------------------------
CONFIG_TARGET_mvebu=y
CONFIG_TARGET_mvebu_DEVICE_linksys-wrt1900acs=y
CONFIG_TARGET_BOARD="mvebu"
CONFIG_LIBSODIUM_MINIMAL=y
CONFIG_PACKAGE_dnscrypt-proxy=y
CONFIG_PACKAGE_dnscrypt-proxy-resolvers=y
CONFIG_PACKAGE_libsodium=y

------------------------
/etc/config/wireless
------------------------
config wifi-device 'radio0'
    option type 'mac80211'
    option hwmode '11a'
    option path 'soc/soc:pcie-controller/pci0000:00/0000:00:01.0/0000:01:00.0'
    option htmode 'VHT80'
    option country 'US'
    option txpower '23'
    option channel 'auto'

config wifi-iface 'default_radio0'
    option device 'radio0'
    option network 'lan'
    option mode 'ap'
    option macaddr 'XX:XX:XX:XX:XX:XX'
    option encryption 'psk2+ccmp'
    option ssid 'foo'
    option key 'password'

config wifi-device 'radio1'
    option type 'mac80211'
    option channel '11'
    option hwmode '11g'
    option path 'soc/soc:pcie-controller/pci0000:00/0000:00:02.0/0000:02:00.0'
    option htmode 'HT20'
    option country 'US'
    option txpower '30'

config wifi-iface 'default_radio1'
    option device 'radio1'
    option mode 'ap'
    option macaddr 'XX:XX:XX:XX:XX:XX'
    option encryption 'psk2+ccmp'
    option key 'password'
    option ssid 'guest'
    option network 'guest'
    option isolate 1

### Uncommenting this causes problems
#config wifi-iface
#   option device 'radio0'
#   option mode 'ap'
#   option ssid 'bar'
#   option network 'lan'
#   option encryption 'psk2+ccmp'
#   option key 'password'
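
Since the trouble in these tests tracks with having more than one SSID on a radio, a quick way to check a config for that is to count `wifi-iface` sections per `device`. The Python sketch below does this with a deliberately simplified parser. It is not the real UCI parser (on a router the `uci` tool is the proper interface), `ifaces_per_radio` is a hypothetical helper name, and the sample is a trimmed version of the config above with the extra 'bar' interface uncommented.

```python
import shlex
from collections import Counter


def ifaces_per_radio(uci_text):
    """Count 'config wifi-iface' sections per 'option device' value."""
    counts = Counter()
    in_iface = False
    for raw in uci_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out sections
        words = shlex.split(line)  # handles the single-quoted values
        if words[0] == "config":
            in_iface = len(words) > 1 and words[1] == "wifi-iface"
        elif in_iface and len(words) > 2 and words[:2] == ["option", "device"]:
            counts[words[2]] += 1
    return counts


SAMPLE = """\
config wifi-iface 'default_radio0'
    option device 'radio0'
    option ssid 'foo'

config wifi-iface 'default_radio1'
    option device 'radio1'
    option ssid 'guest'

config wifi-iface
    option device 'radio0'
    option ssid 'bar'
"""

counts = ifaces_per_radio(SAMPLE)
multi = [radio for radio, n in counts.items() if n > 1]
print(multi)
```

With the sample above, `radio0` is the only radio reported as carrying more than one SSID.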

yuhhaurlin commented 7 years ago

@kberanek Thanks for the information. From your test results, the problem only happened with commit fd718c50256536ed082852d735dc7690d7604418.

@gsustek If you can't reproduce the problem on the image I just posted, maybe you can build the latest code to see whether you still encounter the problem. Thanks.

gsustek commented 7 years ago

@yuhhaurlin OK, I will build from the latest commit then.

kevinjos commented 7 years ago

I am running into the same kern.err on a WRT3200ACM with LEDE Reboot (17.01.0-rc1, r3042-ec095b5). My setup includes 2 SSIDs per antenna on isolated networks. I notice the 5GHz antenna goes down about every 24 hours. I am then able to connect to the 2.4GHz SSID for a few seconds before being disconnected. On occasion that is enough time to reboot the router. The eth0 interface appears unaffected.

Fri Feb  3 00:40:57 2017 kern.err kernel: [38159.651041] ieee80211 phy0: cmd 0x9125=BAStream timed out
Fri Feb  3 00:40:57 2017 kern.err kernel: [38159.656471] ieee80211 phy0: return code: 0x1125
Fri Feb  3 00:40:57 2017 kern.err kernel: [38159.661020] ieee80211 phy0: timeout: 0x1125
Fri Feb  3 00:40:57 2017 kern.err kernel: [38159.665223] ieee80211 phy0: destroy ba failed execution
Fri Feb  3 00:41:28 2017 kern.err kernel: [38189.998895] ieee80211 phy0: cmd 0x801d=MEMAddrAccess timed out
Fri Feb  3 00:41:28 2017 kern.err kernel: [38190.004769] ieee80211 phy0: return code: 0x001d
Fri Feb  3 00:41:28 2017 kern.err kernel: [38190.009319] ieee80211 phy0: timeout: 0x001d
Fri Feb  3 00:41:28 2017 kern.err kernel: [38190.013527] ieee80211 phy0: failed execution
Fri Feb  3 00:41:32 2017 kern.err kernel: [38194.016866] ieee80211 phy0: cmd 0x801d=MEMAddrAccess timed out
Fri Feb  3 00:41:32 2017 kern.err kernel: [38194.022746] ieee80211 phy0: return code: 0x001d
Fri Feb  3 00:41:32 2017 kern.err kernel: [38194.027312] ieee80211 phy0: timeout: 0x001d
Fri Feb  3 00:41:32 2017 kern.err kernel: [38194.031513] ieee80211 phy0: failed execution
Fri Feb  3 00:41:36 2017 kern.err kernel: [38198.034840] ieee80211 phy0: cmd 0x801d=MEMAddrAccess timed out
Fri Feb  3 00:41:36 2017 kern.err kernel: [38198.040701] ieee80211 phy0: return code: 0x001d
Fri Feb  3 00:41:36 2017 kern.err kernel: [38198.045259] ieee80211 phy0: timeout: 0x001d
Fri Feb  3 00:41:36 2017 kern.err kernel: [38198.049473] ieee80211 phy0: failed execution

thagabe commented 7 years ago

@kevinjos The 3200ACM is confirmed to have problems. There is no need to troubleshoot it further: the problem has been reproduced and a fix is being worked on. Patience is key.

kevinjos commented 7 years ago

Thanks @thagabe! Where will the fix be announced?

akorolyov commented 7 years ago

Thanks, @thagabe, I am waiting for the solution. Patience is key =)

thagabe commented 7 years ago

@akorolyov @kevinjos The re-architecting of the driver is done; what remains is linking it to the stock firmware. Then we might have a working driver for the WRT3200ACM.

akorolyov commented 7 years ago

@thagabe I appreciate your work, thank you very much!

thagabe commented 7 years ago

@akorolyov Thank @yuhhaurlin, not me, haha.

akorolyov commented 7 years ago

@yuhhaurlin and @thagabe thank you! I've bought a powerful, cool new router and it fails a few times per day. Waiting for the fix...

jklap commented 7 years ago

@thegabe -- is the fix you mentioned applicable to all devices or just the WRT3200ACM?

thagabe commented 7 years ago

@jklap Well, since the restructuring of the driver (this repo's code) is for all devices, it should apply to all devices. The underlying firmware will be shared with the closed-source driver, so development efforts will be on par and bug hunting will benefit both. If the firmware works fine with, for example, DFS, then the new driver should work equally well.