greearb / ath10k-ct

Stand-alone ath10k driver based on Candela Technologies Linux kernel.

Firmware crash on turris when using stadia on a macbook #125

Closed tardyp closed 2 years ago

tardyp commented 4 years ago

This is a followup of the thread here: https://forum.turris.cz/t/kernel-module-crashed/8084/15

Description of the problem (how to configure, how to reproduce, how often it happens).

When playing Stadia over Wi-Fi, I get a firmware crash after a while. Stadia streams constantly at approximately 3 MB/s, and after about 45 minutes the firmware crashes. It might be overheating, but I have not confirmed this yet (I set up hwsensor logging, but have not been able to reproduce the crash since).

Software (OS, Firmware version, kernel, driver, etc)

Using the latest version of TurrisOS, with 'ath10k-firmware-qca988x-ct-htt'.

config:

config wifi-device 'radio0'
	option type 'mac80211'
	option macaddr 'XXX'
	option country 'FR'
	option legacy_rates '1'
	option hwmode '11a'
	option channel 'auto'
	option disabled '0'
	option htmode 'VHT80'

config wifi-iface 'default_radio0'
	option device 'radio0'
	option network 'lan'
	option mode 'ap'
	option encryption 'psk2+ccmp'
	option wpa_group_rekey '86400'
	option key 'XXX'
	option disabled '0'
	option ssid 'XXX'
	option hidden '0'

Hardware (NIC chipset, platform, etc)

Turris Omnia

Logs (dmesg, maybe supplicant and/or hostap)

[72770.933998] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
[72771.036395] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
[72771.138789] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
[72771.241182] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
[72771.343577] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
[72771.446091] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
[72771.453680] ath10k_pci 0000:02:00.0: Cannot communicate with firmware, attempting to fake crash and restart firmware.
[72771.464357] ath10k_pci 0000:02:00.0: firmware crashed! (uuid 9f9a7051-0600-47ad-8545-dbd03876c73b)
[72771.473365] ath10k_pci 0000:02:00.0: qca988x hw2.0 target 0x4100016c chip_id 0x043202ff sub 0000:0000
[72771.482625] ath10k_pci 0000:02:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 0
[72771.491973] ath10k_pci 0000:02:00.0: firmware ver 10.1-ct-8x-__fH-021-4fa9f30 api 2 features wmi-10.x,txstatus-noack,wmi-10.x-CT,ratemask-CT,regdump-CT,txrate-CT,flush-all-CT,pingpong-CT,ch-regs-CT,nop-CT,set-special-CT,get-temp-CT,tx-rc-CT,cust-stats-CT crc32 580c6146
[72771.515903] ath10k_pci 0000:02:00.0: board_file api 1 bmi_id N/A crc32 bebc7c08
[72771.523251] ath10k_pci 0000:02:00.0: htt-ver 2.2 wmi-op 2 htt-op 2 cal otp max-sta 128 raw 0 hwcrypto 1
[72771.534723] ath10k_pci 0000:02:00.0: firmware register dump:
[72771.540425] ath10k_pci 0000:02:00.0: [00]: 0x00940750 0x00400C00 0x00980000 0x009B6074
[72771.548370] ath10k_pci 0000:02:00.0: [04]: 0x009B60E4 0x009B6208 0x00941B6C 0x00941B20
[72771.556329] ath10k_pci 0000:02:00.0: [08]: 0x00941B00 0x009423A4 0x009422DC 0x009422C4
[72771.564285] ath10k_pci 0000:02:00.0: [12]: 0x00941B90 0x009423F4 0x009423D4 0x009423DC
[72771.572230] ath10k_pci 0000:02:00.0: [16]: 0x009423E4 0x009423EC 0x00942888 0x00942520
[72771.580188] ath10k_pci 0000:02:00.0: [20]: 0x009424FC 0x0094241C 0x00942540 0x00942498
[72771.588133] ath10k_pci 0000:02:00.0: [24]: 0x00942450 0x0094250C 0x009424D0 0x00942750
[72771.596076] ath10k_pci 0000:02:00.0: [28]: 0x00942798 0x00942844 0x009428C4 0x00942594
[72771.604042] ath10k_pci 0000:02:00.0: [32]: 0x0094258C 0x00942C28 0x00942F4C 0x00942F7C
[72771.612001] ath10k_pci 0000:02:00.0: [36]: 0x00942F90 0x00942FE4 0x00942FF8 0x00943040
[72771.619967] ath10k_pci 0000:02:00.0: [40]: 0x0094077C 0x00940790 0x00943084 0x00942F00
[72771.627912] ath10k_pci 0000:02:00.0: [44]: 0x009B6234 0x00942A80 0x00942D40 0x00942D64
[72771.635859] ath10k_pci 0000:02:00.0: [48]: 0x00942D78 0x00942D9C 0x00957E18 0x00957E28
[72771.643815] ath10k_pci 0000:02:00.0: [52]: 0x00957E20 0x00940788 0x009430EC 0x00000000
[72771.651758] ath10k_pci 0000:02:00.0: [56]: 0x00000000 0x00000000 0x00000000 0x00000000
[72771.659707] ath10k_pci 0000:02:00.0: Copy Engine register dump:
[72771.665666] ath10k_pci 0000:02:00.0: [00]: 0x00057400   2   2   3   3
[72771.672136] ath10k_pci 0000:02:00.0: [01]: 0x00057800  22  22 215 218
[72771.678606] ath10k_pci 0000:02:00.0: [02]: 0x00057c00  27  27  24  27
[72771.685073] ath10k_pci 0000:02:00.0: [03]: 0x00058000   0   0   2   0
[72771.691550] ath10k_pci 0000:02:00.0: [04]: 0x00058400 669 669 120  80
[72771.698014] ath10k_pci 0000:02:00.0: [05]: 0x00058800   9   9 397 425
[72771.704491] ath10k_pci 0000:02:00.0: [06]: 0x00058c00  17  17  17  17
[72771.710961] ath10k_pci 0000:02:00.0: [07]: 0x00059000   0   0   0   0
[72771.719448] ath10k_pci 0000:02:00.0: debug log header, dbuf: 0x4125e4  dropped: 0
[72771.727966] ath10k_pci 0000:02:00.0: [0] next: 0x4125fc buf: 0x410448 sz: 1500 len: 0 count: 0 free: 0
[72771.738314] ath10k_pci 0000:02:00.0: [1] next: 0x4125e4 buf: 0x410a38 sz: 1500 len: 0 count: 0 free: 0
[72771.771827] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
[72771.779167] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
[72771.786510] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
[72771.794237] ath10k_pci 0000:02:00.0: failed to set preamble for vdev 0: -11
[72771.802153] ath10k_pci 0000:02:00.0: failed to send wmi nop: -108
[72771.849944] ath10k_pci 0000:02:00.0: removing peer, cleanup-all, deleting: peer da823000 vdev: 0 addr: XXX
[72771.860956] ath10k_pci 0000:02:00.0: removing peer, cleanup-all, deleting: peer e8fa6c00 vdev: 0 addr: XXX
[72771.871956] ath10k_pci 0000:02:00.0: removing peer, cleanup-all, deleting: peer c483f400 vdev: 0 addr: XXX
[72771.882962] ath10k_pci 0000:02:00.0: removing peer, cleanup-all, deleting: peer ea67de00 vdev: 0 addr: XXX
[72771.893954] ath10k_pci 0000:02:00.0: removing peer, cleanup-all, deleting: peer ea153c00 vdev: 0 addr: XXX
[72771.904956] ath10k_pci 0000:02:00.0: removing peer, cleanup-all, deleting: peer ea153600 vdev: 0 addr: XXX
[72771.915986] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
[72772.032151] ieee80211 phy0: Hardware restart was requested
[72773.059250] ath10k_pci 0000:02:00.0: 10.1 wmi init: vdevs: 16  peers: 127  tid: 256
[72773.076431] ath10k_pci 0000:02:00.0: wmi print 'P 128 V 8 T 410'
[72773.082548] ath10k_pci 0000:02:00.0: wmi print 'msdu-desc: 1424  sw-crypt: 0'
[72773.089716] ath10k_pci 0000:02:00.0: wmi print 'alloc rem: 26400 iram: 27140'
[72773.175365] ath10k_pci 0000:02:00.0: dropping dbg buffer due to crash since read
[72773.185125] ath10k_pci 0000:02:00.0: dropping dbg buffer due to crash since read
[72773.196916] ath10k_pci 0000:02:00.0: device successfully recovered
[72774.058500] ath10k_pci 0000:02:00.0: dropping dbg buffer due to crash since read
[72777.058817] ath10k_pci 0000:02:00.0: dropping dbg buffer due to crash since read
[72778.058831] ath10k_pci 0000:02:00.0: dropping dbg buffer due to crash since read
[72784.059184] ath10k_pci 0000:02:00.0: dropping dbg buffer due to crash since read
[72789.059276] ath10k_pci 0000:02:00.0: dropping dbg buffer due to crash since read
tardyp commented 4 years ago

crash dump:

crash-ath10k-ct.bin.zip

tardyp commented 4 years ago

Another crash from today.

crash-ath10k-ct.bin2.zip

[158158.600731] ath10k_pci 0000:02:00.0: Cannot communicate with firmware, attempting to fake crash and restart firmware.
[158158.611502] ath10k_pci 0000:02:00.0: firmware crashed! (uuid c256ed3f-ef27-478f-9f9f-d7e923a82416)
[158158.620585] ath10k_pci 0000:02:00.0: qca988x hw2.0 target 0x4100016c chip_id 0x043202ff sub 0000:0000
[158158.629935] ath10k_pci 0000:02:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 0
[158158.639355] ath10k_pci 0000:02:00.0: firmware ver 10.1-ct-8x-__fH-021-4fa9f30 api 2 features wmi-10.x,txstatus-noack,wmi-10.x-CT,ratemask-CT,regdump-CT,txrate-CT,flush-all-CT,pingpong-CT,ch-regs-CT,nop-CT,set-special-CT,get-temp-CT,tx-rc-CT,cust-stats-CT crc32 580c6146
[158158.663348] ath10k_pci 0000:02:00.0: board_file api 1 bmi_id N/A crc32 bebc7c08
[158158.670783] ath10k_pci 0000:02:00.0: htt-ver 2.2 wmi-op 2 htt-op 2 cal otp max-sta 128 raw 0 hwcrypto 1
[158158.682310] ath10k_pci 0000:02:00.0: firmware register dump:
[158158.688073] ath10k_pci 0000:02:00.0: [00]: 0x00940750 0x00400C00 0x00980000 0x009B6074
[158158.696117] ath10k_pci 0000:02:00.0: [04]: 0x009B60E4 0x009B6208 0x00941B6C 0x00941B20
[158158.704152] ath10k_pci 0000:02:00.0: [08]: 0x00941B00 0x009423A4 0x009422DC 0x009422C4
[158158.712192] ath10k_pci 0000:02:00.0: [12]: 0x00941B90 0x009423F4 0x009423D4 0x009423DC
[158158.720223] ath10k_pci 0000:02:00.0: [16]: 0x009423E4 0x009423EC 0x00942888 0x00942520
[158158.728255] ath10k_pci 0000:02:00.0: [20]: 0x009424FC 0x0094241C 0x00942540 0x00942498
[158158.736318] ath10k_pci 0000:02:00.0: [24]: 0x00942450 0x0094250C 0x009424D0 0x00942750
[158158.744364] ath10k_pci 0000:02:00.0: [28]: 0x00942798 0x00942844 0x009428C4 0x00942594
[158158.752411] ath10k_pci 0000:02:00.0: [32]: 0x0094258C 0x00942C28 0x00942F4C 0x00942F7C
[158158.760442] ath10k_pci 0000:02:00.0: [36]: 0x00942F90 0x00942FE4 0x00942FF8 0x00943040
[158158.768471] ath10k_pci 0000:02:00.0: [40]: 0x0094077C 0x00940790 0x00943084 0x00942F00
[158158.776513] ath10k_pci 0000:02:00.0: [44]: 0x009B6234 0x00942A80 0x00942D40 0x00942D64
[158158.784551] ath10k_pci 0000:02:00.0: [48]: 0x00942D78 0x00942D9C 0x00957E18 0x00957E28
[158158.792583] ath10k_pci 0000:02:00.0: [52]: 0x00957E20 0x00940788 0x009430EC 0x00000000
[158158.800627] ath10k_pci 0000:02:00.0: [56]: 0x00000000 0x00000000 0x00000000 0x00000000
[158158.808691] ath10k_pci 0000:02:00.0: Copy Engine register dump:
[158158.814755] ath10k_pci 0000:02:00.0: [00]: 0x00057400   2   2   3   3
[158158.821316] ath10k_pci 0000:02:00.0: [01]: 0x00057800  16  16 305 308
[158158.827875] ath10k_pci 0000:02:00.0: [02]: 0x00057c00  62  62  59  62
[158158.834453] ath10k_pci 0000:02:00.0: [03]: 0x00058000   7   7   9   7
[158158.841014] ath10k_pci 0000:02:00.0: [04]: 0x00058400 7599 7599   1 217
[158158.847748] ath10k_pci 0000:02:00.0: [05]: 0x00058800  29  29  78  93
[158158.854310] ath10k_pci 0000:02:00.0: [06]: 0x00058c00  27  27  27  27
[158158.860878] ath10k_pci 0000:02:00.0: [07]: 0x00059000   0   0   0   0
[158158.869444] ath10k_pci 0000:02:00.0: debug log header, dbuf: 0x4125e4  dropped: 0
[158158.878062] ath10k_pci 0000:02:00.0: [0] next: 0x4125fc buf: 0x410448 sz: 1500 len: 24 count: 1 free: 0
[158158.888603] ath10k_pci 0000:02:00.0: ath10k_pci ATH10K_DBG_BUFFER:
[158158.894907] ath10k: [0000]: 01A77736 13FC0432 9110DDDE 09A77736 09A76C5E 009B63D0
[158158.902520] ath10k_pci 0000:02:00.0: ATH10K_END
[158158.908168] ath10k_pci 0000:02:00.0: [1] next: 0x4125e4 buf: 0x410a38 sz: 1500 len: 0 count: 0 free: 0
[158158.941754] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
[158158.949204] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
[158158.956648] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
[158158.964377] ath10k_pci 0000:02:00.0: failed to recalculate rts/cts prot for vdev 0: -11
[158158.972666] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
[158158.980493] ath10k_pci 0000:02:00.0: failed to set cts protection for vdev 0: -108
[158158.988566] ath10k_pci 0000:02:00.0: failed to set preamble for vdev 0: -108
[158158.995858] ath10k_pci 0000:02:00.0: dropping dbg buffer due to crash since read
[158159.003402] ath10k_pci 0000:02:00.0: failed to send wmi nop: -108
[158159.010368] ath10k_pci 0000:02:00.0: failed to read temperature -108
[158159.046354] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
[158159.053815] ath10k_pci 0000:02:00.0: removing peer, cleanup-all, deleting: peer ea6a4e00 vdev: 0 addr: XXX
[158159.064898] ath10k_pci 0000:02:00.0: removing peer, cleanup-all, deleting: peer e6c25800 vdev: 0 addr: XXX
[158159.075979] ath10k_pci 0000:02:00.0: removing peer, cleanup-all, deleting: peer ec345600 vdev: 0 addr: XXX
[158159.087053] ath10k_pci 0000:02:00.0: removing peer, cleanup-all, deleting: peer ea6a4800 vdev: 0 addr: XXX
[158159.098125] ath10k_pci 0000:02:00.0: removing peer, cleanup-all, deleting: peer c0f84200 vdev: 0 addr: XXX
[158159.109198] ath10k_pci 0000:02:00.0: removing peer, cleanup-all, deleting: peer c0f84800 vdev: 0 addr: XXX
[158159.232964] ieee80211 phy0: Hardware restart was requested
[158160.279448] ath10k_pci 0000:02:00.0: 10.1 wmi init: vdevs: 16  peers: 127  tid: 256
[158160.296683] ath10k_pci 0000:02:00.0: wmi print 'P 128 V 8 T 410'
[158160.303168] ath10k_pci 0000:02:00.0: wmi print 'msdu-desc: 1424  sw-crypt: 0'
[158160.310439] ath10k_pci 0000:02:00.0: wmi print 'alloc rem: 26400 iram: 27140'
[158160.392233] ath10k_pci 0000:02:00.0: dropping dbg buffer due to crash since read
[158160.401768] ath10k_pci 0000:02:00.0: dropping dbg buffer due to crash since read
[158160.410192] ath10k_pci 0000:02:00.0: device successfully recovered
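For anyone comparing several of these dumps, a small script can pull out each crash and the time until the driver reports recovery. This is only a sketch: the function name `crash_windows` is made up for illustration, and it assumes the log keeps the bracketed kernel timestamps and the exact "firmware crashed!" and "device successfully recovered" messages shown above.

```python
import re

# Bracketed kernel timestamp that prefixes each dmesg line, e.g. "[72771.464357]".
TS = re.compile(r"\[\s*(\d+\.\d+)\]")
# The crash line as it appears in the logs above.
CRASH = re.compile(r"firmware crashed! \(uuid ([0-9a-f-]+)\)")


def crash_windows(log_text):
    """Return (uuid, crash_time, recovery_time) for each crash found in the log."""
    windows = []
    pending = None  # (uuid, crash_time) still waiting for a recovery line
    for line in log_text.splitlines():
        ts = TS.search(line)
        if ts is None:
            continue  # not a timestamped kernel line
        when = float(ts.group(1))
        crashed = CRASH.search(line)
        if crashed:
            pending = (crashed.group(1), when)
        elif "device successfully recovered" in line and pending:
            windows.append((pending[0], pending[1], when))
            pending = None
    return windows


# Two lines copied from the first log in this issue.
sample = """\
[72771.464357] ath10k_pci 0000:02:00.0: firmware crashed! (uuid 9f9a7051-0600-47ad-8545-dbd03876c73b)
[72773.196916] ath10k_pci 0000:02:00.0: device successfully recovered
"""

for uuid, start, end in crash_windows(sample):
    print(f"{uuid}: recovered {end - start:.2f} s after the crash line")
```

Run against the two logs in this thread, both crashes show a gap of a little under two seconds between the crash line and the recovery line.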

image

It does not look like temperature is an important factor. I tried to reproduce the crash with an iperf3 load for about 50 minutes (4 MB/s), but could not trigger one. Then I started Stadia and got a crash after about 20 minutes.
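As a rough sanity check on the numbers reported so far, the total data moved in each run can be compared. The sketch below assumes decimal units (1 GB = 1000 MB) and assumes the second Stadia session also ran at roughly 3 MB/s, which is not stated for that run; the helper name `transferred_gb` is only illustrative.

```python
def transferred_gb(rate_mb_per_s, minutes):
    """Approximate data volume in GB for a constant-rate stream (1 GB = 1000 MB)."""
    return rate_mb_per_s * minutes * 60 / 1000


runs = [
    ("Stadia, crash after ~45 min", 3, 45),
    ("iperf3, no crash in ~50 min", 4, 50),
    # Rate for this run is an assumption; only the first session's rate was reported.
    ("Stadia, crash after ~20 min", 3, 20),
]

for label, rate, minutes in runs:
    print(f"{label}: about {transferred_gb(rate, minutes):.1f} GB")
```

Under those assumptions the iperf3 run moved more data than either Stadia session without crashing, which suggests that raw volume alone is not the trigger.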

ackstorm23 commented 3 years ago

I've been encountering this as well lately.

Jul 30 09:21:36 turris kernel: [486097.267028] ath10k_warn: 94 callbacks suppressed
Jul 30 09:21:36 turris kernel: [486097.267036] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 2, skipped old beacon
Jul 30 09:21:36 turris kernel: [486097.301236] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
Jul 30 09:21:37 turris kernel: [486097.335304] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 1, skipped old beacon
Jul 30 09:21:37 turris kernel: [486097.369541] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 2, skipped old beacon
Jul 30 09:21:37 turris kernel: [486097.403686] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
Jul 30 09:21:37 turris kernel: [486097.437760] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 1, skipped old beacon
Jul 30 09:21:37 turris kernel: [486097.471818] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 2, skipped old beacon
Jul 30 09:21:37 turris kernel: [486097.505951] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
Jul 30 09:21:37 turris kernel: [486097.540088] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 1, skipped old beacon
Jul 30 09:21:37 turris kernel: [486097.574296] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 2, skipped old beacon
Jul 30 09:21:41 turris kernel: [486102.004745] ath10k_pci 0000:02:00.0: device successfully recovered

No significant increase in chipset temperatures for me, either.

The only factor seemed to be higher than average network throughput at the time.
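To see how dense the "SWBA overrun" warnings are before a restart, the lines can be counted per vdev and the spacing between them averaged. This is a sketch with a hypothetical helper name (`overrun_stats`); it relies only on the message text and bracketed timestamps visible in the logs above, and works with or without the syslog prefix.

```python
import re
from collections import Counter

# Captures the kernel timestamp and the vdev number from an overrun warning.
OVERRUN = re.compile(r"\[\s*(\d+\.\d+)\].*SWBA overrun on vdev (\d+)")


def overrun_stats(log_text):
    """Return (count per vdev, mean gap between consecutive overruns in ms)."""
    times = []
    per_vdev = Counter()
    for line in log_text.splitlines():
        m = OVERRUN.search(line)
        if m:
            times.append(float(m.group(1)))
            per_vdev[int(m.group(2))] += 1
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    mean_gap_ms = 1000 * sum(gaps) / len(gaps) if gaps else None
    return per_vdev, mean_gap_ms


# Three lines copied from the log in this comment.
sample = """\
Jul 30 09:21:36 turris kernel: [486097.267036] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 2, skipped old beacon
Jul 30 09:21:36 turris kernel: [486097.301236] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 0, skipped old beacon
Jul 30 09:21:37 turris kernel: [486097.335304] ath10k_pci 0000:02:00.0: SWBA overrun on vdev 1, skipped old beacon
"""

counts, gap_ms = overrun_stats(sample)
print(dict(counts), f"mean gap {gap_ms:.1f} ms")
```

In this log the warnings are roughly 34 ms apart across three vdevs, while in the first log of this issue they are roughly 102 ms apart on a single vdev. If the beacon interval is the common default of 100 TU (102.4 ms), which is an assumption since it is not stated here, that spacing would mean nearly every beacon was being skipped in the moments before the restart.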

greearb commented 2 years ago

Wave-1 radios lock up. I don't know how to fix it, and it may be caused by hardware- or PCI-related bugs. The problem has been there forever. The best I can do is fake a crash and restart the firmware, which is what at least some of these logs show. Closing this since I don't know how to make any more progress in this area.