greearb / ath10k-ct

Stand-alone ath10k driver based on Candela Technologies Linux kernel.
110 stars 40 forks source link

firmware ver 10.1-ct-8x-__fH-022-ecad3248 crashed #180

Closed stintel closed 2 years ago

stintel commented 3 years ago

Hi Ben,

We've briefly talked about this one on IRC before.

< greearb> stintel, those wave-1 crashes are some sort of mem corruption in the firmware descriptors, you have hit same problem before. In case the new firmware is obviously worse than a previous one, then maybe something we can look out, but may also just random reproduction of previous issue

This seems to be happening relatively often now, whereas last year, it happened only very rarely. Happening on OpenWrt master r16423-bcdf600fc5 on a D-Link DAP-2695 (ath79).

[69985.299933] ath10k_pci 0000:00:00.0: firmware crashed! (guid ad381c0d-8dd6-4bcd-b821-529b5b7832d0)
[69985.309118] ath10k_pci 0000:00:00.0: qca988x hw2.0 target 0x4100016c chip_id 0x043202ff sub 0000:0000
[69985.318538] ath10k_pci 0000:00:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 0
[69985.330889] ath10k_pci 0000:00:00.0: firmware ver 10.1-ct-8x-__fH-022-ecad3248 api 2 features wmi-10.x,mfp,txstatus-noack,wmi-10.x-CT,ratemask-CT,regdump-CT,txrate-CT,flush-all-CT,pingpong-CT,ch-regs-CT,nop-CT,set-special-CT,get-temp-CT,tx-rc-CT,cust-stats-CT,retry-gt2-CT,txrate2-CT,beacon-cb-CT,wmi-block-ack-CT crc32 1b2a161c
[69985.360426] ath10k_pci 0000:00:00.0: board_file api 1 bmi_id N/A crc32 bebc7c08
[69985.367891] ath10k_pci 0000:00:00.0: htt-ver 2.2 wmi-op 2 htt-op 2 cal file max-sta 128 raw 0 hwcrypto 1
[69985.377676] ath10k_pci 0000:00:00.0: firmware register dump:
[69985.383526] ath10k_pci 0000:00:00.0: [00]: 0x4100016C 0x000015B3 0x009948FA 0x00955B31
[69985.391604] ath10k_pci 0000:00:00.0: [04]: 0x009948FA 0x00060330 0x00000005 0x00000051
[69985.399655] ath10k_pci 0000:00:00.0: [08]: 0x00000000 0x009BABEC 0x00426588 0xDEADC0DE
[69985.407867] ath10k_pci 0000:00:00.0: [12]: 0x00000009 0x00000000 0x0099485C 0x00994869
[69985.416191] ath10k_pci 0000:00:00.0: [16]: 0x00958080 0x0094085D 0x00000000 0x00000000
[69985.424380] ath10k_pci 0000:00:00.0: [20]: 0x409948FA 0x0040AC34 0x00000000 0x00410840
[69985.432488] ath10k_pci 0000:00:00.0: [24]: 0x809950AF 0x0040AC94 0x00000000 0xC09948FA
[69985.440657] ath10k_pci 0000:00:00.0: [28]: 0x809960A0 0x0040AD14 0x0041E524 0x00426588
[69985.448713] ath10k_pci 0000:00:00.0: [32]: 0x80997EC9 0x0040ADB4 0x0041E524 0x00426588
[69985.456768] ath10k_pci 0000:00:00.0: [36]: 0x809AF249 0x0040AEE4 0x00425918 0x00000002
[69985.464896] ath10k_pci 0000:00:00.0: [40]: 0x80940F18 0x0040AF14 0x00000004 0x004039D0
[69985.473000] ath10k_pci 0000:00:00.0: [44]: 0x80940EEA 0x0040AF44 0x00400000 0x00000000
[69985.481069] ath10k_pci 0000:00:00.0: [48]: 0x80940F31 0x0040AF64 0x00401C10 0x00400600
[69985.489184] ath10k_pci 0000:00:00.0: [52]: 0x40940024 0x0040AF84 0x004068E8 0x004068E8
[69985.497342] ath10k_pci 0000:00:00.0: [56]: 0x00000000 0x0040AFA4 0x009BB001 0x00040020
[69985.505437] ath10k_pci 0000:00:00.0: Copy Engine register dump:
[69985.511481] ath10k_pci 0000:00:00.0: [00]: 0x00057400   7   7   3   3
[69985.518042] ath10k_pci 0000:00:00.0: [01]: 0x00057800   6   6 328 329
[69985.524609] ath10k_pci 0000:00:00.0: [02]: 0x00057c00  53  53 116 117
[69985.531162] ath10k_pci 0000:00:00.0: [03]: 0x00058000   8   8  12   8
[69985.537716] ath10k_pci 0000:00:00.0: [04]: 0x00058400 813 813 192 152
[69985.544264] ath10k_pci 0000:00:00.0: [05]: 0x00058800  14  14 365 366
[69985.550811] ath10k_pci 0000:00:00.0: [06]: 0x00058c00  29  29  29  29
[69985.557358] ath10k_pci 0000:00:00.0: [07]: 0x00059000   0   0   0   0
[69985.564184] ath10k_pci 0000:00:00.0: debug log header, dbuf: 0x411a98  dropped: 0
[69985.571892] ath10k_pci 0000:00:00.0: [0] next: 0x411ab0 buf: 0x40f8fc sz: 1500 len: 108 count: 4 free: 0
[69985.581611] ath10k_pci 0000:00:00.0: ath10k_pci ATH10K_DBG_BUFFER:
[69985.587907] ath10k: [0000]: 50A74504 514CFC17 43841091 28004900 948E4200 2800C90E 00000000 50A74504
[69985.597127] ath10k: [0008]: 514CFC17 41841091 DEC0ADDE DEC0ADDE 28004900 40084100 50A74504 514CFC13
[69985.606338] ath10k: [0016]: 42841091 50830000 05000000 00000000 50A74504 0100FC17 B3150000 B3150000
[69985.615530] ath10k: [0024]: 24AB4000 6C010041 00000000
[69985.620787] ath10k_pci 0000:00:00.0: ATH10K_END
[69985.625482] ath10k_pci 0000:00:00.0: [1] next: 0x411a98 buf: 0x40feec sz: 1500 len: 0 count: 0 free: 0
[69986.004648] ath10k_pci 0000:00:00.0: removing peer, cleanup-all, deleting: peer b37d8ce1 vdev: 1 addr: a8:66:7f:1e:67:d6
[69986.015790] ath10k_pci 0000:00:00.0: removing peer, cleanup-all, deleting: peer 80ed3d41 vdev: 1 addr: 9e:d6:43:2d:b8:18
[69986.026934] ath10k_pci 0000:00:00.0: removing peer, cleanup-all, deleting: peer bd6d9e9d vdev: 0 addr: 9c:d6:43:2d:b8:18
[69986.157604] ieee80211 phy0: Hardware restart was requested
[69987.097138] ath10k_pci 0000:00:00.0: 10.1 wmi init: vdevs: 16  peers: 127  tid: 256
[69987.113940] ath10k_pci 0000:00:00.0: wmi print 'P 128 V 8 T 410'
[69987.120281] ath10k_pci 0000:00:00.0: wmi print 'msdu-desc: 1424  sw-crypt: 0 ct-sta: 0'
[69987.128446] ath10k_pci 0000:00:00.0: wmi print 'alloc rem: 25000 iram: 38944'
[69987.192493] ath10k_pci 0000:00:00.0: pdev param 0 not supported by firmware
[69987.206727] ath10k_pci 0000:00:00.0: rts threshold -1
[69987.212467] ath10k_pci 0000:00:00.0: rts threshold -1
[69987.228339] ath10k_pci 0000:00:00.0: failed to transmit packet, dropping: -19
[69987.235635] ath10k_pci 0000:00:00.0: failed to submit frame: -19
[69987.241757] ath10k_pci 0000:00:00.0: failed to transmit frame: -19
[69987.248276] ath10k_pci 0000:00:00.0: device successfully recovered
root@ap0:~# ethtool -i wl5-lan
driver: ath10k_pci
version: 5.4.109
firmware-version: 10.1-ct-8x-__fH-022-ecad3248
expansion-rom-version:
bus-info: 0000:00:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

To give an idea of how often it happens now (and probably incomplete due to limited buffer):

root@ap0:~# dmesg | grep "firmware crashed"
[48622.481161] ath10k_pci 0000:00:00.0: firmware crashed! (guid 88a31862-6afb-4691-85cb-67a0aa748918)
[48646.544511] ath10k_pci 0000:00:00.0: firmware crashed! (guid 4c3a22c6-98af-471c-87b2-cab82453709d)
[48665.698011] ath10k_pci 0000:00:00.0: firmware crashed! (guid 3c79e8c9-ccd6-434e-9ad5-fa9ce1b7e0cd)
[50792.472918] ath10k_pci 0000:00:00.0: firmware crashed! (guid f0866aaf-51e2-43e1-a5bb-796a8bab201e)
[51130.881619] ath10k_pci 0000:00:00.0: firmware crashed! (guid 79ffc0f3-b7dc-43ef-8341-40be87ebacb2)
[57037.635405] ath10k_pci 0000:00:00.0: firmware crashed! (guid 048b84ee-e517-46e5-84c2-8be54c8aca08)
[57059.317994] ath10k_pci 0000:00:00.0: firmware crashed! (guid 6f5950e1-4015-41c3-83c9-8cddffde2421)
[57077.899492] ath10k_pci 0000:00:00.0: firmware crashed! (guid 495f01b9-86d5-45ed-b55b-c9ec4597e866)
[57881.010454] ath10k_pci 0000:00:00.0: firmware crashed! (guid af185605-0275-4f12-a4db-027ca80f5bc5)
[61527.354332] ath10k_pci 0000:00:00.0: firmware crashed! (guid 8e1a0517-90ba-4927-96a7-7a75efdfdd4c)
[61552.837602] ath10k_pci 0000:00:00.0: firmware crashed! (guid 4e543874-0e1b-4613-bf2d-1f4425a982db)
[61590.880376] ath10k_pci 0000:00:00.0: firmware crashed! (guid d286544e-c0ef-4223-9215-1976a53a7ac5)
[61607.993082] ath10k_pci 0000:00:00.0: firmware crashed! (guid 9834fcb6-3af4-4d4a-b41e-d4889b4bda04)
[61633.951381] ath10k_pci 0000:00:00.0: firmware crashed! (guid d4f145bf-7519-4bbd-89df-95b2d240dca8)
[61650.016322] ath10k_pci 0000:00:00.0: firmware crashed! (guid 6d2931cc-9722-4a28-b0ff-b09bcdb0cdde)
[68399.425484] ath10k_pci 0000:00:00.0: firmware crashed! (guid 33721edb-be65-471f-8498-500910d3ec73)
[68484.554474] ath10k_pci 0000:00:00.0: firmware crashed! (guid 478e319a-a563-4812-9906-daabb1ed8ce5)
[69985.299933] ath10k_pci 0000:00:00.0: firmware crashed! (guid ad381c0d-8dd6-4bcd-b821-529b5b7832d0)

I'll try reverting the last ath10k-ct firmware in OpenWrt master to see if that changes anything.

root@ap0:~# lspci -vv
00:00.0 Network controller: Qualcomm Atheros QCA986x/988x 802.11ac Wireless Network Adapter
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 14
        Region 0: Memory at 10000000 (64-bit, non-prefetchable) [size=2M]
        Expansion ROM at 10200000 [virtual] [disabled] [size=64K]
        Capabilities: [40] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit-
                Address: 00000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [140 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [160 v1] Device Serial Number 00-00-00-00-00-00-00-00
        Kernel driver in use: ath10k_pci
hnyman commented 3 years ago

@stintel did you notice #156 and #178 ?

They are pretty similar firmware crashes.

I also see crashes with my travel router GL-AR750, which is a rather similar MIPS ath79 based device with QCA9887, as your device. (see crash log extract below, note the older -ct firmware 2020-07-02-1)

But I have see no similar frequent crashes with my main router, ARM based ipq806x R7800, which is very popular with OpenWrt (and for which I author a community build).

So, I think that the ath79 and QCA988x might be a common theme making these ath10k-ct firmware crashes more frequent. Also #156 and #178 seem to concern such ath79 devices.

This is with GL-R750 / OpenWrt SNAPSHOT r14869-9867d08e07 / ath10k-firmware-qca9887-ct 2020-07-02-1

Fri Apr  9 23:34:46 2021 kern.warn kernel: [  279.896985] ath10k_pci 0000:00:00.0: SWBA overrun on vdev 0, skipped old beacon
Fri Apr  9 23:34:46 2021 kern.warn kernel: [  279.999390] ath10k_pci 0000:00:00.0: SWBA overrun on vdev 0, skipped old beacon
Fri Apr  9 23:34:46 2021 kern.err kernel: [  280.085469] ath10k_pci 0000:00:00.0: Cannot communicate with firmware, previous wmi cmds: 36904:-5695 36954:-5792 36904:-5821 36904:-5948, jiffies: -4992, attempting to fake crash and restart firmware, dev-flags: 0x42
Fri Apr  9 23:34:46 2021 kern.warn kernel: [  280.105485] ath10k_pci 0000:00:00.0: SWBA overrun on vdev 0, skipped old beacon
Fri Apr  9 23:34:46 2021 kern.warn kernel: [  280.113563] ath10k_pci 0000:00:00.0: failed to send wmi nop: -143
Fri Apr  9 23:34:46 2021 kern.warn kernel: [  280.119905] ath10k_pci 0000:00:00.0: could not request stats (type -268435456 ret -143 specifier 1)
Fri Apr  9 23:34:47 2021 kern.err kernel: [  280.129541] ath10k_pci 0000:00:00.0: firmware crashed! (guid n/a)
Fri Apr  9 23:34:47 2021 kern.info kernel: [  280.135894] ath10k_pci 0000:00:00.0: qca9887 hw1.0 target 0x4100016d chip_id 0x004000ff sub 0000:0000
Fri Apr  9 23:34:47 2021 kern.info kernel: [  280.145517] ath10k_pci 0000:00:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 0
Fri Apr  9 23:34:47 2021 kern.info kernel: [  280.159086] ath10k_pci 0000:00:00.0: firmware ver 10.1-ct-87-__fW-022-538f0906 api 2 features wmi-10.x,has-wmi-mgmt-tx,mfp,txstatus-noack,wmi-10.x-CT,ratemask-CT,regdump-CT,txrate-CT,flush-all-CT,pingpong-CT,ch-regs-CT,nop-CT,set-special-CT,get-temp-CT,tx-rc-CT,cust-stats-CT,retry-gt2-CT,txrate2-CT,beacon-cb-CT,wmi-block-ack-CT crc32 e27449db
stintel commented 3 years ago

From #178

I cannot make progress on firmware crashes that are for this reason: [ 3702.764066] ath10k_pci 0000:00:00.0: Cannot communicate with firmware, previous wmi cmds: 40859:849904 36904:849785 36904:849780 36904:849775, jiffies: 850688, attempting to fake crash and restart firmware, dev-flags: 0x42

This doesn't appear in my logs, so I'd say it's definitely a different problem.

156 also looks different, but Iet's wait for the expert ;-)

greearb commented 3 years ago

I don't think I can make much more progress on wave-1 ath10k firmware. There is the memory corruption and crash bug, which I do not know how to fix, and there is the 'firmware hangs and driver forces restart' issue, which I also do not know how to fix. In general, a fast crash and auto recovery may just be the best we can do here. If stock firmware works better in certain cases or certain platforms, then please use it there. In other uses and other platforms, -ct firmware/driver may be better, so then use it in those cases.

stintel commented 3 years ago

Too bad. I guess I'll just be replacing my QCA-based APs then.

greearb commented 3 years ago

early testing with an MTK ea8450 looks promising...it has good owrt support.

CodeFetch commented 2 years ago

This is with GL-R750 / OpenWrt SNAPSHOT r14869-9867d08e07 / ath10k-firmware-qca9887-ct 2020-07-02-1

Is the QCA9887 actually a wave 1 device? Because Qualcomm sells it as wave 2.

greearb commented 2 years ago

Closing this bug, not something I can fix (wave-1 mem corruption and/or timeout talking to firmware)

stintel commented 2 years ago

Closing this bug, not something I can fix (wave-1 mem corruption and/or timeout talking to firmware)

Very unfortunate, but understandable. As of now I will advise people to stay away from anything QCA as far as possible.