greearb / ath10k-ct

Stand-alone ath10k driver based on Candela Technologies Linux kernel.

[QCA9888] AP Comes Up, Driver/Firmware Fails ~90 Sec. (IPQ 4019 / Linksys "Dallas" / AP DK07.1-c1) #70

Closed: jeffsf closed this issue 5 years ago

jeffsf commented 5 years ago

Originally reported on the ath10k mailing list as "QCA9888: Driver/Firmware Crash After Initialization", with a follow-on direct email of the same title on February 16, 2019, containing logs and the driver/firmware files used at that time.

Summary:

Failure rate is somewhere above 90%

QCA9888 on IPQ4019 platform, attached over PCIe; hardware is virtually unloaded.

Additional data was collected with ath10k_core debug_mask=0x3f

One example showing exceptionally long-term success, including authentication, is the 2019-02-23_1821-PST run, for which dmesg and syslog are available.

Some common errors seen in the logs are

firmware ver 10.4b-ct-9888-fW-012-5815a26a api 5 features mfp,peer-flow-ctrl,txstatus-noack,wmi-10.x-CT,ratemask-CT,regdump-CT,txrate-CT,flush-all-CT,pingpong-CT,ch-regs-CT,nop-CT,set-special-CT,tx-rc-CT,cust-stats-CT,txrate2-CT crc32 4a66be6f

Later versions have been tried.

Snapshot-in-time OpenWrt source at https://github.com/jeffsf/openwrt-ea8300/

DTS segment showing the QCA9888 attachment (inherits from #include "qcom-ipq4019.dtsi"):

                pci@40000000 {                  // pcie0
                        status = "okay";

                        bridge@0,0 {
                                reg = <0x00000000 0 0 0 0>;
                                #address-cells = <3>;
                                #size-cells = <2>;
                                ranges;

                                wifi2: wifi@1,0 {
                                        compatible = "qcom,ath10k";
                                        status = "okay";
                                        reg = <0x00010000 0 0 0 0>;
                                        // qcom,ath10k-calibration-variant = "<some string>";
                                };
                        };
                };

jeffsf commented 5 years ago

Note that the above testing was done prior to the two potentially significant commits below. Additional testing is underway.

commit c6caa7a27a38929f6d7e76795df6c3dbba7d7351
Author: Felix Fietkau <redacted>
Date:   Fri Mar 1 14:54:31 2019 +0100

    mac80211: add a fix to prevent unsafe queue wake calls during restart

    Signed-off-by: Felix Fietkau <redacted>

commit 82d306b595b374277fd04c158d4cc7ddf5cf0b37
Author: Felix Fietkau <redacted>
Date:   Fri Mar 1 13:10:53 2019 +0100

    mac80211: backport tx queue start/stop fix

    Among other things, it fixes a race condition on calling ieee80211_restart_hw

    Signed-off-by: Felix Fietkau <redacted>

Edit: The wireless appears "stable" with current builds, which apply device-specific commits on top of a point on OpenWrt master that is later than these commits.

greearb commented 5 years ago

There is no firmware crash, but the firmware does appear to just go away. There are no obvious errors in the DBGLOG output from the firmware. The message below is the last interesting one before the FW goes away. It would be interesting to know whether this command is always the last one before the failure. Also, you could turn on 'wmi' debugging to get a more precise idea of the last messages before the FW goes away. If you can find a pattern, maybe it would provide a clue.

Sat Feb 23 18:08:48 2019 kern.warn kernel: [ 336.641926] ath10k_pci 0000:01:00.0: bss channel survey timed out
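
One way to look for such a pattern across runs is to pull out the ath10k messages that precede the timeout line in each saved log. The following is only a minimal sketch: the function names are hypothetical, and the regular expression assumes every relevant line has the same shape as the syslog line quoted above (kernel timestamp in brackets, then `ath10k_pci <device>: <message>`).

```python
import re
from collections import Counter

# Assumed line shape, taken from the syslog line quoted above:
#   "... kernel: [ 336.641926] ath10k_pci 0000:01:00.0: <message>"
LINE_RE = re.compile(r"\[\s*(?P<ts>\d+\.\d+)\]\s+ath10k\S*\s+\S+:\s+(?P<msg>.*)$")


def messages_before(lines, marker, count=3):
    """Return up to `count` (timestamp, message) pairs seen before `marker`.

    Only lines matching LINE_RE are considered; the search stops at the
    first ath10k message containing `marker`.
    """
    seen = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not an ath10k kernel message
        if marker in m.group("msg"):
            return seen[-count:] if count > 0 else []
        seen.append((float(m.group("ts")), m.group("msg")))
    return []  # marker never appeared in this log


def tally_last_message(runs, marker):
    """Count which message immediately precedes `marker` across several logs."""
    tally = Counter()
    for lines in runs:
        before = messages_before(lines, marker, count=1)
        if before:
            tally[before[0][1]] += 1
    return tally
```

Feeding each run's syslog (for example, `open(path).read().splitlines()`) into `tally_last_message(runs, "bss channel survey timed out")` would show whether one message dominates as the last one before the timeout. With 'wmi' debugging enabled, the same approach applies to the more detailed output, provided those lines match the assumed shape.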

jeffsf commented 5 years ago

Changing from debug_mask=0x3f to debug_mask=0x203f in /etc/modules.d/ath10k_core

Adding debug_mask=0x203f to /etc/modules.d/ath10k-ct (not sure if that does anything)

If those aren't "the right" values, I can easily change them for later tests.
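
For reference, a small sketch of how the two mask values differ bit by bit. It only does the arithmetic; what each bit enables is defined by the driver's debug-mask enum and is not assumed here. The helper name `set_bits` is made up for illustration.

```python
def set_bits(mask):
    """Return the individual bit values that are set in `mask`, lowest first."""
    return [1 << i for i in range(mask.bit_length()) if mask & (1 << i)]


old_mask = 0x3f    # value used for the earlier data collection
new_mask = 0x203f  # value being switched to

added = new_mask & ~old_mask    # bits enabled by the change
removed = old_mask & ~new_mask  # bits disabled by the change

print([hex(b) for b in set_bits(old_mask)])  # ['0x1', '0x2', '0x4', '0x8', '0x10', '0x20']
print([hex(b) for b in set_bits(added)])     # ['0x2000']
print(hex(removed))                          # 0x0
```

So the new value keeps all six low bits of 0x3f and adds the single bit 0x2000.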

I'll run with the current, seemingly functional build for a while, then go back to a "failing" build.

I'll also try to confirm that the above-mentioned commits are "responsible" for the change in behavior.

greearb commented 5 years ago

Just so I understand: is it generally working OK for you now, or is it still failing after 90 sec? If it is mostly fixed but other issues remain, it may be worth opening specific bugs for the remaining issues.

jeffsf commented 5 years ago

Correct, working as well as I would expect for a half-complete bring-up of a new device.

At least for me, I worry when things "fix themselves", so I'm trying to at least confirm that the commits mentioned above resolved the issue.

I'll close this out and, one way or another, report if I can confirm that was the case.

jeffsf commented 5 years ago

I wasn't able to convince myself of which specific commit resolved this. The issue has not recurred with the current OpenWrt master branch.