Closed shelterx closed 5 years ago
This is a crash I have seen before, and I added debugging. I think I understand the problem now. For reference, the issue is that the CT firmware cleans up some schedule items on peer deletion, and then later the schedule gets 'completed'. Simplistic 'fifo' sched handling logic caused us to look at the wrong schedule object. I have enabled a search over all existing schedule items in case an item is somehow not handled in fifo order, and added another bit of code that should just re-kick the scheduler and ignore the mismatched sched-id for the case that you hit. This bug was a regression introduced by previous attempts to fix some use-after-free bugs in the scheduler code.
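To make the described lookup order concrete, here is a minimal sketch in Python. It is not the firmware code (which is C and not shown in this thread); the function name `complete_sched`, the `pending` list, and the `"id"` key are all hypothetical, chosen only to illustrate the three cases: FIFO head match, search over all outstanding items, and re-kick on a stale sched-id.

```python
def complete_sched(pending, sched_id):
    """Handle a completion event for sched_id (illustrative only).

    pending is a list of dicts with an "id" key, oldest item first.
    Returns a tuple (action, item).
    """
    # Fast path: the completed item is the oldest one (plain FIFO case).
    if pending and pending[0]["id"] == sched_id:
        return "completed", pending.pop(0)

    # Fallback: search every outstanding item, in case a completion
    # arrives out of FIFO order.
    for index, item in enumerate(pending):
        if item["id"] == sched_id:
            return "completed", pending.pop(index)

    # No match at all, for example because the item was already cleaned
    # up on peer deletion: ignore the stale id and re-kick the scheduler.
    return "rekick", None
```

The point of the sketch is that a completion is never matched against the head of the queue blindly; a mismatched id either finds its real item or is dropped without touching an unrelated schedule object.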
Please try the attached firmware for 9984 to see if it works better for you. [deleted, it was invalid, see next comment]
FW stack trace:
0x0099ba07 RAM: tx_pfsched_completion_callback /home/greearb/git/digitalpath/qca-ct-3.5.3.50-9984/wlan/mac_core/src/wal/AR/tx_sched/tx_prefetch_sched.c:1470
0x4099ba07 RAM: tx_pfsched_completion_callback /home/greearb/git/digitalpath/qca-ct-3.5.3.50-9984/wlan/mac_core/src/wal/AR/tx_sched/tx_prefetch_sched.c:1470
0x8099e381 RAM: _tx_sch_sched_cmd_done /home/greearb/git/digitalpath/qca-ct-3.5.3.50-9984/wlan/mac_core/src/wal/AR/tx_sched/tx_sched_wifi_ip02.c:649
0x809972ce RAM: _tx_send_seq_trig_dsr_done /home/greearb/git/digitalpath/qca-ct-3.5.3.50-9984/wlan/mac_core/src/wal/AR/tx/wifi_ip02/ar_wal_tx_seq.c:2052
0x809949b2 RAM: _tx_send_completion_dsr_hdlr /home/greearb/git/digitalpath/qca-ct-3.5.3.50-9984/wlan/mac_core/src/wal/AR/tx/wifi_ip02/ar_wal_tx_send.c:9050
0x8098fc30 RAM: _tx_send_completion_dsr_hdlr_wrapper /home/greearb/git/digitalpath/qca-ct-3.5.3.50-9984/wlan/mac_core/src/wal/AR/tx/wifi_ip02/ar_wal_tx_send.c:1452
0x80963ad3 ROM: cmnos_intr_handle_pending_dsrs /local/mnt/workspace/CRMBuilds/CNSS.BL.3.0-00058-S-1_20150213_182825/b/cnss_proc/wlan/mac_core/src/os/common/cmnos_intrinf.c:335
0x80960e80 ROM: check_idle /local/mnt/workspace/CRMBuilds/CNSS.BL.3.0-00058-S-1_20150213_182825/b/cnss_proc/wlan/mac_core/src/os/athos/athos_main.c:2017
0x80960e51 ROM: athos_main /local/mnt/workspace/CRMBuilds/CNSS.BL.3.0-00058-S-1_20150213_182825/b/cnss_proc/wlan/mac_core/src/os/athos/athos_main.c:1998
0x80960e9d ROM: main /local/mnt/workspace/CRMBuilds/CNSS.BL.3.0-00058-S-1_20150213_182825/b/cnss_proc/wlan/mac_core/src/os/athos/athos_main.c:2051
0x40960024 ROM: _stext /local/mnt/workspace/CRMBuilds/CNSS.BL.3.0-00058-S-1_20150213_182825/b/cnss_proc/wlan/mac_core/src/os/athos/xtos/crt1-tiny.S:90
Sorry, previous binary attachment was not correct, please test this one instead. firmware-5-full-community.bin.gz
Will try and report back. (It doesn't happen very often and seems to be more frequent depending on what I stream so it might take 2-3 days).
No good at all. This FW crashes constantly, see provided file for more log output. kernellog.txt
[ 1435.435997] ath10k_pci 0001:01:00.0: firmware crashed! (guid n/a)
[ 1435.436096] ath10k_pci 0001:01:00.0: qca9984/qca9994 hw1.0 target 0x01000000 chip_id 0x00000000 sub 168c:cafe
[ 1435.441165] ath10k_pci 0001:01:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 0
[ 1435.455569] ath10k_pci 0001:01:00.0: firmware ver 10.4b-ct-9984-fW-012-bb3d19701 api 5 features mfp,peer-flow-ctrl,txstatus-noack,wmi-10.x-CT,ratemask-CT,regdump-CT,txrate-CT,flush-all-CT,pingpong-CT,ch-regs-CT,nop-CT,set-special-CT,tx-rc-CT,cust-stats-CT,txrate2-CT crc32 1279b325
[ 1435.462901] ath10k_pci 0001:01:00.0: board_file api 2 bmi_id 0:2 crc32 cf58c3bc
[ 1435.483979] ath10k_pci 0001:01:00.0: htt-ver 2.2 wmi-op 6 htt-op 4 cal pre-cal-file max-sta 32 raw 0 hwcrypto 1
[ 1435.493129] ath10k_pci 0001:01:00.0: firmware register dump:
[ 1435.501159] ath10k_pci 0001:01:00.0: [00]: 0x0000000A 0x00000000 0x0099B9FE 0x00000000
[ 1435.507062] ath10k_pci 0001:01:00.0: [04]: 0x00000000 0x00060024 0x00000000 0x00000000
[ 1435.514787] ath10k_pci 0001:01:00.0: [08]: 0x00000000 0x00000000 0x00000000 0x00000000
[ 1435.522688] ath10k_pci 0001:01:00.0: [12]: 0x00000000 0x00000000 0x00000000 0x00000000
[ 1435.530588] ath10k_pci 0001:01:00.0: [16]: 0x00985E47 0x009606CA 0x009606CA 0x0099B9FE
[ 1435.538487] ath10k_pci 0001:01:00.0: [20]: 0x00000000 0x00401C10 0x00000000 0x00000000
[ 1435.546384] ath10k_pci 0001:01:00.0: [24]: 0x00000000 0x00000000 0x00000000 0x00000000
[ 1435.554284] ath10k_pci 0001:01:00.0: [28]: 0x00000000 0x00000000 0x00000000 0x00000000
[ 1435.562184] ath10k_pci 0001:01:00.0: [32]: 0x00000000 0x00000000 0x00000000 0x00000000
[ 1435.570085] ath10k_pci 0001:01:00.0: [36]: 0x00000000 0x00000000 0x00000000 0x00000000
[ 1435.577995] ath10k_pci 0001:01:00.0: [40]: 0x00000000 0x00000000 0x00000000 0x00000000
[ 1435.585883] ath10k_pci 0001:01:00.0: [44]: 0x00000000 0x00000000 0x00000000 0x00000000
[ 1435.593782] ath10k_pci 0001:01:00.0: [48]: 0x00000000 0x00000000 0x00000000 0x00000000
[ 1435.601679] ath10k_pci 0001:01:00.0: [52]: 0x00000000 0x00000000 0x00000000 0x00000000
[ 1435.609580] ath10k_pci 0001:01:00.0: [56]: 0x00000000 0x00000000 0x00000000 0x00000000
[ 1435.617478] ath10k_pci 0001:01:00.0: Copy Engine register dump:
[ 1435.625387] ath10k_pci 0001:01:00.0: [00]: 0x0004a000 11 11 3 3
[ 1435.631200] ath10k_pci 0001:01:00.0: [01]: 0x0004a400 31 31 421 422
[ 1435.637814] ath10k_pci 0001:01:00.0: [02]: 0x0004a800 62 62 61 62
[ 1435.644221] ath10k_pci 0001:01:00.0: [03]: 0x0004ac00 0 0 2 0
[ 1435.650643] ath10k_pci 0001:01:00.0: [04]: 0x0004b000 453 453 40 0
[ 1435.657066] ath10k_pci 0001:01:00.0: [05]: 0x0004b400 23 23 118 119
[ 1435.663490] ath10k_pci 0001:01:00.0: [06]: 0x0004b800 22 22 22 22
[ 1435.669913] ath10k_pci 0001:01:00.0: [07]: 0x0004bc00 1 1 1 1
[ 1435.676337] ath10k_pci 0001:01:00.0: [08]: 0x0004c000 0 0 127 0
[ 1435.682761] ath10k_pci 0001:01:00.0: [09]: 0x0004c400 0 0 0 0
[ 1435.689184] ath10k_pci 0001:01:00.0: [10]: 0x0004c800 0 0 0 0
[ 1435.695607] ath10k_pci 0001:01:00.0: [11]: 0x0004cc00 0 0 0 0
[ 1435.704055] ath10k_pci 0001:01:00.0: debug log header, dbuf: 0x422fb8 dropped: 0
[ 1435.709472] ath10k_pci 0001:01:00.0: [0] next: 0x422fa0 buf: 0x4195d0 sz: 1500 len: 28 count: 1 free: 0
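For anyone comparing several of these crash logs, a small script can pull the firmware register dump words out of the dmesg text so they can be diffed between crashes. This is only a sketch: the function name `parse_fw_register_dump` is made up for illustration, and the parsing assumes the line layout seen in the log above (a two-digit index in brackets followed by four 8-digit hex words). It makes no claim about what each register means.

```python
import re

# Matches e.g. "[16]: 0x00985E47 0x009606CA 0x009606CA 0x0099B9FE".
REG_LINE = re.compile(r"\[(\d{2})\]: ((?:0x[0-9A-Fa-f]{8} ?){4})")


def parse_fw_register_dump(log_text):
    """Return the firmware register dump words as a flat list of ints."""
    # Restrict parsing to the text between the two section headers so
    # Copy Engine lines are never mistaken for register words.
    start = log_text.find("firmware register dump:")
    if start < 0:
        return []
    end = log_text.find("Copy Engine register dump:", start)
    section = log_text[start:end if end >= 0 else len(log_text)]

    words = {}
    for match in REG_LINE.finditer(section):
        base = int(match.group(1))
        for offset, word in enumerate(match.group(2).split()):
            words[base + offset] = int(word, 16)
    # Order by register index, independent of line order in the log.
    return [words[i] for i in sorted(words)]
```

Run against the log above, word 2 comes out as 0x0099B9FE, which is a few bytes below the 0x0099ba07 frame in the earlier stack trace. That suggests it may be the faulting program counter, but the register layout is an assumption here, not something stated in this thread.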
Sorry about that, I had a logic flaw in the last patch. Please try this one instead. And, please run with debug-level of 0xc0000020 and send me 'dmesg' output after the system has been running for a bit even if it doesn't crash or have obvious issues.
Wifi went dead with that firmware, devices got connected but no internet, couldn't ping them either. dmesg.txt
On 1/3/19 3:01 PM, shelterx wrote:
Wifi went dead with that firmware, devices got connected but no internet, couldn't ping them either. dmesg.txt https://github.com/greearb/ath10k-ct/files/2725550/dmesg.txt
Seems some sort of bad interaction with powersave. Can you get another log sooner after startup where dmesg still shows at least some of the initial bootup text? I am hoping to better understand how it gets to this broken state.
Thanks, Ben
-- Ben Greear greearb@candelatech.com Candela Technologies Inc http://www.candelatech.com
Here is another image. It will likely assert early in your test case, but hopefully the resulting logs will let me better understand the problem. Can you also let me know the device(s) that connect to your AP? Maybe we can reproduce the issue locally.
The dmesg buffer fills up so quickly. I tried to pipe it to a file, but the file came out empty. Here's some debug info together with a crash, though. Connected devices are usually iPhone 8 Plus, AppleTV 4k, ChromeCast Ultra, iPad Air and Raspberry Pi 2. dmesg.txt
Here's another log right after start, no crash but no working wifi. First part is from logread, it's continued in the 2_dmesg.txt file. 1_logread.txt.txt 2_dmesg.txt
I backed out part of the code that originally triggered these issues. This probably means there is still a use-after-free bug in the code, but probably it is quite rare, and maybe I can find some other way to work around it that does not have the tx-stall and related issues. Please try the attached firmware: [edit, snip] Here is a proper image, previous one was missing the intended change.
Wifi is dead with that image. Connects but nothing works.
Please post dmesg so I can double-check it is expected version etc. I'll go back and back out more of the previous troublesome commit later today.
Can't test right now but it's the version you posted above, I'm 99% sure of it.
Here it is. dmesg-2019-01-08.txt
I found yet another logic bug in the code in question. I am going to try to fix that and test with a co-worker's iPhone to see if we can verify at least basic functionality...hopefully will have something worth testing tomorrow.
Ok, here is another attempt. It works with my android phone, at least. firmware-5-full-community.bin.gz
Nope, no go with firmware ver 10.4b-ct-9984-fW-012-51585cf99 api 5. Devices show as connected both in OpenWRT and on the device itself. The iPad Air loaded a page in Safari, then every connection went dead and it can't connect anywhere. I also noticed that the AppleTV connects at lower rates than with the official firmware-5.bin_10.4-3.9.0.1-00008. The official firmware is actually flawless for me.
Sorry, I wish I could reproduce it. Here is another build...this disables the 'reorder' logic in the sched callback....maybe that was the problem. firmware-5-full-community.bin.gz
@greearb I also had daily crashes on my Archer C5 when it had a lot of traffic and many clients. I have now installed a backup router (Archer C7) with the latest stable OpenWrt and moved all clients there. With only 1 client using the C5 with the current Git build, the router has not crashed in the last 5 days. Normally there are ~20 systems (4 PCs, 16 IoT devices) connected.
Disabling the reorder logic seems to have fixed it. Tested Apple TV, iPhone, iPad, LGwebOSTV and Raspberry Pi. Now we'll have to wait and see how it does in the long run. I hope you can get the real cause of the bug fixed tho'.
The upstream code completely ignored the reordering. That seemed wrong to me, but maybe it works well enough anyway. Possibly I only need to pay attention to reordering in very certain cases. Let me know if you see more crashes or problems.
Yes, I ran the AppleTV for a while today which caused issues before. I also played around with the iPad, no crashes or issues yet.
I'd say it works now. I haven't seen the CT firmware this stable on my R7800 before. All connected devices' TX/RX rates look normal too. I haven't done any benchmarking, so I can't really say anything about that, but normal internet bandwidth measurements are all good.
Just crashed again. dmesg-2019-01-13.txt
Again dmesg-2019-01-13-2.txt
Those crashes are the same as for bug 58 it seems. Please try this image, it has more debugging to help track down this issue. firmware-5-full-community.bin.gz
No crash yet with the above image, oddly enough.
Closing this bug, will track the rate-ctrl crash in bug 58.
Here's a crash I got on the R7800