Closed johnylin76 closed 1 year ago
@bkokoszx @mwasko @abonislawski FYI. unfortunately another one.
@johnylin76 do you have chance to capture the FW log? @1994lwz can you try to reproduce it?
Prior to the crash we have a lot of IPC errors and attempts at setting a large config to component 57. What is component 57 ?
2021-09-15T05:38:36.475685Z WARNING cras_server[1850]: No chmap queried! Skip chmap set
2021-09-15T05:38:36.581500Z ERR kernel: [ 853.364841] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:36.581518Z ERR kernel: [ 853.364845] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:38:40.095015Z ERR kernel: [ 856.876570] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:40.095029Z ERR kernel: [ 856.876574] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:38:47.076017Z ERR kernel: [ 863.854180] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:47.076031Z ERR kernel: [ 863.854184] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:38:50.020019Z ERR kernel: [ 866.796418] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:50.020033Z ERR kernel: [ 866.796422] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:38:52.253016Z ERR kernel: [ 869.028375] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:52.253032Z ERR kernel: [ 869.028379] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:38:55.737015Z ERR kernel: [ 872.510302] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:55.737029Z ERR kernel: [ 872.510306] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:38:57.770043Z ERR kernel: [ 874.542636] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:57.770057Z ERR kernel: [ 874.542640] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:01.463014Z ERR kernel: [ 878.233582] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:01.463028Z ERR kernel: [ 878.233585] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:03.458016Z ERR kernel: [ 880.227898] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:03.458029Z ERR kernel: [ 880.227903] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:03.372964Z WARNING cras_server[1850]: message repeated 5 times: [ No chmap queried! Skip chmap set]
2021-09-15T05:39:04.521931Z INFO ARCVM(32)[4802]: [ 349.525556] healthd: battery l=100 v=13080 t=0.0 h=1 st=4 c=0 fc=4085000 cc=0 chg=a
2021-09-15T05:39:06.815017Z ERR kernel: [ 883.582875] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:06.815035Z ERR kernel: [ 883.582879] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:08.682024Z WARNING cras_server[1850]: No chmap queried! Skip chmap set
2021-09-15T05:39:08.766045Z ERR kernel: [ 885.532714] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:08.766057Z ERR kernel: [ 885.532719] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:11.680015Z ERR kernel: [ 888.445989] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:11.680029Z ERR kernel: [ 888.445994] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:13.603026Z ERR kernel: [ 890.367241] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:13.603040Z ERR kernel: [ 890.367245] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:16.561020Z ERR kernel: [ 893.323743] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:16.561042Z ERR kernel: [ 893.323747] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:13.516979Z WARNING cras_server[1850]: No chmap queried! Skip chmap set
2021-09-15T05:39:17.221101Z INFO ARCVM(32)[4802]: [ 362.218463] healthd: battery l=100 v=13080 t=0.0 h=1 st=4 c=0 fc=4085000 cc=0 chg=a
2021-09-15T05:39:19.047926Z WARNING cras_server[1850]: No chmap queried! Skip chmap set
2021-09-15T05:39:19.131016Z ERR kernel: [ 895.892874] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:19.131032Z ERR kernel: [ 895.892879] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:22.043015Z ERR kernel: [ 898.803244] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:22.043030Z ERR kernel: [ 898.803248] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:24.585019Z ERR kernel: [ 901.344545] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:24.585036Z ERR kernel: [ 901.344549] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:24.499112Z WARNING cras_server[1850]: No chmap queried! Skip chmap set
2021-09-15T05:39:29.857847Z NOTICE temp_logger[6248]: x86_pkg_temp:43C INT3400_Thermal:20C TSR0:35C TSR1:33C TSR2:37C TSR3:35C TCPU:45C iwlwifi_1:34C PL1:17.000W
2021-09-15T05:39:30.401768Z WARNING cras_server[1850]: No chmap queried! Skip chmap set
2021-09-15T05:39:30.488021Z ERR kernel: [ 907.244012] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:30.488034Z ERR kernel: [ 907.244016] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:33.362018Z ERR kernel: [ 910.116573] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:33.362032Z ERR kernel: [ 910.116578] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:35.593991Z ERR kernel: [ 912.347471] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:35.594012Z ERR kernel: [ 912.347475] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:38.566053Z ERR kernel: [ 915.318051] sof-audio-pci 0000:00:1f.3: error : DSP panic!
2021-09-15T05:39:38.566071Z ERR kernel: [ 915.318070] sof-audio-pci 0000:00:1f.3: status: fw entered - code 00000005
2021-09-15T05:39:38.566071Z ERR kernel: [ 915.318158] sof-audio-pci 0000:00:1f.3: error: runtime exception
2021-09-15T05:39:38.566072Z ERR kernel: [ 915.318159] sof-audio-pci 0000:00:1f.3: error: trace point 00004000
2021-09-15T05:39:38.566073Z ERR kernel: [ 915.318160] sof-audio-pci 0000:00:1f.3: error: panic at :0
2021-09-15T05:39:38.566074Z ERR kernel: [ 915.318162] sof-audio-pci 0000:00:1f.3: error: DSP Firmware Oops
2021-09-15T05:39:38.566080Z ERR kernel: [ 915.318163] sof-audio-pci 0000:00:1f.3: error: Exception Cause: LoadStorePIFDataErrorCause, Synchronous PIF data error during LoadStore access
2021-09-15T05:39:38.566081Z ERR kernel: [ 915.318164] sof-audio-pci 0000:00:1f.3: EXCCAUSE 0x0000000d EXCVADDR 0x0000bb88 PS 0x00060b25 SAR 0x00000000
2021-09-15T05:39:38.566082Z ERR kernel: [ 915.318165] sof-audio-pci 0000:00:1f.3: EPC1 0xbe02e437 EPC2 0xbe02d7e6 EPC3 0x00000000 EPC4 0x00000000
2021-09-15T05:39:38.566082Z ERR kernel: [ 915.318167] sof-audio-pci 0000:00:1f.3: EPC5 0x00000000 EPC6 0x00000000 EPC7 0x00000000 DEPC 0x00000000
2021-09-15T05:39:38.566084Z ERR kernel: [ 915.318168] sof-audio-pci 0000:00:1f.3: EPS2 0x00060120 EPS3 0x00000000 EPS4 0x00000000 EPS5 0x00000000
2021-09-15T05:39:38.566084Z ERR kernel: [ 915.318169] sof-audio-pci 0000:00:1f.3: EPS6 0x00000000 EPS7 0x00000000 INTENABL 0x00000000 INTERRU 0x00000262
2021-09-15T05:39:38.566085Z ERR kernel: [ 915.318170] sof-audio-pci 0000:00:1f.3: stack dump from 0xbe143d40
2021-09-15T05:39:38.566085Z ERR kernel: [ 915.318172] sof-audio-pci 0000:00:1f.3: 0xbe143d40: 00000000 3ff00000 3fe1feb3 41b00000
2021-09-15T05:39:38.566086Z ERR kernel: [ 915.318173] sof-audio-pci 0000:00:1f.3: 0xbe143d44: 00000000 00000000 00000000 00000000
2021-09-15T05:39:38.566086Z ERR kernel: [ 915.318174] sof-audio-pci 0000:00:1f.3: 0xbe143d48: 00000000 00000000 00000000 00000000
2021-09-15T05:39:38.566087Z ERR kernel: [ 915.318176] sof-audio-pci 0000:00:1f.3: 0xbe143d4c: f66ba030 ffff9667 01cebdc0 ffffaed4
2021-09-15T05:39:38.566088Z ERR kernel: [ 915.318177] sof-audio-pci 0000:00:1f.3: 0xbe143d50: 00000000 00000000 b641b700 ffffffff
2021-09-15T05:39:38.566088Z ERR kernel: [ 915.318178] sof-audio-pci 0000:00:1f.3: 0xbe143d54: b05d7828 ffff9667 000c0800 00000000
2021-09-15T05:39:38.566089Z ERR kernel: [ 915.318179] sof-audio-pci 0000:00:1f.3: 0xbe143d58: 00000000 00000000 01cebe50 ffffaed4
2021-09-15T05:39:38.566090Z ERR kernel: [ 915.318180] sof-audio-pci 0000:00:1f.3: 0xbe143d5c: 01cebe78 ffffaed4 c05ac157 ffffffff
oh it appears the oops is after the set large config to comp 57 (see time delay)
2021-09-15T05:39:35.594012Z ERR kernel: [ 912.347475] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:38.566053Z ERR kernel: [ 915.318051] sof-audio-pci 0000:00:1f.3: error : DSP panic!
The other crashes show similar delays with previous IPC errors.
Can we reproduce with kernel IPC logs ON and FW logs ON ?
This issue is reproduce on the chrome platform at SHZ1 site.
[core dump]
# Core header:
# arch 00000001 # totalsize 0000017c # stackptr be143d40 # stackoffset 000001a8
# configidhi c2f3fbfe # configidlo 1cc6c29b # numaregs 00000040
# CPU registers:
# exccause 0000000d # excvaddr 0000bb88 # ps 00060225
# epc1 be02e437 # epc2 be02d7e6 # epc3 00000000 # epc4 00000000
# epc5 00000000 # epc6 00000000 # epc7 00000000
# eps2 00060a20 # eps3 00000000 # eps4 00000000 # eps5 00000000
# eps6 00000000 # eps7 00000000
# depc 00000000 # intenable 00000000 # interrupt 00000262 # sar 00000000
# debugcause 00000000
# windowbase 0000000e # windowstart 00005400
# excsave1 00000000
# ar56 be04a59f # ar57 be143d50 # ar58 be00881c # ar59 1cc6c29b
# ar60 c2f3fbfe # ar61 0000017c # ar62 00000001 # ar63 be143d40
# ar0 be04a51f # ar1 be143d30 # ar2 be008f28 # ar3 00000016
# ar4 00000028 # ar5 00000538 # ar6 9e266400 # ar7 00000004
# ar8 be03ec8b # ar9 be143d00 # ar10 be0089a8 # ar11 be143d40
# ar12 00000538 # ar13 00000800 # ar14 be2d2480 # ar15 00000003
# ar16 be03ebfb # ar17 be143ce0 # ar18 be0089a8 # ar19 be144278
# ar20 00000000 # ar21 be008ee0 # ar22 be02da67 # ar23 be146b30
# ar24 be03ebfb # ar25 be143cc0 # ar26 be00a670 # ar27 00000000
# ar28 00000000 # ar29 be00ae70 # ar30 00000005 # ar31 00000005
# ar32 be144170 # ar33 be144150 # ar34 be2d8e00 # ar35 00000000
# ar36 be144180 # ar37 00000000 # ar38 be03dc34 # ar39 00060826
# ar40 fe04fd68 # ar41 be143d90 # ar42 be143dc0 # ar43 0000000d
# ar44 00000000 # ar45 9e2664d0 # ar46 9e266480 # ar47 be2d8e4c
# ar48 be02e449 # ar49 be143d60 # ar50 0dead006 # ar51 00000000
# ar52 be143d90 # ar53 00000000 # ar54 00000001 # ar55 00000001
# windowbase: E
# 0 1 2
# windowstart: b101010000000000
# reg a0 a1
# (return) (sptr)
# --- -------- -------
# 0 # ar56 be04a59f be143d50
# 1 # ar48 be02e449 be143d60
# 2 # ar40 fe04fd68 be143d90
# Stack dumped from be143d40 dwords num 406
# *EXCEPTION*
# exccause: LoadStorePIFDataErrorCause: ynchronous PIF data error during LoadStore access
# excvaddr: 48008
@johnylin76 do you have chance to capture the FW log?
Here are the logs with kernel IPC logs ON and FW logs ON: sof-logger.txt messages.txt
The logs show 2 things 1) What appears to be heavy load on the DSP, as the scheduler cant shedule the agent on time ?
TIMESTAMP (us) DELTA C# COMPONENT LOCATION CONTENT ktime=370.318s @ 2021-09-17 17:42:59 +0800 CST
Found valid LDC address after skipping 48 bytes (one line uses 20 + 0 to 16 bytes)
[ 0.000000] ( 0.000000) c0 unknown 0.0 src/lib/agent.c:79 WARN validate(), ll drift detected, delta = 51426
[ 231528321.320724] ( 231528320.000000) c0 sa src/lib/agent.c:79 WARN validate(), ll drift detected, delta = 51273
[ 231568316.631635] ( 39995.312500) c0 sa src/lib/agent.c:79 WARN validate(), ll drift detected, delta = 51290
[ 231608321.109212] ( 40004.476562) c0 sa src/lib/agent.c:79 WARN validate(), ll drift detected, delta = 51377
[ 231648337.305539] ( 40016.195312) c0 sa src/lib/agent.c:79 WARN validate(), ll drift detected, delta = 51574
[ 231688318.553950] ( 39981.250000) c0 sa src/lib/agent.c:79 WARN validate(), ll drift detected, delta = 51322
[ 231728324.073194] ( 40005.519531) c0 sa src/lib/agent.c:79 WARN validate(), ll drift detected, delta = 51435
But the delay is consistent, meaning we are likely scheduling agent in a timely manner but perhaps with the wrong clock or timebase.
2) We are doing a lot (~200) IPC configuration messages prior to the crash.
2021-09-17T09:38:57.731047Z DEBUG kernel: [ 128.874270] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.732027Z DEBUG kernel: [ 128.874692] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.732037Z DEBUG kernel: [ 128.874705] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.732040Z DEBUG kernel: [ 128.875252] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.732050Z DEBUG kernel: [ 128.875301] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.733017Z DEBUG kernel: [ 128.875705] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.733030Z DEBUG kernel: [ 128.875716] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.733037Z DEBUG kernel: [ 128.876259] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.733040Z DEBUG kernel: [ 128.876299] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.734022Z DEBUG kernel: [ 128.876699] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.734030Z DEBUG kernel: [ 128.876710] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.734040Z DEBUG kernel: [ 128.877260] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.734042Z DEBUG kernel: [ 128.877293] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.735041Z DEBUG kernel: [ 128.877708] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.735044Z DEBUG kernel: [ 128.877730] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.736014Z ERR kernel: [ 128.878698] sof-audio-pci 0000:00:1f.3: error : DSP panic!
2021-09-17T09:38:57.736019Z DEBUG kernel: [ 128.878702] sof-audio-pci 0000:00:1f.3: panic: dsp_oops_offset 788480 offset 788480
2021-09-17T09:38:57.736020Z ERR kernel: [ 128.878718] sof-audio-pci 0000:00:1f.3: status: fw entered - code 00000005
2021-09-17T09:38:57.736021Z ERR kernel: [ 128.878806] sof-audio-pci 0000:00:1f.3: error: runtime exception
We are not doing anything else at this time so could be a race when we are copying config data.
This issue is reproduce on the chrome platform at SHZ1 site.
Hi @1994lwz , may I know which device do you use? Delbin? Drobit? Do you know the phase (Proto/DVT/EVT) and the SKU of your device? Thanks.
This issue is reproduce on the chrome platform at SHZ1 site.
Hi @1994lwz , may I know which device do you use? Delbin? Drobit? Do you know the phase (Proto/DVT/EVT) and the SKU of your device? Thanks.
the device is Delbin EVT SKU2.
@1994lwz @bkokoszx any update or findings so far ?
@lgirdwood Not yet, I was debugging problem with D0i3->D0 multicore transition.
Is there any way to reproduce this issue by cmd line?
Unfortunately I still cannot make a script to successfully reproduce the issue with "cras_test_client --plug
However, I can still reproduce the issue on my Drobit device by the manual steps, although it is also not easy to make it happen. My strategy is rebooting the device and trying again when I failed to reproduce the issue after 30~40 times of headset plug-and-unplug iterations. (I rebooted 3 times and finally hit it...)
Attached logs: messages.txt sof-logger.txt
The log from sof-logger looks weird and unfinished, but it is indeed the whole log I could retrieve after hitting the issue.
@johnylin76 Are you able to provide core dump file (after reproduction) with corresponding firmware .lst file (sof-tgl.lst - one of a building artifact)?
@bkokoszx Do you mean the core dump file generated by the method as mentioned in sof/issues/4648? I have generated one. Please see as attached:
@johnylin76
Yes, but additionally .lst file will be helpful (sof-tgl.lst). It contains disassembly of the code, so that we can associate addresses from coredump file provided by you with the code.
.lst file is one of the build artifacts (in
here's a tentative fix for this issue: https://github.com/thesofproject/sof/pull/4833
Still being tested
here's a tentative fix for this issue: #4833
Where would the crash happen here? I am agree it is a good fix but I am having trouble reasoning out the race that would leave the lib with either freed memory or an incorrect size.
Where would the crash happen here? I am agree it is a good fix but I am having trouble reasoning out the race that would leave the lib with either freed memory or an incorrect size.
@cujomalainey I'm not certain what could cause the crash given that we're likely sending the same blob. So, theoretically the size should remain the same. But from the logs we see that there are 2 blobs (or the same blob senttwice) sent during headset removal/pug-in. Do you know what those 2 blobs are for?
@ranj063 we know they are the hotword models, but something is being weird in how they are sent. Previously on 012 this was not an issue. FYI i now have a script you can run to repro this. Repro times range from 5s to 10min
cras_test_client --capture_f /dev/null &
cras_test_client --listen /dev/null &
while true; do cras_test_client --set_hotword_model 15:0:null & cras_test_client --set_hotword_model 15:0:en_all; done
@cujomalainey do you mind trying my PR to see if the panic still hapens on your device?
I don't have the latest IGO lib. @johnylin76 can test. He will be coming online soon.
I don't have the latest IGO lib. @johnylin76 can test. He will be coming online soon.
with my PR i'm able to hit the issue much sooner than without
@ranj063 I've tried to reproduce panic remotely using @cujomalainey script and I did not hit reproduction. Only I saw ipc errors:
[ 290.425084] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
[ 290.425086] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
[ 290.426537] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
[ 290.426539] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
Update: I think I hit the repro. I will double check and let you know.
According to @bzhg when he was checking crash, the actual crash is inside the microspeech lib on a load, but he can't tell what it is loading. I am betting we are pulling memory out from under its feet and causing it to jump into oblivion. As to how I am not sure.
According to @bzhg when he was checking crash, the actual crash is inside the microspeech lib on a load, but he can't tell what it is loading. I am betting we are pulling memory out from under its feet and causing it to jump into oblivion. As to how I am not sure.
@cujomalainey do you think that instead of realloc'ing the blob_handler->data pointer to a new pointer everytime, allocating data with a max size and only updating it would help solve the issue?
@ranj063 it would be an interesting data point. I wonder if maybe we need to add trace errors and flushes to see if we are maybe hitting those paths on a crash
It loos like with @ranj063 change (#4833) still I can reproduce DSP panic. The backtrace retrieved from coredumper is as follows:
#0 0xbe05fd00 in literals ()
#1 0xbe049a6f in dump_stack (p=3187705884, addr=0x1cc6c29b, offset=3270769662, limit=380, stack_ptr=0x1) at /home/admin/Documents/sof/sof/src/arch/xtensa/include/arch/lib/cache.h:79
#2 0xbe049aef in panic_dump (p=233492486, panic_info=0x0, data=0xbe142d90 <primary_core_data+3216>) at /home/admin/Documents/sof/sof/src/arch/xtensa/include/arch/debug/panic.h:45
#3 0xbe02e449 in exception () at /home/admin/Documents/sof/sof/src/arch/xtensa/init.c:115
#4 0xbe04f2e8 in _GeneralException ()
#5 0xbe06069b in GoogleHotwordProcessSamples ()
#6 0xbe04cfc3 in GoogleHotwordDspProcess ()
#7 0xbe03963b in ghd_detect (dev=0xbe2d8d00, stream=0x2, samples=0xbe177ac8, bytes=640) at /home/admin/Documents/sof/sof/src/audio/google_hotword_detect.c:361
#8 0xbe039734 in ghd_copy (dev=0xbe2d8d00) at /home/admin/Documents/sof/sof/src/audio/google_hotword_detect.c:426
#9 0xbe03dbc2 in comp_copy (dev=0xbe2d8d00) at /home/admin/Documents/sof/sof/src/include/sof/audio/component_ext.h:203
#10 0xbe03dcf2 in pipeline_comp_copy (current=0xbe2d8d00, calling_buf=0xbe143180 <primary_core_data+4224>, ctx=0xbe143180 <primary_core_data+4224>, dir=0)
#11 0xbe03db05 in pipeline_for_each_comp (current=0xbe2d8d00, ctx=0xbe143180 <primary_core_data+4224>, dir=0) at /home/admin/Documents/sof/sof/src/audio/pipeline/pipeline-graph.c:431
#12 0xbe03dcde in pipeline_comp_copy (current=0xbe2d8e00, calling_buf=0xbe143180 <primary_core_data+4224>, ctx=0xbe143180 <primary_core_data+4224>, dir=0)
#13 0xbe03dd5c in pipeline_copy (p=0xbe2d2500) at /home/admin/Documents/sof/sof/src/audio/pipeline/pipeline-stream.c:165
#14 0xbe03e7b3 in pipeline_task (arg=0xbe2d2500) at /home/admin/Documents/sof/sof/src/audio/pipeline/pipeline-schedule.c:58
#15 0xbe04aa61 in schedule_ll_tasks_execute (sch=0xbe250180) at /home/admin/Documents/sof/sof/src/include/sof/schedule/task.h:88
#16 0xbe04ab51 in schedule_ll_tasks_run (data=0xbe250180) at /home/admin/Documents/sof/sof/src/schedule/ll_schedule.c:231
#17 0xbe041cfd in irq_lvl2_level2_handler (data=0x400000) at /home/admin/Documents/sof/sof/src/drivers/intel/cavs/interrupt.c:87
#18 0xbe04fe20 in _Level2FromVector ()
#19 0xbe02d7e6 in arch_wait_for_interrupt (level=0) at /home/admin/Documents/sof/sof/src/arch/xtensa/include/arch/lib/wait.h:40
Hi @bkokoszx do you reproduce by script or manual plug/unplug headset? I adapted #4483 commit and tried to reproduce by manual plug/unplug method in these 2 days, but I didn't hit the panic issue anymore. (however it could just be the luck..)
@johnylin76 is the PR nunmber correct? 4483 seems related to zephyr?
@johnylin76 I was trying to reproduce by script.
Hi @bkokoszx do you reproduce by script or manual plug/unplug headset? I adapted #4483 commit and tried to reproduce by manual plug/unplug method in these 2 days, but I didn't hit the panic issue anymore. (however it could just be the luck..)
My observation matches with @johnylin76 's. I applied only PR: https://github.com/thesofproject/sof/pull/4833 .
I can not reproduce "DSP panic issue" in manual test.
Manual test steps are same as the one mentioned in bug description + GHW running in background..
Sorry I mean #4833, and same test steps as @sathya-nujella (as bug description + GHW running in background)
4 0xbe04f2e8 in _GeneralException ()
5 0xbe06069b in GoogleHotwordProcessSamples ()
6 0xbe04cfc3 in GoogleHotwordDspProcess ()
7 0xbe03963b in ghd_detect (dev=0xbe2d8d00, stream=0x2, samples=0xbe177ac8, bytes=640) at /home/admin/Documents/sof/sof/src/audio/google_hotword_detect.c:361
8 0xbe039734 in ghd_copy (dev=0xbe2d8d00) at /home/admin/Documents/sof/sof/src/audio/google_hotword_detect.c:426
9 0xbe03dbc2 in comp_copy (dev=0xbe2d8d00) at /home/admin/Documents/sof/sof/src/include/sof/audio/component_ext.h:203
@johnylin76, @cujomalainey based on coredump backtrace the panic occurs in Google Hotword module on sample processing. Can we assign Google Hotword developer to involve in issue debug and help to find issue root cause?
I've added following trace points to the code:
diff --git a/src/audio/google_hotword_detect.c b/src/audio/google_hotword_detect.c
index a2b5a18..1052fe5 100644
--- a/src/audio/google_hotword_detect.c
+++ b/src/audio/google_hotword_detect.c
@@ -235,6 +235,8 @@ static int ghd_ctrl_set_bin_data(struct comp_dev *dev,
struct comp_data *cd = comp_get_drvdata(dev);
int ret;
+ trace_point(0x5000);
+
switch (cdata->data->type) {
case GOOGLE_HOTWORD_DETECT_MODEL:
// comp_info(dev, "ghd_ctrl_set_bin_data(): comp_data_blob_set_cmd");
@@ -247,6 +249,8 @@ static int ghd_ctrl_set_bin_data(struct comp_dev *dev,
cdata->data->type);
return -EINVAL;
}
+
+ trace_point(0x5001);
}
static int ghd_ctrl_set_data(struct comp_dev *dev,
@@ -341,6 +345,8 @@ static void ghd_detect(struct comp_dev *dev,
uint32_t sample_bytes;
int ret;
+ trace_point(0x6002);
+
if (cd->detected)
return;
@@ -356,10 +362,14 @@ static void ghd_detect(struct comp_dev *dev,
cd->history_bytes += bytes;
}
+ trace_point(0x6003);
+
comp_dbg(dev, "GoogleHotwordDspProcess(0x%x, %u)",
(uint32_t)samples, bytes / sample_bytes);
ret = GoogleHotwordDspProcess(samples, bytes / sample_bytes,
&preamble_length_ms);
+ trace_point(0x6004);
+
if (ret == 1) {
cd->detected = 1;
@@ -390,6 +400,8 @@ static int ghd_copy(struct comp_dev *dev)
uint32_t bytes, tail_bytes, head_bytes = 0;
int ret;
+ trace_point(0x6000);
+
comp_info(dev, "ghd_copy()");
/* Check for new model */
@@ -424,6 +436,8 @@ static int ghd_copy(struct comp_dev *dev)
else
head_bytes = bytes - tail_bytes;
+ trace_point(0x6001);
+
if (tail_bytes)
ghd_detect(dev, stream, stream->r_ptr, tail_bytes);
if (head_bytes)
@@ -432,6 +446,8 @@ static int ghd_copy(struct comp_dev *dev)
/* calc new available */
comp_update_buffer_consume(source, bytes);
+ trace_point(0x6005);
+
return 0;
}
And after reproduction with script: https://github.com/thesofproject/sof/issues/4769#issuecomment-931755456 and firmware with applied #4833, I got following regs hexdump:
localhost /sys/kernel/debug/sof # hexdump regs
0000000 0005 0000 d006 0dea 6003 0000 0000 0000
0000010 0000 0000 4000 0000 01c2 0251 0000 0000
0000020 0000 0000 0000 0000 0000 0000 0000 0000
*
0001000
The last trace point is equal to 0x6300, which confirms (due to backtrace) that firmware gets stuck somewhere in GoogleHotwordDspProcess()
. At the moment I do not know why, but maybe there is some problem with model data - it requires more debugging.
I believe that team is quite busy right now but we will see what we can do. Can we do a trace to verify that the memory it owns does not get tampered with?
If you would like to check values of specific variable you can do it using e.g. trace_points:
diff --git a/src/audio/google_hotword_detect.c b/src/audio/google_hotword_detect.c
index 8c46eaa785c8..1f3da320ebe3 100644
--- a/src/audio/google_hotword_detect.c
+++ b/src/audio/google_hotword_detect.c
@@ -343,6 +343,8 @@ static void ghd_detect(struct comp_dev *dev,
if (cd->detected)
return;
+ mailbox_sw_reg_write(SRAM_REG_DEBUG_WIN_1, bytes);
+
/* Assuming 1 channel, verified in ghd_params.
*
* TODO Make the logic multi channel safe when new hotword library can
diff --git a/src/include/kernel/mailbox.h b/src/include/kernel/mailbox.h
index c9fad0885b69..f6e707c8be02 100644
--- a/src/include/kernel/mailbox.h
+++ b/src/include/kernel/mailbox.h
@@ -36,7 +36,8 @@
#define SRAM_REG_R0_STATE_TRACE SRAM_REG_R_STATE_TRACE_BASE
#define SRAM_REG_R1_STATE_TRACE 0x38
#define SRAM_REG_R2_STATE_TRACE 0x40
-#define SRAM_REG_FW_END (SRAM_REG_R2_STATE_TRACE + 0x8)
+#define SRAM_REG_DEBUG_WIN_1 0x48
+#define SRAM_REG_FW_END (SRAM_REG_R2_STATE_TRACE + 0x4)
/** @}*/
And dump it by command:
hexdump /sys/kernel/debug/sof/regs
You can define bigger or several debug "windows" and dump/check bigger part of memory.
@sathya-nujella @johnylin76 @cujomalainey @bkokoszx I have a one line change which fixes quite some cache coherency issues when debugging multi-core + dynamic pipelines, can you help to give it quick try if it help: https://github.com/keyonjie/sof/commit/c5d156d6927bb2498d1da26fea5155e675d31375
@bkokoszx thanks for the helpful tip. Will definitely remember to try that out. @johnylin76 can you test @keyonjie's change?
Unfortunately by applying @keyonjie's change it is still easy to reproduce DSP panic issue by script (less than running 1 minute). Here are the logs: messages.txt sof-logger.txt
Thanks for trying it out @johnylin76 . From the FW logs we need PR: #4833 @bkokoszx
When I tried I also included PR#4833. According to https://github.com/thesofproject/sof/issues/4769#issuecomment-934368687 DSP panic issue still can be reproduced by script with PR#4833
When I tried I also included PR#4833. According to #4769 (comment) DSP panic issue still can be reproduced by script with PR#4833
Sorry didn't notice to this, so let's focus on @bkokoszx 's suggestion and debugging then.
I can also reproduce panic issue with script with @keyonjie change.
@cujomalainey @johnylin76 Did you find any time to look into this panic issue from Google perspective?
@bkokoszx I have not, I am sheriff for my team this week so my time is thin, @johnylin76 might have. I still think this is likely a data corruption issue caused by stressing malloc given the memory map works fine when not constantly being swapped. It might be worth re-enabling the CRC and seeing if we catch some CRC errors.
@cujomalainey Thanks for update
We are testing the CRAS workaround right now, but still underlying error exists
So our workaround prevents the crash, but we have a new crash that is blocking the release
@cujomalainey are both crashes solved now ? I've not seen any updates since the DMA fixes were merged so I'm assuming were are good ? @sathyap-chrome @sathya-nujella fyi.
@lgirdwood i believe this crash can still be replicated with the script. We simply modified CRAS so it won't hit it. It might be worth testing with another component that has a large config payload. (Note this payload is about 60k)
Describe the bug When IGO Noise Reduction was enabled and releases/tgl/v13.0-hotfix3 is applied on firmware. During recording with "Online Voice Recorder", there was DSP panic found when plugging/unplugging a headset.
To Reproduce
Note that script is not reproducible.
Reproduction Rate ~1/20
Expected behavior The recording is always valid.
Impact DSP panic occurred and user needs reboot to recover.
Environment Google private tgl-13 branch with the checkout of https://chrome-internal-review.googlesource.com/c/chromeos/third_party/sound-open-firmware-private/+/4116509/1 And with the latest IGO proprietary library
Screenshots or console output sof-audio-pci 0000:00:1f.3: error : DSP panic! sof-audio-pci 0000:00:1f.3: status: fw entered - code 00000005 sof-audio-pci 0000:00:1f.3: error: runtime exception sof-audio-pci 0000:00:1f.3: error: trace point 00004000 sof-audio-pci 0000:00:1f.3: error: panic at :0 sof-audio-pci 0000:00:1f.3: error: DSP Firmware Oops sof-audio-pci 0000:00:1f.3: error: Exception Cause: LoadStorePIFDataErrorCause, Synchronous PIF data error during LoadStore access sof-audio-pci 0000:00:1f.3: EXCCAUSE 0x0000000d EXCVADDR 0x0000bb88 PS 0x00060b25 SAR 0x00000000 sof-audio-pci 0000:00:1f.3: EPC1 0xbe02e437 EPC2 0xbe02d7e6 EPC3 0x00000000 EPC4 0x00000000
sof-audio-pci 0000:00:1f.3: EPC5 0x00000000 EPC6 0x00000000 EPC7 0x00000000 DEPC 0x00000000
sof-audio-pci 0000:00:1f.3: EPS2 0x00060120 EPS3 0x00000000 EPS4 0x00000000 EPS5 0x00000000
sof-audio-pci 0000:00:1f.3: EPS6 0x00000000 EPS7 0x00000000 INTENABL 0x00000000 INTERRU 0x00000262
sof-audio-pci 0000:00:1f.3: stack dump from 0xbe143d40 sof-audio-pci 0000:00:1f.3: 0xbe143d40: 00000000 3ff00000 3fe1feb3 41b00000 sof-audio-pci 0000:00:1f.3: 0xbe143d44: 00000000 00000000 00000000 00000000 sof-audio-pci 0000:00:1f.3: 0xbe143d48: 00000000 00000000 00000000 00000000 sof-audio-pci 0000:00:1f.3: 0xbe143d4c: f66ba030 ffff9667 01cebdc0 ffffaed4 sof-audio-pci 0000:00:1f.3: 0xbe143d50: 00000000 00000000 b641b700 ffffffff sof-audio-pci 0000:00:1f.3: 0xbe143d54: b05d7828 ffff9667 000c0800 00000000 sof-audio-pci 0000:00:1f.3: 0xbe143d58: 00000000 00000000 01cebe50 ffffaed4 sof-audio-pci 0000:00:1f.3: 0xbe143d5c: 01cebe78 ffffaed4 c05ac157 ffffffff
messages_panic_at_intl_mic.txt messages_panic_at_headset.txt