thesofproject / sof

Sound Open Firmware
Other
548 stars 315 forks source link

[BUG][TGL][Chrome] DSP Panic with large config payload (was plugging/unplugging a headset while using "Online Voice Recorder") #4769

Closed johnylin76 closed 1 year ago

johnylin76 commented 3 years ago

Describe the bug When IGO Noise Reduction was enabled and releases/tgl/v13.0-hotfix3 is applied on firmware. During recording with "Online Voice Recorder", there was DSP panic found when plugging/unplugging a headset.

To Reproduce

  1. Open "online voice recorder" on Chrome browser.
  2. Start recording and say something (input: internal mic)
  3. Plug in 3.5mm jack headset and say something (input: headset mic)
  4. Unplug headset and say something (input: internal mic)
  5. Repeat 3 and 4 again and again

Note that script is not reproducible.

Reproduction Rate ~1/20

Expected behavior The recording is always valid.

Impact DSP panic occurred and user needs reboot to recover.

Environment Google private tgl-13 branch with the checkout of https://chrome-internal-review.googlesource.com/c/chromeos/third_party/sound-open-firmware-private/+/4116509/1 And with the latest IGO proprietary library

Screenshots or console output sof-audio-pci 0000:00:1f.3: error : DSP panic! sof-audio-pci 0000:00:1f.3: status: fw entered - code 00000005 sof-audio-pci 0000:00:1f.3: error: runtime exception sof-audio-pci 0000:00:1f.3: error: trace point 00004000 sof-audio-pci 0000:00:1f.3: error: panic at :0 sof-audio-pci 0000:00:1f.3: error: DSP Firmware Oops sof-audio-pci 0000:00:1f.3: error: Exception Cause: LoadStorePIFDataErrorCause, Synchronous PIF data error during LoadStore access sof-audio-pci 0000:00:1f.3: EXCCAUSE 0x0000000d EXCVADDR 0x0000bb88 PS 0x00060b25 SAR 0x00000000 sof-audio-pci 0000:00:1f.3: EPC1 0xbe02e437 EPC2 0xbe02d7e6 EPC3 0x00000000 EPC4 0x00000000
sof-audio-pci 0000:00:1f.3: EPC5 0x00000000 EPC6 0x00000000 EPC7 0x00000000 DEPC 0x00000000
sof-audio-pci 0000:00:1f.3: EPS2 0x00060120 EPS3 0x00000000 EPS4 0x00000000 EPS5 0x00000000
sof-audio-pci 0000:00:1f.3: EPS6 0x00000000 EPS7 0x00000000 INTENABL 0x00000000 INTERRU 0x00000262
sof-audio-pci 0000:00:1f.3: stack dump from 0xbe143d40 sof-audio-pci 0000:00:1f.3: 0xbe143d40: 00000000 3ff00000 3fe1feb3 41b00000 sof-audio-pci 0000:00:1f.3: 0xbe143d44: 00000000 00000000 00000000 00000000 sof-audio-pci 0000:00:1f.3: 0xbe143d48: 00000000 00000000 00000000 00000000 sof-audio-pci 0000:00:1f.3: 0xbe143d4c: f66ba030 ffff9667 01cebdc0 ffffaed4 sof-audio-pci 0000:00:1f.3: 0xbe143d50: 00000000 00000000 b641b700 ffffffff sof-audio-pci 0000:00:1f.3: 0xbe143d54: b05d7828 ffff9667 000c0800 00000000 sof-audio-pci 0000:00:1f.3: 0xbe143d58: 00000000 00000000 01cebe50 ffffaed4 sof-audio-pci 0000:00:1f.3: 0xbe143d5c: 01cebe78 ffffaed4 c05ac157 ffffffff

messages_panic_at_intl_mic.txt messages_panic_at_headset.txt

keyonjie commented 3 years ago

@bkokoszx @mwasko @abonislawski FYI. unfortunately another one.

keyonjie commented 3 years ago

@johnylin76 do you have chance to capture the FW log? @1994lwz can you try to reproduce it?

lgirdwood commented 3 years ago

Prior to the crash we have a lot of IPC errors and attempts at setting a large config to component 57. What is component 57 ?

2021-09-15T05:38:36.475685Z WARNING cras_server[1850]: No chmap queried! Skip chmap set
2021-09-15T05:38:36.581500Z ERR kernel: [  853.364841] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:36.581518Z ERR kernel: [  853.364845] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:38:40.095015Z ERR kernel: [  856.876570] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:40.095029Z ERR kernel: [  856.876574] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:38:47.076017Z ERR kernel: [  863.854180] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:47.076031Z ERR kernel: [  863.854184] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:38:50.020019Z ERR kernel: [  866.796418] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:50.020033Z ERR kernel: [  866.796422] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:38:52.253016Z ERR kernel: [  869.028375] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:52.253032Z ERR kernel: [  869.028379] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:38:55.737015Z ERR kernel: [  872.510302] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:55.737029Z ERR kernel: [  872.510306] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:38:57.770043Z ERR kernel: [  874.542636] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:57.770057Z ERR kernel: [  874.542640] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:01.463014Z ERR kernel: [  878.233582] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:01.463028Z ERR kernel: [  878.233585] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:03.458016Z ERR kernel: [  880.227898] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:03.458029Z ERR kernel: [  880.227903] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:03.372964Z WARNING cras_server[1850]: message repeated 5 times: [ No chmap queried! Skip chmap set]
2021-09-15T05:39:04.521931Z INFO ARCVM(32)[4802]: [  349.525556] healthd: battery l=100 v=13080 t=0.0 h=1 st=4 c=0 fc=4085000 cc=0 chg=a
2021-09-15T05:39:06.815017Z ERR kernel: [  883.582875] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:06.815035Z ERR kernel: [  883.582879] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:08.682024Z WARNING cras_server[1850]: No chmap queried! Skip chmap set
2021-09-15T05:39:08.766045Z ERR kernel: [  885.532714] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:08.766057Z ERR kernel: [  885.532719] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:11.680015Z ERR kernel: [  888.445989] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:11.680029Z ERR kernel: [  888.445994] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:13.603026Z ERR kernel: [  890.367241] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:13.603040Z ERR kernel: [  890.367245] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:16.561020Z ERR kernel: [  893.323743] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:16.561042Z ERR kernel: [  893.323747] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:13.516979Z WARNING cras_server[1850]: No chmap queried! Skip chmap set
2021-09-15T05:39:17.221101Z INFO ARCVM(32)[4802]: [  362.218463] healthd: battery l=100 v=13080 t=0.0 h=1 st=4 c=0 fc=4085000 cc=0 chg=a
2021-09-15T05:39:19.047926Z WARNING cras_server[1850]: No chmap queried! Skip chmap set
2021-09-15T05:39:19.131016Z ERR kernel: [  895.892874] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:19.131032Z ERR kernel: [  895.892879] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:22.043015Z ERR kernel: [  898.803244] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:22.043030Z ERR kernel: [  898.803248] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:24.585019Z ERR kernel: [  901.344545] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:24.585036Z ERR kernel: [  901.344549] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:24.499112Z WARNING cras_server[1850]: No chmap queried! Skip chmap set
2021-09-15T05:39:29.857847Z NOTICE temp_logger[6248]:  x86_pkg_temp:43C INT3400_Thermal:20C TSR0:35C TSR1:33C TSR2:37C TSR3:35C TCPU:45C iwlwifi_1:34C PL1:17.000W
2021-09-15T05:39:30.401768Z WARNING cras_server[1850]: No chmap queried! Skip chmap set
2021-09-15T05:39:30.488021Z ERR kernel: [  907.244012] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:30.488034Z ERR kernel: [  907.244016] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:33.362018Z ERR kernel: [  910.116573] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:33.362032Z ERR kernel: [  910.116578] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:35.593991Z ERR kernel: [  912.347471] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:35.594012Z ERR kernel: [  912.347475] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:38.566053Z ERR kernel: [  915.318051] sof-audio-pci 0000:00:1f.3: error : DSP panic!
2021-09-15T05:39:38.566071Z ERR kernel: [  915.318070] sof-audio-pci 0000:00:1f.3: status: fw entered - code 00000005
2021-09-15T05:39:38.566071Z ERR kernel: [  915.318158] sof-audio-pci 0000:00:1f.3: error: runtime exception
2021-09-15T05:39:38.566072Z ERR kernel: [  915.318159] sof-audio-pci 0000:00:1f.3: error: trace point 00004000
2021-09-15T05:39:38.566073Z ERR kernel: [  915.318160] sof-audio-pci 0000:00:1f.3: error: panic at :0
2021-09-15T05:39:38.566074Z ERR kernel: [  915.318162] sof-audio-pci 0000:00:1f.3: error: DSP Firmware Oops
2021-09-15T05:39:38.566080Z ERR kernel: [  915.318163] sof-audio-pci 0000:00:1f.3: error: Exception Cause: LoadStorePIFDataErrorCause, Synchronous PIF data error during LoadStore access
2021-09-15T05:39:38.566081Z ERR kernel: [  915.318164] sof-audio-pci 0000:00:1f.3: EXCCAUSE 0x0000000d EXCVADDR 0x0000bb88 PS       0x00060b25 SAR     0x00000000
2021-09-15T05:39:38.566082Z ERR kernel: [  915.318165] sof-audio-pci 0000:00:1f.3: EPC1     0xbe02e437 EPC2     0xbe02d7e6 EPC3     0x00000000 EPC4    0x00000000
2021-09-15T05:39:38.566082Z ERR kernel: [  915.318167] sof-audio-pci 0000:00:1f.3: EPC5     0x00000000 EPC6     0x00000000 EPC7     0x00000000 DEPC    0x00000000
2021-09-15T05:39:38.566084Z ERR kernel: [  915.318168] sof-audio-pci 0000:00:1f.3: EPS2     0x00060120 EPS3     0x00000000 EPS4     0x00000000 EPS5    0x00000000
2021-09-15T05:39:38.566084Z ERR kernel: [  915.318169] sof-audio-pci 0000:00:1f.3: EPS6     0x00000000 EPS7     0x00000000 INTENABL 0x00000000 INTERRU 0x00000262
2021-09-15T05:39:38.566085Z ERR kernel: [  915.318170] sof-audio-pci 0000:00:1f.3: stack dump from 0xbe143d40
2021-09-15T05:39:38.566085Z ERR kernel: [  915.318172] sof-audio-pci 0000:00:1f.3: 0xbe143d40: 00000000 3ff00000 3fe1feb3 41b00000
2021-09-15T05:39:38.566086Z ERR kernel: [  915.318173] sof-audio-pci 0000:00:1f.3: 0xbe143d44: 00000000 00000000 00000000 00000000
2021-09-15T05:39:38.566086Z ERR kernel: [  915.318174] sof-audio-pci 0000:00:1f.3: 0xbe143d48: 00000000 00000000 00000000 00000000
2021-09-15T05:39:38.566087Z ERR kernel: [  915.318176] sof-audio-pci 0000:00:1f.3: 0xbe143d4c: f66ba030 ffff9667 01cebdc0 ffffaed4
2021-09-15T05:39:38.566088Z ERR kernel: [  915.318177] sof-audio-pci 0000:00:1f.3: 0xbe143d50: 00000000 00000000 b641b700 ffffffff
2021-09-15T05:39:38.566088Z ERR kernel: [  915.318178] sof-audio-pci 0000:00:1f.3: 0xbe143d54: b05d7828 ffff9667 000c0800 00000000
2021-09-15T05:39:38.566089Z ERR kernel: [  915.318179] sof-audio-pci 0000:00:1f.3: 0xbe143d58: 00000000 00000000 01cebe50 ffffaed4
2021-09-15T05:39:38.566090Z ERR kernel: [  915.318180] sof-audio-pci 0000:00:1f.3: 0xbe143d5c: 01cebe78 ffffaed4 c05ac157 ffffffff
lgirdwood commented 3 years ago

oh it appears the oops is after the set large config to comp 57 (see time delay)

2021-09-15T05:39:35.594012Z ERR kernel: [  912.347475] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:38.566053Z ERR kernel: [  915.318051] sof-audio-pci 0000:00:1f.3: error : DSP panic!

The other crashes show similar delays with previous IPC errors.

Can we reproduce with kernel IPC logs ON and FW logs ON ?

1994lwz commented 3 years ago

This issue is reproduce on the chrome platform at SHZ1 site.

[core dump]

# Core header:

# arch        00000001 # totalsize   0000017c # stackptr    be143d40 # stackoffset 000001a8
# configidhi  c2f3fbfe # configidlo  1cc6c29b # numaregs    00000040

# CPU registers:

# exccause    0000000d # excvaddr    0000bb88 # ps          00060225
# epc1        be02e437 # epc2        be02d7e6 # epc3        00000000 # epc4        00000000
# epc5        00000000 # epc6        00000000 # epc7        00000000
# eps2        00060a20 # eps3        00000000 # eps4        00000000 # eps5        00000000
# eps6        00000000 # eps7        00000000
# depc        00000000 # intenable   00000000 # interrupt   00000262 # sar         00000000
# debugcause  00000000
# windowbase  0000000e # windowstart 00005400
# excsave1    00000000
# ar56        be04a59f # ar57        be143d50 # ar58        be00881c # ar59        1cc6c29b
# ar60        c2f3fbfe # ar61        0000017c # ar62        00000001 # ar63        be143d40
# ar0         be04a51f # ar1         be143d30 # ar2         be008f28 # ar3         00000016
# ar4         00000028 # ar5         00000538 # ar6         9e266400 # ar7         00000004
# ar8         be03ec8b # ar9         be143d00 # ar10        be0089a8 # ar11        be143d40
# ar12        00000538 # ar13        00000800 # ar14        be2d2480 # ar15        00000003
# ar16        be03ebfb # ar17        be143ce0 # ar18        be0089a8 # ar19        be144278
# ar20        00000000 # ar21        be008ee0 # ar22        be02da67 # ar23        be146b30
# ar24        be03ebfb # ar25        be143cc0 # ar26        be00a670 # ar27        00000000
# ar28        00000000 # ar29        be00ae70 # ar30        00000005 # ar31        00000005
# ar32        be144170 # ar33        be144150 # ar34        be2d8e00 # ar35        00000000
# ar36        be144180 # ar37        00000000 # ar38        be03dc34 # ar39        00060826
# ar40        fe04fd68 # ar41        be143d90 # ar42        be143dc0 # ar43        0000000d
# ar44        00000000 # ar45        9e2664d0 # ar46        9e266480 # ar47        be2d8e4c
# ar48        be02e449 # ar49        be143d60 # ar50        0dead006 # ar51        00000000
# ar52        be143d90 # ar53        00000000 # ar54        00000001 # ar55        00000001

# windowbase: E
#               0 1 2
# windowstart: b101010000000000

#      reg         a0         a1
#                  (return)   (sptr)
#      ---         --------   -------
#  0 # ar56        be04a59f   be143d50
#  1 # ar48        be02e449   be143d60
#  2 # ar40        fe04fd68   be143d90

# Stack dumped from be143d40 dwords num 406

# *EXCEPTION*

# exccause: LoadStorePIFDataErrorCause: ynchronous PIF data error during LoadStore access
# excvaddr: 48008
johnylin76 commented 3 years ago

@johnylin76 do you have chance to capture the FW log?

Here are the logs with kernel IPC logs ON and FW logs ON: sof-logger.txt messages.txt

lgirdwood commented 3 years ago

The logs show 2 things 1) What appears to be heavy load on the DSP, as the scheduler cant shedule the agent on time ?

TIMESTAMP         (us)              DELTA  C# COMPONENT          LOCATION                      CONTENT  ktime=370.318s  @  2021-09-17 17:42:59 +0800 CST

Found valid LDC address after skipping 48 bytes (one line uses 20 + 0 to 16 bytes)
[           0.000000] (           0.000000) c0 unknown      0.0            src/lib/agent.c:79   WARN validate(), ll drift detected, delta = 51426
[   231528321.320724] (   231528320.000000) c0 sa                          src/lib/agent.c:79   WARN validate(), ll drift detected, delta = 51273
[   231568316.631635] (       39995.312500) c0 sa                          src/lib/agent.c:79   WARN validate(), ll drift detected, delta = 51290
[   231608321.109212] (       40004.476562) c0 sa                          src/lib/agent.c:79   WARN validate(), ll drift detected, delta = 51377
[   231648337.305539] (       40016.195312) c0 sa                          src/lib/agent.c:79   WARN validate(), ll drift detected, delta = 51574
[   231688318.553950] (       39981.250000) c0 sa                          src/lib/agent.c:79   WARN validate(), ll drift detected, delta = 51322
[   231728324.073194] (       40005.519531) c0 sa                          src/lib/agent.c:79   WARN validate(), ll drift detected, delta = 51435

But the delay is consistent, meaning we are likely scheduling agent in a timely manner but perhaps with the wrong clock or timebase.

2) We are doing a lot (~200) IPC configuration messages prior to the crash.

2021-09-17T09:38:57.731047Z DEBUG kernel: [  128.874270] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.732027Z DEBUG kernel: [  128.874692] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.732037Z DEBUG kernel: [  128.874705] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.732040Z DEBUG kernel: [  128.875252] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.732050Z DEBUG kernel: [  128.875301] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.733017Z DEBUG kernel: [  128.875705] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.733030Z DEBUG kernel: [  128.875716] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.733037Z DEBUG kernel: [  128.876259] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.733040Z DEBUG kernel: [  128.876299] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.734022Z DEBUG kernel: [  128.876699] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.734030Z DEBUG kernel: [  128.876710] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.734040Z DEBUG kernel: [  128.877260] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.734042Z DEBUG kernel: [  128.877293] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.735041Z DEBUG kernel: [  128.877708] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.735044Z DEBUG kernel: [  128.877730] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.736014Z ERR kernel: [  128.878698] sof-audio-pci 0000:00:1f.3: error : DSP panic!
2021-09-17T09:38:57.736019Z DEBUG kernel: [  128.878702] sof-audio-pci 0000:00:1f.3: panic: dsp_oops_offset 788480 offset 788480
2021-09-17T09:38:57.736020Z ERR kernel: [  128.878718] sof-audio-pci 0000:00:1f.3: status: fw entered - code 00000005
2021-09-17T09:38:57.736021Z ERR kernel: [  128.878806] sof-audio-pci 0000:00:1f.3: error: runtime exception

We are not doing anything else at this time so could be a race when we are copying config data.

johnylin76 commented 3 years ago

This issue is reproduce on the chrome platform at SHZ1 site.

Hi @1994lwz , may I know which device do you use? Delbin? Drobit? Do you know the phase (Proto/DVT/EVT) and the SKU of your device? Thanks.

1994lwz commented 3 years ago

This issue is reproduce on the chrome platform at SHZ1 site.

Hi @1994lwz , may I know which device do you use? Delbin? Drobit? Do you know the phase (Proto/DVT/EVT) and the SKU of your device? Thanks.

the device is Delbin EVT SKU2.

lgirdwood commented 3 years ago

@1994lwz @bkokoszx any update or findings so far ?

bkokoszx commented 3 years ago

@lgirdwood Not yet, I was debugging problem with D0i3->D0 multicore transition.

Is there any way to reproduce this issue by cmd line?

johnylin76 commented 3 years ago

Unfortunately I still cannot make a script to successfully reproduce the issue with "cras_test_client --plug ::<0|1>" commands. It seems like a complicated problem that cannot be created by the simple combination of commands.

However, I can still reproduce the issue on my Drobit device by the manual steps, although it is also not easy to make it happen. My strategy is rebooting the device and trying again when I failed to reproduce the issue after 30~40 times of headset plug-and-unplug iterations. (I rebooted 3 times and finally hit it...)

Attached logs: messages.txt sof-logger.txt

The log from sof-logger looks weird and unfinished, but it is indeed the whole log I could retrieve after hitting the issue.

bkokoszx commented 3 years ago

@johnylin76 Are you able to provide core dump file (after reproduction) with corresponding firmware .lst file (sof-tgl.lst - one of a building artifact)?

johnylin76 commented 3 years ago

@bkokoszx Do you mean the core dump file generated by the method as mentioned in sof/issues/4648? I have generated one. Please see as attached:

messages.txt sof-logger.txt oops_coredump.txt

bkokoszx commented 3 years ago

@johnylin76 Yes, but additionally .lst file will be helpful (sof-tgl.lst). It contains disassembly of the code, so that we can associate addresses from coredump file provided by you with the code. .lst file is one of the build artifacts (in /src/arch/xtensa/).

ranj063 commented 3 years ago

here's a tentative fix for this issue: https://github.com/thesofproject/sof/pull/4833

Still being tested

cujomalainey commented 3 years ago

here's a tentative fix for this issue: #4833

Where would the crash happen here? I am agree it is a good fix but I am having trouble reasoning out the race that would leave the lib with either freed memory or an incorrect size.

ranj063 commented 3 years ago

Where would the crash happen here? I am agree it is a good fix but I am having trouble reasoning out the race that would leave the lib with either freed memory or an incorrect size.

@cujomalainey I'm not certain what could cause the crash given that we're likely sending the same blob. So, theoretically the size should remain the same. But from the logs we see that there are 2 blobs (or the same blob senttwice) sent during headset removal/pug-in. Do you know what those 2 blobs are for?

cujomalainey commented 3 years ago

@ranj063 we know they are the hotword models, but something is being weird in how they are sent. Previously on 012 this was not an issue. FYI i now have a script you can run to repro this. Repro times range from 5s to 10min

cras_test_client --capture_f /dev/null &
cras_test_client --listen /dev/null &
while true; do cras_test_client --set_hotword_model 15:0:null & cras_test_client --set_hotword_model 15:0:en_all; done
ranj063 commented 3 years ago

@cujomalainey do you mind trying my PR to see if the panic still hapens on your device?

cujomalainey commented 3 years ago

I don't have the latest IGO lib. @johnylin76 can test. He will be coming online soon.

ranj063 commented 3 years ago

I don't have the latest IGO lib. @johnylin76 can test. He will be coming online soon.

with my PR i'm able to hit the issue much sooner than without

bkokoszx commented 3 years ago

@ranj063 I've tried to reproduce panic remotely using @cujomalainey script and I did not hit reproduction. Only I saw ipc errors:

[  290.425084] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
[  290.425086] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
[  290.426537] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
[  290.426539] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57

Update: I think I hit the repro. I will double check and let you know.

cujomalainey commented 3 years ago

According to @bzhg when he was checking crash, the actual crash is inside the microspeech lib on a load, but he can't tell what it is loading. I am betting we are pulling memory out from under its feet and causing it to jump into oblivion. As to how I am not sure.

ranj063 commented 3 years ago

According to @bzhg when he was checking crash, the actual crash is inside the microspeech lib on a load, but he can't tell what it is loading. I am betting we are pulling memory out from under its feet and causing it to jump into oblivion. As to how I am not sure.

@cujomalainey do you think that instead of realloc'ing the blob_handler->data pointer to a new pointer everytime, allocating data with a max size and only updating it would help solve the issue?

cujomalainey commented 3 years ago

@ranj063 it would be an interesting data point. I wonder if maybe we need to add trace errors and flushes to see if we are maybe hitting those paths on a crash

bkokoszx commented 3 years ago

It loos like with @ranj063 change (#4833) still I can reproduce DSP panic. The backtrace retrieved from coredumper is as follows:

#0  0xbe05fd00 in literals ()
#1  0xbe049a6f in dump_stack (p=3187705884, addr=0x1cc6c29b, offset=3270769662, limit=380, stack_ptr=0x1) at /home/admin/Documents/sof/sof/src/arch/xtensa/include/arch/lib/cache.h:79
#2  0xbe049aef in panic_dump (p=233492486, panic_info=0x0, data=0xbe142d90 <primary_core_data+3216>) at /home/admin/Documents/sof/sof/src/arch/xtensa/include/arch/debug/panic.h:45
#3  0xbe02e449 in exception () at /home/admin/Documents/sof/sof/src/arch/xtensa/init.c:115
#4  0xbe04f2e8 in _GeneralException ()
#5  0xbe06069b in GoogleHotwordProcessSamples ()
#6  0xbe04cfc3 in GoogleHotwordDspProcess ()
#7  0xbe03963b in ghd_detect (dev=0xbe2d8d00, stream=0x2, samples=0xbe177ac8, bytes=640) at /home/admin/Documents/sof/sof/src/audio/google_hotword_detect.c:361
#8  0xbe039734 in ghd_copy (dev=0xbe2d8d00) at /home/admin/Documents/sof/sof/src/audio/google_hotword_detect.c:426
#9  0xbe03dbc2 in comp_copy (dev=0xbe2d8d00) at /home/admin/Documents/sof/sof/src/include/sof/audio/component_ext.h:203
#10 0xbe03dcf2 in pipeline_comp_copy (current=0xbe2d8d00, calling_buf=0xbe143180 <primary_core_data+4224>, ctx=0xbe143180 <primary_core_data+4224>, dir=0)
#11 0xbe03db05 in pipeline_for_each_comp (current=0xbe2d8d00, ctx=0xbe143180 <primary_core_data+4224>, dir=0) at /home/admin/Documents/sof/sof/src/audio/pipeline/pipeline-graph.c:431
#12 0xbe03dcde in pipeline_comp_copy (current=0xbe2d8e00, calling_buf=0xbe143180 <primary_core_data+4224>, ctx=0xbe143180 <primary_core_data+4224>, dir=0)
#13 0xbe03dd5c in pipeline_copy (p=0xbe2d2500) at /home/admin/Documents/sof/sof/src/audio/pipeline/pipeline-stream.c:165
#14 0xbe03e7b3 in pipeline_task (arg=0xbe2d2500) at /home/admin/Documents/sof/sof/src/audio/pipeline/pipeline-schedule.c:58
#15 0xbe04aa61 in schedule_ll_tasks_execute (sch=0xbe250180) at /home/admin/Documents/sof/sof/src/include/sof/schedule/task.h:88
#16 0xbe04ab51 in schedule_ll_tasks_run (data=0xbe250180) at /home/admin/Documents/sof/sof/src/schedule/ll_schedule.c:231
#17 0xbe041cfd in irq_lvl2_level2_handler (data=0x400000) at /home/admin/Documents/sof/sof/src/drivers/intel/cavs/interrupt.c:87
#18 0xbe04fe20 in _Level2FromVector ()
#19 0xbe02d7e6 in arch_wait_for_interrupt (level=0) at /home/admin/Documents/sof/sof/src/arch/xtensa/include/arch/lib/wait.h:40
johnylin76 commented 3 years ago

Hi @bkokoszx do you reproduce by script or manual plug/unplug headset? I adapted #4483 commit and tried to reproduce by manual plug/unplug method in these 2 days, but I didn't hit the panic issue anymore. (however it could just be the luck..)

ranj063 commented 3 years ago

@johnylin76 is the PR nunmber correct? 4483 seems related to zephyr?

bkokoszx commented 3 years ago

@johnylin76 I was trying to reproduce by script.

sathya-nujella commented 3 years ago

Hi @bkokoszx do you reproduce by script or manual plug/unplug headset? I adapted #4483 commit and tried to reproduce by manual plug/unplug method in these 2 days, but I didn't hit the panic issue anymore. (however it could just be the luck..)

My observation matches with @johnylin76 's. I applied only PR: https://github.com/thesofproject/sof/pull/4833 .

I can not reproduce "DSP panic issue" in manual test.

Manual test steps are same as the one mentioned in bug description + GHW running in background..

johnylin76 commented 3 years ago

Sorry I mean #4833, and same test steps as @sathya-nujella (as bug description + GHW running in background)

mwasko commented 3 years ago

4 0xbe04f2e8 in _GeneralException ()

5 0xbe06069b in GoogleHotwordProcessSamples ()

6 0xbe04cfc3 in GoogleHotwordDspProcess ()

7 0xbe03963b in ghd_detect (dev=0xbe2d8d00, stream=0x2, samples=0xbe177ac8, bytes=640) at /home/admin/Documents/sof/sof/src/audio/google_hotword_detect.c:361

8 0xbe039734 in ghd_copy (dev=0xbe2d8d00) at /home/admin/Documents/sof/sof/src/audio/google_hotword_detect.c:426

9 0xbe03dbc2 in comp_copy (dev=0xbe2d8d00) at /home/admin/Documents/sof/sof/src/include/sof/audio/component_ext.h:203

@johnylin76, @cujomalainey based on coredump backtrace the panic occurs in Google Hotword module on sample processing. Can we assign Google Hotword developer to involve in issue debug and help to find issue root cause?

bkokoszx commented 3 years ago

I've added following trace points to the code:

diff --git a/src/audio/google_hotword_detect.c b/src/audio/google_hotword_detect.c
index a2b5a18..1052fe5 100644
--- a/src/audio/google_hotword_detect.c
+++ b/src/audio/google_hotword_detect.c
@@ -235,6 +235,8 @@ static int ghd_ctrl_set_bin_data(struct comp_dev *dev,
        struct comp_data *cd = comp_get_drvdata(dev);
        int ret;

+       trace_point(0x5000);
+
        switch (cdata->data->type) {
        case GOOGLE_HOTWORD_DETECT_MODEL:
                // comp_info(dev, "ghd_ctrl_set_bin_data(): comp_data_blob_set_cmd");
@@ -247,6 +249,8 @@ static int ghd_ctrl_set_bin_data(struct comp_dev *dev,
                         cdata->data->type);
                return -EINVAL;
        }
+
+       trace_point(0x5001);
 }

 static int ghd_ctrl_set_data(struct comp_dev *dev,
@@ -341,6 +345,8 @@ static void ghd_detect(struct comp_dev *dev,
        uint32_t sample_bytes;
        int ret;

+       trace_point(0x6002);
+
        if (cd->detected)
                return;

@@ -356,10 +362,14 @@ static void ghd_detect(struct comp_dev *dev,
                cd->history_bytes += bytes;
        }

+       trace_point(0x6003);
+
        comp_dbg(dev, "GoogleHotwordDspProcess(0x%x, %u)",
                 (uint32_t)samples, bytes / sample_bytes);
        ret = GoogleHotwordDspProcess(samples, bytes / sample_bytes,
                                      &preamble_length_ms);
+       trace_point(0x6004);
+
        if (ret == 1) {
                cd->detected = 1;

@@ -390,6 +400,8 @@ static int ghd_copy(struct comp_dev *dev)
        uint32_t bytes, tail_bytes, head_bytes = 0;
        int ret;

+       trace_point(0x6000);
+
        comp_info(dev, "ghd_copy()");

        /* Check for new model */
@@ -424,6 +436,8 @@ static int ghd_copy(struct comp_dev *dev)
        else
                head_bytes = bytes - tail_bytes;

+       trace_point(0x6001);
+
        if (tail_bytes)
                ghd_detect(dev, stream, stream->r_ptr, tail_bytes);
        if (head_bytes)
@@ -432,6 +446,8 @@ static int ghd_copy(struct comp_dev *dev)
        /* calc new available */
        comp_update_buffer_consume(source, bytes);

+       trace_point(0x6005);
+
        return 0;
 }

And after reproduction with script: https://github.com/thesofproject/sof/issues/4769#issuecomment-931755456 and firmware with applied #4833, I got following regs hexdump:

localhost /sys/kernel/debug/sof # hexdump regs
0000000 0005 0000 d006 0dea 6003 0000 0000 0000
0000010 0000 0000 4000 0000 01c2 0251 0000 0000
0000020 0000 0000 0000 0000 0000 0000 0000 0000
*
0001000

The last trace point is equal to 0x6300, which confirms (due to backtrace) that firmware gets stuck somewhere in GoogleHotwordDspProcess(). At the moment I do not know why, but maybe there is some problem with model data - it requires more debugging.

cujomalainey commented 3 years ago

I believe that team is quite busy right now but we will see what we can do. Can we do a trace to verify that the memory it owns does not get tampered with?

bkokoszx commented 3 years ago

If you would like to check values of specific variable you can do it using e.g. trace_points:

diff --git a/src/audio/google_hotword_detect.c b/src/audio/google_hotword_detect.c
index 8c46eaa785c8..1f3da320ebe3 100644
--- a/src/audio/google_hotword_detect.c
+++ b/src/audio/google_hotword_detect.c
@@ -343,6 +343,8 @@ static void ghd_detect(struct comp_dev *dev,
        if (cd->detected)
                return;

+       mailbox_sw_reg_write(SRAM_REG_DEBUG_WIN_1, bytes);
+
        /* Assuming 1 channel, verified in ghd_params.
         *
         * TODO Make the logic multi channel safe when new hotword library can
diff --git a/src/include/kernel/mailbox.h b/src/include/kernel/mailbox.h
index c9fad0885b69..f6e707c8be02 100644
--- a/src/include/kernel/mailbox.h
+++ b/src/include/kernel/mailbox.h
@@ -36,7 +36,8 @@
 #define SRAM_REG_R0_STATE_TRACE                        SRAM_REG_R_STATE_TRACE_BASE
 #define SRAM_REG_R1_STATE_TRACE                        0x38
 #define SRAM_REG_R2_STATE_TRACE                        0x40
-#define SRAM_REG_FW_END                                (SRAM_REG_R2_STATE_TRACE + 0x8)
+#define SRAM_REG_DEBUG_WIN_1        0x48
+#define SRAM_REG_FW_END                                (SRAM_REG_R2_STATE_TRACE + 0x4)

 /** @}*/

And dump it by command:

 hexdump /sys/kernel/debug/sof/regs

You can define bigger or several debug "windows" and dump/check bigger part of memory.

keyonjie commented 3 years ago

@sathya-nujella @johnylin76 @cujomalainey @bkokoszx I have a one line change which fixes quite some cache coherency issues when debugging multi-core + dynamic pipelines, can you help to give it quick try if it help: https://github.com/keyonjie/sof/commit/c5d156d6927bb2498d1da26fea5155e675d31375

cujomalainey commented 3 years ago

@bkokoszx thanks for the helpful tip. Will definitely remember to try that out. @johnylin76 can you test @keyonjie's change?

johnylin76 commented 3 years ago

Unfortunately by applying @keyonjie's change it is still easy to reproduce DSP panic issue by script (less than running 1 minute). Here are the logs: messages.txt sof-logger.txt

keyonjie commented 3 years ago

Thanks for trying it out @johnylin76 . From the FW logs we need PR: #4833 @bkokoszx

johnylin76 commented 3 years ago

When I tried I also included PR#4833. According to https://github.com/thesofproject/sof/issues/4769#issuecomment-934368687 DSP panic issue still can be reproduced by script with PR#4833

keyonjie commented 3 years ago

When I tried I also included PR#4833. According to #4769 (comment) DSP panic issue still can be reproduced by script with PR#4833

Sorry didn't notice to this, so let's focus on @bkokoszx 's suggestion and debugging then.

bkokoszx commented 3 years ago

I can also reproduce panic issue with script with @keyonjie change.

bkokoszx commented 3 years ago

@cujomalainey @johnylin76 Did you find any time to look into this panic issue from Google perspective?

cujomalainey commented 3 years ago

@bkokoszx I have not, I am sheriff for my team this week so my time is thin, @johnylin76 might have. I still think this is likely a data corruption issue caused by stressing malloc given the memory map works fine when not constantly being swapped. It might be worth re-enabling the CRC and seeing if we catch some CRC errors.

bkokoszx commented 3 years ago

@cujomalainey Thanks for update

cujomalainey commented 2 years ago

We are testing the CRAS workaround right now, but still underlying error exists

cujomalainey commented 2 years ago

So our workaround prevents the crash, but we have a new crash that is blocking the release

lgirdwood commented 2 years ago

@cujomalainey are both crashes solved now ? I've not seen any updates since the DMA fixes were merged so I'm assuming were are good ? @sathyap-chrome @sathya-nujella fyi.

cujomalainey commented 2 years ago

@lgirdwood i believe this crash can still be replicated with the script. We simply modified CRAS so it won't hit it. It might be worth testing with another component that has a large config payload. (Note this payload is about 60k)