Closed keyonjie closed 3 years ago
it looks sometimes we can't take the DSP out from stalled state at module reload, and then dsp boot fails.
I have tried ~100 kernel builds/tries, but still can't figure out fix for this, the CSR_PWAITMODE bit will have high possibility to be 1 which means we can't take the DSP out from stalled state after module unloaded/reloaded, this issue only happens on BSW but on BYT or CHT.
I also tried SST ATOM driver, which didn't support module unload/reload.
looks the PCI config register is inaccessable, I can't set the DSP to D3hot state.
@keyonjie have you tried on other non-chrome CHT devices? I can't recall seeing those problems.
@plbossart no, the BSW_CYN_MAX98090 is the only BSW I have ever touched.
BTW, I have a fix for this issue now, basically I need to send IPC to FW to inform about this driver module removal, working on refining it.
@plbossart no, the BSW_CYN_MAX98090 is the only BSW I have ever touched.
BTW, I have a fix for this issue now, basically I need to send IPC to FW to inform about this driver moduel removal, working on refining it.
if these tests work on MinnowBoard and other BYT devices, I am not sure why the extra IPC is needed. For me the issues on Cyan where due to GPIO stuff.
@plbossart no, the BSW_CYN_MAX98090 is the only BSW I have ever touched. BTW, I have a fix for this issue now, basically I need to send IPC to FW to inform about this driver moduel removal, working on refining it.
if these tests work on MinnowBoard and other BYT devices, I am not sure why the extra IPC is needed.
That's true, though I think the extra IPC sounds reasonable, but it does require many regression test efforts on all platforms as this is a core change.
For me the issues on Cyan where due to GPIO stuff.
Um, can you get some information about this? It is an old platform, might be difficult to get support or information about it.
Call for helps here, we might need to get more BSW-CYAN platforms and more developers can involve the debugging, as the issue is much more complicated than expected, basically, there are several kinds of issues behind it:
[ 226.344251] INFO: trying to register non-static key.
[ 226.344260] the code is fine but needs lockdep annotation.
[ 226.344263] turning off the locking correctness validator.
[ 226.344269] CPU: 0 PID: 5220 Comm: rmmod Not tainted 5.4.0-rc7+ #5
[ 226.344272] Hardware name: GOOGLE Cyan/Cyan, BIOS MrChromebox-4.10 08/22/2019
[ 226.344275] Call Trace:
[ 226.344286] dump_stack+0x71/0xa0
[ 226.344294] register_lock_class+0x549/0x550
[ 226.344300] __lock_acquire+0x71/0x1910
[ 226.344307] ? __mutex_lock+0xa0/0x970
[ 226.344311] lock_acquire+0x95/0x190
[ 226.344317] ? init_timer_key+0x110/0x110
[ 226.344322] del_timer_sync+0x2f/0x90
[ 226.344326] ? init_timer_key+0x110/0x110
[ 226.344332] flush_delayed_work+0x13/0x40
[ 226.344355] soc_pcm_private_free+0x17/0x20 [snd_soc_core]
[ 226.344368] snd_pcm_free+0x1a/0x50 [snd_pcm]
[ 226.344379] __snd_device_free+0x46/0x60 [snd]
[ 226.344387] snd_device_free_all+0x3a/0x70 [snd]
[ 226.344395] release_card_device+0x14/0x50 [snd]
[ 226.344401] device_release+0x23/0x80
[ 226.344406] kobject_put+0x84/0x1a0
[ 226.344414] snd_card_free+0x5b/0x90 [snd]
[ 226.344430] soc_cleanup_card_resources+0xbb/0x2c0 [snd_soc_core]
[ 226.344445] snd_soc_unbind_card+0x9f/0xe0 [snd_soc_core]
[ 226.344459] snd_soc_unregister_card+0x1f/0x60 [snd_soc_core]
[ 226.344465] release_nodes+0x1ad/0x200
[ 226.344471] device_release_driver_internal+0xef/0x1c0
[ 226.344475] bus_remove_device+0xed/0x160
[ 226.344479] device_del+0x15e/0x370
[ 226.344484] platform_device_del.part.0+0xe/0x60
[ 226.344488] platform_device_unregister+0x17/0x30
[ 226.344499] snd_sof_device_remove+0x4c/0xa0 [snd_sof]
[ 226.344505] sof_acpi_remove+0x2f/0x40 [snd_sof_acpi]
[ 226.344509] platform_drv_remove+0x1f/0x40
[ 226.344514] device_release_driver_internal+0xdf/0x1c0
[ 226.344518] driver_detach+0x42/0x80
[ 226.344522] bus_remove_driver+0x56/0xca
[ 226.344527] __x64_sys_delete_module+0x18f/0x240
[ 226.344533] ? trace_hardirqs_off_thunk+0x1a/0x20
[ 226.344537] do_syscall_64+0x4b/0x1b0
[ 226.344541] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 226.344546] RIP: 0033:0x7f986e3c01b7
[ 226.344551] Code: 73 01 c3 48 8b 0d d1 8c 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a1 8c 2c 00 f7 d8 64 89 01 48
[ 226.344554] RSP: 002b:00007ffde9e5a6e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 226.344560] RAX: ffffffffffffffda RBX: 00007ffde9e5a748 RCX: 00007f986e3c01b7
[ 226.344562] RDX: 000000000000000a RSI: 0000000000000800 RDI: 00005628ea7ec7d8
[ 226.344565] RBP: 00005628ea7ec770 R08: 00007ffde9e59661 R09: 0000000000000000
[ 226.344568] R10: 00007f986e43ccc0 R11: 0000000000000206 R12: 00007ffde9e5a910
[ 226.344571] R13: 00007ffde9e5b620 R14: 00005628ea7ec260 R15: 00005628ea7ec770
@keyonjie can you post your patches so far so we can help?
@keyonjie can you post your patches so far so we can help?
Sure, doing that this morning.
@ranj063 @wenqingfu with my PR #1489 #1490 and #1491 applied, the hang issue(the "1" issue I mentioned above) is fixed.
For the "2" issue, we need add IPC_CTX_SAVE sending at driver removal.
For "3", need debugging from FW side to see what error happened there.
For "4", need to get a good base from Pierre and Morimoto, and debugging in the driver side.
For the "2" issue, we need add IPC_CTX_SAVE sending at driver removal. @keyonjie what is it that this IPC does that allows the CSR_PWAITMODE reset? Seding the CTX_SAVE IPC during device removal is the opposite of what we do normally which is disable PM and bring the device back to full-power
For the "2" issue, we need add IPC_CTX_SAVE sending at driver removal. @keyonjie what is it that this IPC does that allows the CSR_PWAITMODE reset? Seding the CTX_SAVE IPC during device removal is the opposite of what we do normally which is disable PM and bring the device back to full-power
We will keep the CTX_RESTORE and bring the device back to full-power, here what i mentioned has no relationship with the PM ops, it is in the module removal, e.g. snd_sof_device_remove().
But what you mentioned inspires me about why the issue happens randomly, so my guess is that if we remove the module when the device is in runtime-suspended (which means CTX_SAVE already sent to FW), we will get the CSR_PWAITMODE reset correctly, but if we remove the module when the device is in runtime-active, then we will hit this "2" issue -- CSR_PWAITMODE can't be reset to 0.
@ranj063 RFC PR for "2" is here: https://github.com/thesofproject/linux/pull/1493
For the "2" issue, we need add IPC_CTX_SAVE sending at driver removal. @keyonjie what is it that this IPC does that allows the CSR_PWAITMODE reset? Seding the CTX_SAVE IPC during device removal is the opposite of what we do normally which is disable PM and bring the device back to full-power
We will keep the CTX_RESTORE and bring the device back to full-power, here what i mentioned has no relationship with the PM ops, it is in the module removal, e.g. snd_sof_device_remove().
But what you mentioned inspires me about why the issue happens randomly, so my guess is that if we remove the module when the device is in runtime-suspended (which means CTX_SAVE already sent to FW), we will get the CSR_PWAITMODE reset correctly, but if we remove the module when the device is in runtime-active, then we will hit this "2" issue -- CSR_PWAITMODE can't be reset to 0.
there is no runtime suspend (yet) on Baytrail/Cherrytrail.
there is no runtime suspend (yet) on Baytrail/Cherrytrail.
@plbossart I think that's not fully true, we already have the runtime PM ops in sof-acpi-dev.c, what we are lacking is the platform runtime PM ops only, so I think the sof_suspend() is being called and SOF_IPC_PM_CTX_SAVE is sent to FW there already.
there is no runtime suspend (yet) on Baytrail/Cherrytrail.
@plbossart I think that's not fully true, we already have the runtime PM ops in sof-acpi-dev.c, what we are lacking is the platform runtime PM ops only, so I think the sof_suspend() is being called and SOF_IPC_PM_CTX_SAVE is sent to FW there already.
@keyonjie Sorry thats incorrect. look at https://github.com/thesofproject/linux/blob/fd30230359e5b9106d69eb3cf30ef987921d2cee/sound/soc/sof/pm.c#L341
We do nothing if there's no suspend callback, which is true for BYT
there is no runtime suspend (yet) on Baytrail/Cherrytrail.
@plbossart I think that's not fully true, we already have the runtime PM ops in sof-acpi-dev.c, what we are lacking is the platform runtime PM ops only, so I think the sof_suspend() is being called and SOF_IPC_PM_CTX_SAVE is sent to FW there already.
@keyonjie Sorry thats incorrect. look at
We do nothing if there's no suspend callback, which is true for BYT
Oo, that's right, I didn't realize we even don't have .suspend() ops for BYT.
@ranj063 we do need a clean up here for byt, adding PM ops, runtime PM ops, .remove() ops.
And, we need do the similar things for bdw.c also.
@ranj063 we do need a clean up here for byt, adding PM ops, runtime PM ops, .remove() ops.
@keyonjie yes, i've been looking into it but didnt have a system to try. Once I get my BSW, I will work on it
there is no runtime suspend (yet) on Baytrail/Cherrytrail.
@plbossart I think that's not fully true, we already have the runtime PM ops in sof-acpi-dev.c, what we are lacking is the platform runtime PM ops only, so I think the sof_suspend() is being called and SOF_IPC_PM_CTX_SAVE is sent to FW there already.
@keyonjie Sorry thats incorrect. look at https://github.com/thesofproject/linux/blob/fd30230359e5b9106d69eb3cf30ef987921d2cee/sound/soc/sof/pm.c#L341
We do nothing if there's no suspend callback, which is true for BYT
Oo, that's right, I didn't realize we even don't have .suspend() ops for BYT.
And here is actually tricky, imagine the case we have platform .runtime_suspend() but .suspend(), we should still continue to function instead of return?
@keyonjie the assertion that we have a problem in general for module load/unload is incorrect. I just ran topic/sof-dev w/ the latest firmware on a CHT device w/ rt5640. No issues.
The only issues I have is kernel oopses with the known topology/dai unbind, and interestingly PR #1469 does not work - but that's a side issue.
So please clarify what kernel/firmware you are using and which platforms you tested. If you test on Cyan alone then you don't have enough test coverage, and your conclusions will be invalid.
We may very well have a side issue that's masked on other platforms, but we have to root cause and work in steps.
@plbossart testing it on CHT now...
@plbossart both CHT and minnow board have the similar risk I described on https://github.com/thesofproject/linux/issues/1492, we got multiple entries to the interrupt handler for a single interrupt, though it looks fine at module unload/reload on them. see the dmesg logs below:
On CHT:
[ 6.469040] sof-audio-acpi 808622A8:00: ipc tx: 0x30030000
[ 6.469065] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469091] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469106] Keyon: byt_irq_thread, 210, imrx:0x0, ipcx:0x4000000000000000, ipcd:0x70028800
[ 6.469113] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469119] Keyon: byt_irq_thread, 216
[ 6.469156] Keyon: byt_irq_handler, 192, isr:0x0
[ 6.469171] Keyon: byt_irq_thread, 210, imrx:0x0, ipcx:0x0, ipcd:0x70028800
[ 6.469175] sof-audio-acpi 808622A8:00: ipc tx succeeded: 0x30030000
[ 6.469191] sof-audio-acpi 808622A8:00: sink PGA3.0 control none source BUF3.0
[ 6.469202] sof-audio-acpi 808622A8:00: ipc tx: 0x30030000
[ 6.469223] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469252] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469268] Keyon: byt_irq_thread, 210, imrx:0x0, ipcx:0x4000000000000000, ipcd:0x70028800
[ 6.469276] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469282] Keyon: byt_irq_thread, 216
[ 6.469318] Keyon: byt_irq_handler, 192, isr:0x0
[ 6.469335] sof-audio-acpi 808622A8:00: ipc tx succeeded: 0x30030000
[ 6.469352] sof-audio-acpi 808622A8:00: sink BUF3.1 control none source PGA3.0
[ 6.469355] Keyon: byt_irq_thread, 210, imrx:0x0, ipcx:0x0, ipcd:0x70028800
[ 6.469364] sof-audio-acpi 808622A8:00: ipc tx: 0x30030000
[ 6.469392] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469420] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469447] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469461] Keyon: byt_irq_thread, 210, imrx:0x0, ipcx:0x4000000000000000, ipcd:0x70028800
[ 6.469468] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469474] Keyon: byt_irq_thread, 216
[ 6.469485] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469529] Keyon: byt_irq_thread, 210, imrx:0x0, ipcx:0x0, ipcd:0x70028800
On Minnow board,
[ 12.833394] sof-audio-acpi 80860F28:01: booting DSP firmware
[ 12.835286] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835380] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835415] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835432] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835449] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835466] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835484] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835509] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835532] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835550] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835559] Keyon: byt_irq_thread, 210, imrx:0x0, ipcx:0x4000, ipcd:0x8000000070028800
[ 12.835567] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835591] Keyon: byt_irq_thread, 243
[ 12.835597] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835619] sof-audio-acpi 80860F28:01: ipc rx: 0x70000000
[ 12.835628] sof-audio-acpi 80860F28:01: ipc: DSP is ready 0x70000000 offset 0x144000
[ 12.835657] sof-audio-acpi 80860F28:01: Firmware info: version 1:1:0-f004d
[ 12.835665] sof-audio-acpi 80860F28:01: Firmware: ABI 3:11:0 Kernel ABI 3:11:0
[ 12.835673] sof-audio-acpi 80860F28:01: Firmware debug build 1 on Nov 14 2019-21:16:33 - options:
@keyonjie the results are as follows:
On BYT/CHT w/ EFi BIOS we have no issues with module load/reload (using PR #1504)
On the CHT Cyan device w/ the legacy BIOS, the second modprobe snd-sof-acpi fails. The issue happens with the max98090 codec as well as nocodec mode.
What we'd need is try PR #1504 in a Chrome/Coreboot environment to figure out what is missing.
@keyonjie the results are as follows:
On BYT/CHT w/ EFi BIOS we have no issues with module load/reload (using PR #1504)
On the CHT Cyan device w/ the legacy BIOS, the second modprobe snd-sof-acpi fails. The issue happens with the max98090 codec as well as nocodec mode.
What we'd need is try PR #1504 in a Chrome/Coreboot environment to figure out what is missing.
yes, I thought about trying it on ChromeOS environment also, @plbossart do you have ChromeOS kernel building environment for BSW, is it 4.14 based or even older? We need SOF driver back-port effort to it, right?
BTW, if we have any other non-chromeos based BSW platforms, it could be much easier for us to perform this trying.
@keyonjie I have two CHT devices, which is theory are identical to BSW.
For the chrome boot, @cujomalainey provided some answers in one of the issues. It's relatively complicated so you may want to talk to your neighbors in JF on how to set-up a working development environment where you can use the latest kernel. Backporting is a non-starter.
I am working on backporting SOF for Cyan to our 4.19 tree. I am still debugging the boot issue where if boot beep is played then audio is corrupted/broken. There is another issue where if the 3rd DMAC on CHT is enabled then it just hangs the pipelines in the DSP. (See the kconfig I added to disable it)
I am working on backporting SOF for Cyan to our 4.19 tree. I am still debugging the boot issue where if boot beep is played then audio is corrupted/broken. There is another issue where if the 3rd DMAC on CHT is enabled then it just hangs the pipelines in the DSP. (See the kconfig I added to disable it)
Good to know that you have already made these finding and progress on SOF for Cyan, @cujomalainey can you share the branch name of that?
The change is already on master. Look for CONFIG_CHERRYTRAIL_EXTRA_DW_DMA
@plbossart @keyonjie I did some tests today and what I found was that if I modify the kmod tests to do:
sof_remove.sh --> rtcwake -s 2 -m mem -> sof_insert.sh
Then, I can run the sof_bootloop.sh any number of times. I cant quite explain what happens when we run rtwake though. Could it possibly power off the LPE and then sof_insert.sh works like a fresh boot?
@plbossart @keyonjie I did some tests today and what I found was that if I modify the kmod tests to do:
sof_remove.sh --> rtcwake -s 2 -m mem -> sof_insert.sh
Then, I can run the sof_bootloop.sh any number of times. I cant quite explain what happens when we run rtwake though. Could it possibly power off the LPE and then sof_insert.sh works like a fresh boot?
Yes, I explained that previously, doing an S3 cycle before reloading will work-around the issue, looks some power off from PMC will help cold reset the Audio DSP.
@plbossart @keyonjie I did some tests today and what I found was that if I modify the kmod tests to do: sof_remove.sh --> rtcwake -s 2 -m mem -> sof_insert.sh Then, I can run the sof_bootloop.sh any number of times. I cant quite explain what happens when we run rtwake though. Could it possibly power off the LPE and then sof_insert.sh works like a fresh boot?
Yes, I explained that previously, doing an S3 cycle before reloading will work-around the issue, looks some power off from PMC will help cold reset the Audio DSP.
that really hints at a BIOS issue on Cyan. No such problem with UEFI versions on other CHT devices.
that really hints at a BIOS issue on Cyan. No such problem with UEFI versions on other CHT devices.
@plbossart @keyonjie I finally figured out a way to get this working. The trick was to enable PM for the device. I've still got a small niggle to resolve and then I will submit the PR
fixed now
I can't recall is this is supposed to work and if the tests are actually working: I only see this
"test case check-kmod-load-unload-after-playback.sh is SKIP! Catch ignore field of test-case: https://github.com/thesofproject/linux/issues/1449!"
not exactly useful log...
not exactly useful log...
Well the test was supposedly not run at all.
To be honest I was surprised to see this bug explicitly pointed at because the above comments sound like this has been fixed for good a long time ago. So maybe this SKIP should point at a different/new bug? If correct let's not close this one until the SKIP points at the right one.
Is there some (internal?) git log / record of who added the SKIP and if not should there be an SKIP author field maybe? @xiulipan, @keqiaozhang ?
@marc-hb ,please refer to sof-framework/parameters/pm-known-issue.csv.
I checked the kmod test manually on BSW and passed the test, I think we can add this test back.
that really hints at a BIOS issue on Cyan. No such problem with UEFI versions on other CHT devices.
@marc-hb @keqiaozhang @plbossart The issue used to be confirmed as a BIOS issue on our BSW device. Not sure if this is fixed now with current code base. I think we can do a new round manually test on our BSW
@fredoh9 Could you help to manually check on BSW to see if the known issue is fixed on BSW DUT now?
Reported in https://sof-ci.01.org/sofpr/PR3747/build7934/devicetest/?model=BSW_CYN_MAX98090&testcase=check-kmod-load-unload, https://sof-ci.01.org/sofpr/PR3747/build7934/devicetest/?model=BSW_CYN_MAX98090&testcase=check-kmod-load-unload-after-playback
@marc-hb This issue is just as Xiuli described, is a known issue. the link you give just tells us that the test is skipped, just because of this #1449 issue. I guess you may misunderstand the information in the result page.
stress test failed on bsw-cyn-max98090:
[88679.761719] kernel: sof-audio-acpi 808622A8:00: error: firmware boot failure
[88679.761805] kernel: sof-audio-acpi 808622A8:00: invalid header size 0xffffffff. FW oops is bogus
[88679.761838] kernel: sof-audio-acpi 808622A8:00: error: unexpected fault 0x90020000 trace 0x00003100
[88679.761871] kernel: sof-audio-acpi 808622A8:00: error: ipc host -> DSP: pending no complete no raw 0x4000000000003100
[88679.761901] kernel: sof-audio-acpi 808622A8:00: error: mask host: pending no complete yes raw 0x1
[88679.761929] kernel: sof-audio-acpi 808622A8:00: error: ipc DSP -> host: pending yes complete no raw 0x90020000
[88679.761957] kernel: sof-audio-acpi 808622A8:00: error: mask DSP: pending yes complete yes raw 0x3
[88679.761986] kernel: sof-audio-acpi 808622A8:00: error: failed to boot DSP firmware -5
[88679.763312] kernel: sof-audio-acpi 808622A8:00: error: sof_probe_work failed err: -5
yes, there's a known coreboot issue that hasn't been fixed. I don't think we can keep any tests that re-downloads firmware, which rules out kmod tests unfortunately.
@marc-hb This issue is just as Xiuli described, is a known issue. the link you give just tells us that the test is skipped, just because of this #1449 issue. I guess you may misunderstand the information in the result page.
If this is a known issue that requires skipping tests then it's obviously not "fixed". Yet it was closed as "fixed" in May 2020, so something is wrong somewhere and that's why I re-opened it. No need to understand what exactly is wrong to re-open.
Are you saying this issue should have been closed as "wontfix"?
I think what happened is that we thought we had a fix and closed the issue, but then later on found the problem remained and the root cause so we disabled the tests without updating this Linux issue.
I am positive this kmod test cannot work reliably.
Agree with @plbossart. This issue should be closed as "wont fix"
Closing as wontfix, thanks everyone!
On BSW_CYN_MAX98090, running sof_insert.sh after sof_remove.sh scripts to test sof module unload/reload capability, it will hang the whole Linux system.