Closed zhuqingliang closed 6 years ago
the reproduce rate is about 15%
No errors/xruns in neither dmesg nor rmbox logs. Problem should be checked from kernel side.
@zhuqingliang Does this issue only happen on CNL?
If we confirmed, that the error isn't coming from the FW, can we move this issue to kernel?
@zhuqingliang I can't see any errors on the CNL logs.
@RanderWang on BYT looks like from log that kernel is sending 2 IPC messages at once.
[48194.809293] sof-audio sof-audio: stream_tag 0 [48194.809310] sof-audio sof-audio: pcm: trigger stream 0 dir 0 cmd 1 [48194.809334] sof-audio sof-audio: ipc: send 0x60010000 [48194.809357] sof-audio sof-audio: ipc: send 0x60040000 [48194.809490] sof-audio sof-audio: error: reply expected 0xc got 0x14 bytes [48194.809509] sof-audio sof-audio: error: ipc error for 0x60040000 size 0xc [48194.809521] Passthrough: ASoC: trigger FE failed -22 [48194.809528] sof-audio sof-audio: pcm: trigger stream 0 dir 0 cmd 0 [48194.809541] sof-audio sof-audio: ipc: send 0x60050000 [48194.810744] sof-audio sof-audio: error: ipc error for 0x60050000 size 0xc [48194.810765] Passthrough: ASoC: trigger FE failed -22 [48195.068105] sof-audio sof-audio: ipc rx: 0x90020000 done [48195.133902] sof-audio sof-audio: error: ipc timed out for 0x60010000 size 0x60 [48195.133958] sof-audio sof-audio: error: unexpected fault 0x90020000 trace 0x60050000 [48195.133971] sof-audio sof-audio: error: waking up any trace sleepers [48195.134007] sof-audio sof-audio: error: got wrong posn offset 0xffffa08b for PCM -110 [48195.134022] sof-audio sof-audio: ASoC: sof-audio hw params failed: -110 [48195.134037] Passthrough: ASoC: hw_params FE failed -110
Update: This issue still can be reproduced with latest code base. When input/output error occurs, aplay and arecord will stop immediately, but no ipc timed out or firmware panic error, so you can rerun aplay and arecord.
@keqiaozhang can you try this again on BYT after reverting kernel commit 145f6f21fe0d35d02443d921bdb435cf80be8aee.
Can you paste the cmd line console log when the error happens on CNL/APL.
@lgirdwood After reverting the patch, this issue is hard to reproduce on BYT, I tested it more than 100 times. But on CNL and APL, this issue still can be reproduced and reproducibility is low. test@test-CannonLake:~$ aplay -Dhw:0,0 -fS16_LE -c2 -r 48000 /dev/zero -d 3 & arecord -Dhw:0,0 -c2 -r 48000 -fS16_Le /dev/null -d 3 [1] 2216 Recording WAVE '/dev/null' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo Playing raw data '/dev/zero' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo arecord: pcm_read:2103: read error: Input/output error
There're no error message in dmesg or rmbox.
@RanderWang looks like locking on HW IPC needs some updates. Can you revert the BYT kernel patch too.
@lgirdwood I test it with latest code and only find the error: aplay: main:788: audio open error: Device or resource busy.
It is not caused by my patch and I got this bug for a long time. there is a anther error message without my patch.
I am debugging it and will update what I find
@RanderWang ok, I'm just suspicious that we see 2 IPC commands.
I tested with latest kernel driver & FW, and the batch file from keqiao with a few times. all the tests are successful and no any error. @zhuqingliang can you help to validate it again.
@RanderWang I tested with latest code base and this issue still exist. sof : bfc7e467f7b283d0b7db5dff7321c8c5ec400980 soft : fe1f3ba18072ab70697d00648cd110a3131c12b4 sof-dev : 2a7a5d94466a903a05b42093eb2a383510704ded tplg file : test-ssp0-mclk-0-I2S-volume-s16le-s16 le-48k-24000k-nocodec.tplg complier: xcc
It cant be reproduced on BYT, and failed on APL. So I will debug it on APL
The dai position of playback is overwritten by record, so no dai position is updated to alsa
ipc: firmware initiated, msg:0x600a0000, msg_ext:0x0 ipc: firmware initiated, msg:0x600a0006, msg_ext:0x0 ipc rx: 0x600a0006 posn : host 0x3f00 dai 0x3f00 dir 0x1 PCM: stream 0 dir 1 DMA position 4032 DAI position 4032 ipc rx done: 0x600a0006 PCM: stream 0 dir 1 DMA position 4032 DAI position 4032 ipc: firmware initiated, msg:0x600a0000, msg_ext:0x0 ipc rx: 0x600a0000
@lgirdwood @xiulipan the msg sent to host is not serialized. In dsp we need to check the pre-msg has been processed when sending next msg. I check the APL spec, the done bit in DSP is cleared immediately after host clear busy bit in IRQ function. Now I try find a way to clear done bit after a msg is processed, but I failed to do it.
Hi xiuli, do you have any finding?
@RanderWang good finding. this will surely cause some problem. Let use find a workaround for it to confirm the bug.
@lgirdwood I think this may also influenced DMA trace position update IPC, maybe #210 have some related to this issue.
@lgirdwood @xiulipan I found a correct register setting sequence. Next I will submit a PR to refine SOF and kernel.
@RanderWang @xiulipan notifications should be serialised. A queue has already been implemented on FW side to help with this, so it looks like the notification busy status is incorrect.
@lgirdwood @xiulipan
I make the patch, it works with some random IPC timeout at topology loading time. The IPC timeout issue is not caused by my patch, but it just makes a interrupt issue come out. I debug with xiuli , find that the IPC interrupt is received by DSP, but the FW sleep at "waiti", so the IPC_CMD function cant be scheduled. It is caused by DMA trace interrupt.
I and xiuli will go on debugging it.
Next time I will update my GCC. According to xiuli, my patch works well with newer GCC.
@RanderWang how do you know IRQ is received by DSP when FW is at waiti ? What about IRQ status and masks (there will be 2 mask registers that can block this IRQ). One is IPC SHIM mask register and the other is DSP register INTENABLE
could be fixed by PR https://github.com/thesofproject/linux/pull/88 and #232
@lgirdwood let me make some clarify. When test on Rander's dev platfom with his build env. We meet some IPC timeout issues on BOOT up tplg load time. But the DSP is not dead since it will recover to receive IPC from HOST to finish other tplg load work and can playback on the succeed pipeline.
With debug, we found that the DSP receive the IPC from host, but did not handle it ontime.
After add log to irq_handler
and ipc_platform_do_cmd
, we found that the DSP receive the IPC IRQ, but did not enter ipc_platform_do_cmd to handler it after about 500ms. It seems that we have hang in wait_for_interrupt(0);
in the while loop of do_task
.
We suspect that this issues is related to DMA TRACE and after disable DMA TRACE, this boot up bug is gone. And test with my build env, this bug is also gone. Maybe some similar bug to #210.
@xiulipan two changes can be made
1) can you add a check prior to waiti that checks the INTENABLE register against a mask of known and active interrupts for that platform. if INTENABLE & mask == 0 then we will never wake up. i.e. platform mask would always have IPC IRQ enabled, plus any active timers, plus any DMA. This check should return immediately (and trace once) if no active IRQ sources are set prior to waiti.
2) The fix is probably here ` while (1) { / sleep until next IPC or DMA / sa_enter_idle(sof); wait_for_interrupt(0);
/* now process any IPC messages from host */
ipc_process_msg_queue();
<<< consider IPC IRQ happening here..........will have to wait on next DMA trace IRQ to process.
/ schedule any idle tasks / schedule(); }`
ipc_process_msg_queue() should leave the IPC mask enabled. IPC IRQ mask can then be cleared berfore wiati is called.
@lgirdwood
ipc_process_msg_queue();
?
But in some ipc handler function, such as stop/pause. We have some wait for complete function that shares the same wait function. What should we do to these functions, to enable only DMA level irq with the right MASK?@xiulipan , no I mean keep the IPC IRQ disabled until we enter WFI
cnl is fixed by https://github.com/thesofproject/sof/pull/235 apl is fixed by https://github.com/thesofproject/sof/pull/234
But there is another issue on apl with this PR. We will go on working on it.
@RanderWang ok, but my comment 2 above still needs fixed as it's a race.
@lgirdwood Dmesg: Failed to new a dai comp:
[ 3.588259] sof-audio sof-audio: error: ipc timed out for 0x30010000 size 0x38 [ 3.588317] sof-audio sof-audio: status: fw entered - code 00000005 [ 3.588369] sof-audio sof-audio: error: unexpected fault 0x00000000 trace 0x00004000 [ 3.588411] sof-audio sof-audio: error: waking up any trace sleepers [ 3.588453] sof-audio sof-audio: error: DSP failed to add widget id 0 type 28 name : SSP1.IN stream NoCodec-1 reply 0 [ 3.588513] sof-audio sof-audio: ASoC: failed to load widget SSP1.IN [ 3.588550] sof-audio sof-audio: error: tplg component load failed -110 [ 3.588597] sof-audio sof-audio: error: failed to load DSP topology -22 [ 3.588634] sof-audio sof-audio: ASoC: failed to probe component -22 [ 3.588685] sof-nocodec sof-nocodec: ASoC: failed to instantiate card -22 [ 3.588779] sof-nocodec: probe of sof-nocodec failed with error -22 [ 3.773269] sof-audio sof-audio: error: rx list empty but received 0x0 [ 3.773276] sof-audio sof-audio: error: can't find message header 0x30010000 [ 3.773298] sof-audio sof-audio: ipc rx: 0x90020000: GLB_TRACE_MSG
Confirmed from the trace log
0x1090 [0.086558] delta [0.000002] wait WFE //WFI start 0x10a0 [0.086559] delta [0.000001] value 0x00000000beef0000 // INTENABLE dump with some prefix 0x10b0 [0.086624] delta [0.000064] ipc IRQ 0x10c0 [0.086625] delta [0.000002] ipc Rpy // some reply message from DMA trace enable 0x10d0 [0.086627] delta [0.000002] wait WFX // WFI end 0x10e0 [0.086629] delta [0.000002] pipe sch //schedule in 0x10f0 [0.086631] delta [0.000002] ipc IRQ // new ipc 0x30010000 comes 0x1100 [0.086633] delta [0.000002] ipc Nms 0x1110 [0.086636] delta [0.000003] wait WFE //WFI start 0x1120 [0.086637] delta [0.000001] value 0x00000000beef0000 // all IRQ enable 0x1130 [0.581273] delta [0.494636] host GwU // DMA trace 0x1140 [0.581275] delta [0.000002] pm put 0x1150 [0.581313] delta [0.000038] wait WFX 0x1160 [0.581315] delta [0.000001] ipc Cmd // ipc_process_msg_queue 0x1170 [0.581317] delta [0.000002] ipc tcn 0x1180 [0.581322] delta [0.000005] dai new 0x1190 [0.581335] delta [0.000013] value 0x0000000000000001
So I think we need to both disable IPC irq after WFI.
@xiulipan I'm not following your last statement. I mean something like this
diff --git a/src/ipc/apl-ipc.c b/src/ipc/apl-ipc.c
index 233d527b..0be00b0d 100644
--- a/src/ipc/apl-ipc.c
+++ b/src/ipc/apl-ipc.c
@@ -130,9 +130,6 @@ done:
/* write 1 to clear busy, and trigger interrupt to host*/
ipc_write(IPC_DIPCT, ipc_read(IPC_DIPCT) |IPC_DIPCT_BUSY );
- /* unmask Busy interrupt */
- ipc_write(IPC_DIPCCTL, ipc_read(IPC_DIPCCTL) | IPC_DIPCCTL_IPCTBIE);
-
// TODO: signal audio work to enter D3 in normal context
/* are we about to enter D3 ? */
if (iipc->pm_prepare_D3) {
@@ -144,6 +141,12 @@ done:
}
+void ipc_platform_set_ready(struct ipc *ipc)
+{
+ /* unmask Busy interrupt */
+ ipc_write(IPC_DIPCCTL, ipc_read(IPC_DIPCCTL) | IPC_DIPCCTL_IPCTBIE);
+}
+
void ipc_platform_send_msg(struct ipc *ipc)
{
struct ipc_msg *msg;
diff --git a/src/ipc/handler.c b/src/ipc/handler.c
index 9e0401df..835a2132 100644
--- a/src/ipc/handler.c
+++ b/src/ipc/handler.c
@@ -1119,6 +1119,11 @@ out:
return ret;
}
+void ipc_process_set_ready(void)
+{
+ ipc_platform_set_ready(_ipc);
+}
+
/* process current message */
int ipc_process_msg_queue(void)
{
diff --git a/src/tasks/audio.c b/src/tasks/audio.c
index 5beae08a..2c55cee9 100644
--- a/src/tasks/audio.c
+++ b/src/tasks/audio.c
@@ -83,6 +83,9 @@ int do_task(struct sof *sof)
/* sleep until next IPC or DMA */
sa_enter_idle(sof);
+
+ ipc_process_set_ready();
+
wait_for_interrupt(0);
/* now process any IPC messages from host */
lrg@borr:~/source/reef/sof.git$
@lgirdwood Last comment is some log that confirmed that we will have some IPC IRQ outside the WFI.
I will try this patch to see if it could work. So we do not mask Interrupt here now? But to disable the IPC IRQ come into DSP?
@xiulipan the intention here is that the IPC IRQ is masked until we enter WFI, meaning we catch all IPC IRQ. The current code has a race if a new IPC happens after the previous one is processed but before we enter WFI (i.e. IPC IRQ occurs but it is not processed until after WFI).
@lgirdwood Confirmed the race with trace log above.
New issues with TRACEV enable
0x2870 [0.060252] delta [0.000041] wait WFE //WFI enter 0x2880 [0.060254] delta [0.000002] ipc IRQ // new IPC comes 0x2890 [0.060257] delta [0.000002] ipc Nms 0x28a0 [0.060260] delta [0.000003] value 0x00000000beef0000 // this is INTENABLE dump value 0x28b0 [0.555001] delta [0.494741] host GwU // DMA trace? 0x28c0 [0.555004] delta [0.000003] pm put 0x28d0 [0.555043] delta [0.000039] wait WFX // WFI exit 0x28e0 [0.555045] delta [0.000002] ipc Cmd // ipc message handler 0x28f0 [0.555048] delta [0.000003] ipc Ibn 0x2900 [0.555055] delta [0.000007] buffer new // timeout here
It seems the ipc IRQ
is happening not inside the arch_wait_for_interrupt(level);
but in the tracev_event
If that means we should not have a log before WAITI, otherwise we may not wait what we expect.
static inline void wait_for_interrupt(int level)
{
tracev_event(TRACE_CLASS_WAIT, "WFE");
wait_atomic_check;
arch_wait_for_interrupt(level); // asm WAITI called here.
tracev_event(TRACE_CLASS_WAIT, "WFX");
}
I have another bug with this patch in stress test. I am debugging it.
[ 47.495727] sof-audio sof-audio: ipc rx done: 0x600a0000 [ 47.495766] sof-audio sof-audio: ipc rx: 0x900a0000 [ 47.495770] sof-audio sof-audio: error: unhandled trace message a0000 [ 47.495924] sof-audio sof-audio: pcm: free stream 0 dir 0 [ 47.496162] sof-audio sof-audio: ipc rx done: 0x900a0000 [ 47.496188] sof-audio sof-audio: ipc tx: 0x60030000 [ 47.520400] sof-audio sof-audio: cstream 1 status 0x24 [ 47.520461] sof-audio sof-audio: ipc rx: 0x600a0006
the bug disappears after I clean and rebase my kernel code. So it is not a bug.
@lgirdwood
I reviewed your patch. Do we need to check whether the IPC interrupt has been masked then do unmask? just like
if(masked IPC busy bit)
umask IPC busy bit
sorry, make a mistake
@lgirdwood I will create a new issue to track this new issue #238
I tested this issue on BYT on latest code base. Here are some findings: reproduce steps:
And one time after error happened, I found call traces in dmesg. But this is hard to reproduce(<5%) Logs: dmesg-8.27.log rmbox-8.27.log
Environment: sof:4aff776d1c290542ded98f8aa22f50d4b9477d37 soft:41861aac373eb1f6978bfdb0bbe97cd882b1785a sof-dev:7481f7af7ac711504ae52fa9ae5cb5c3322ea91f tplg:test-ssp2-mclk-0-I2S-volume-s16le-s16le-48k-19200k-codec.tplg
@ZhendanYang we need to be clear as we have two type of IPC errors. 1) IPC timeout - fatal usually means FW has hung/crashed. 2) IPC error - driver sent IPC request that FW cant handle (normal operation - may not be FW bug, but could be driver or topology issue.).
The last log is caused by the IPC notification coming when ALSA has closed the stream (but it looks liek FW did not get that message).
@lgirdwood I have checked the BYT dmesg. It seems there are some kernel panic
[ 3089.133253] WARNING: CPU: 1 PID: 1944 at sound/core/pcm_lib.c:1793 snd_pcm_period_elapsed+0x9c/0xb0 [snd_pcm] [ 3089.133259] Modules linked in: nls_iso8859_1 snd_soc_sst_bytcr_rt5651 intel_rapl intel_soc_dts_thermal intel_soc_dts_iosf intel_powerclamp coretemp kvm_intel kvm irqbypass punit_atom_debug crct10dif_pclmul sof_acpi_dev crc32_pclmul snd_sof_intel_bdw ghash_clmulni_intel snd_sof_intel_byt pcbc snd_sof_nocodec snd_sof_intel_hsw aesni_intel snd_sof aes_x86_64 snd_sof_xtensa_dsp crypto_simd snd_soc_rt5651 snd_soc_acpi_intel_match cryptd snd_soc_acpi snd_soc_rl6231 glue_helper snd_soc_core snd_seq_midi snd_compress snd_seq_midi_event snd_pcm_dmaengine ac97_bus snd_rawmidi snd_seq snd_pcm snd_seq_device snd_timer snd soundcore mei_txe mei lpc_ich spi_pxa2xx_platform mac_hid 8250_dw pwm_lpss_platform pwm_lpss parport_pc ppdev lp parport ip_tables x_tables autofs4 mmc_block i915 igb dca drm_kms_helper i2c_algo_bit [ 3089.133324] syscopyarea sysfillrect sysimgblt fb_sys_fops video sdhci_acpi drm sdhci [ 3089.133334] CPU: 1 PID: 1944 Comm: kworker/1:2 Tainted: G W 4.18.0+ #11 [ 3089.133335] Hardware name: ADI Minnowboard Turbot D0 PLATFORM/MinnowBoard Turbot, BIOS MNW2MAX1.X64.0097.R01.1709211052 09/21/2017 [ 3089.133341] Workqueue: events ipc_msgs_rx [snd_sof] [ 3089.133352] RIP: 0010:snd_pcm_period_elapsed+0x9c/0xb0 [snd_pcm] [ 3089.133353] Code: 00 00 48 89 df e8 84 f7 ff ff 85 c0 78 ca f6 83 20 01 00 00 01 74 c1 48 8b bb 18 01 00 00 be 01 00 00 00 e8 c6 e1 d7 ff eb ae <0f> 0b 5b 5d 41 5c c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f [ 3089.133407] RSP: 0018:ffffad90c1a2fe00 EFLAGS: 00010246 [ 3089.133410] RAX: 0000000000000000 RBX: ffff88e732884818 RCX: 0000000000000006 [ 3089.133412] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88e6f489a400 [ 3089.133413] RBP: 0000000000000000 R08: 000000000010a6fe R09: 0000000000000004 [ 3089.133415] R10: 00000000000003b5 R11: 0000000000000001 R12: ffff88e732679090 [ 3089.133417] R13: 0000000000000000 R14: ffff88e732fe2e40 R15: 0ffff88e736aa620 [ 3089.133419] FS: 0000000000000000(0000) GS:ffff88e736a80000(0000) knlGS:0000000000000000 [ 3089.133421] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3089.133423] CR2: 00007fe5a802f298 CR3: 00000000733ce000 CR4: 00000000001006e0 [ 3089.133424] Call Trace: [ 3089.133433] ipc_msgs_rx+0x32d/0x560 [snd_sof] [ 3089.133438] process_one_work+0x1f4/0x3e0 [ 3089.133442] worker_thread+0x2d/0x3e0 [ 3089.133445] ? process_one_work+0x3e0/0x3e0 [ 3089.133448] kthread+0x113/0x130 [ 3089.133451] ? kthread_create_worker_on_cpu+0x70/0x70 [ 3089.133455] ret_from_fork+0x35/0x40 [ 3089.133458] ---[ end trace 629812cbfc63d1ab ]---
This seems to be related with position update. I will try to reproduce that on BYT. But this bug seems to be related with kernel.
This issue can't be reproduced on APL and CNL with stable-1.2 branch(commit 251e68f291853e3c22ce617f23f94c01231268db), need to test on BYT before close this issue. Test on BYT is blocked by #274 .
Latest status:
on CNL with master branch: simultaneous playback and capture will cause Input/output error, no error in dmesg and can recover itself.
on APL with master branch: run "aplay -D hw:0,0 -c 2 -r 48000 -f s16_le /dev/zero --period-size=192 --buffer-size=384 -d 5 & arecord -D hw:0,6 -c 2 -r 48000 -f s32_le /dev/null --period-size=192 --buffer-size=384 -vv -i -d 5" at second time will cause DSP panic. [ 207.976202] sof-audio sof-audio: error : DSP panic! [ 207.976209] sof-audio sof-audio: panic: dsp_oops_offset 788480 offset 788480 [ 207.976213] sof-audio sof-audio: status: fw entered - code 00000005 [ 207.976233] sof-audio sof-audio: error: runtime exception [ 207.976237] sof-audio sof-audio: error: trace point 00004000 [ 207.976240] sof-audio sof-audio: error: DSP Firmware Oops [ 207.976245] sof-audio sof-audio: error: Exception Cause: LoadStorePIFDataErrorCause, Synchronous PIF data error during LoadStore access [ 207.976249] sof-audio sof-audio: EXCCAUSE 0x0000000d EXCVADDR 0x00000028 PS 0x00070025 SAR 0x00000019 [ 207.976255] sof-audio sof-audio: EPC1 0xbe01cb86 EPC2 0xbe010d11 EPC3 0xbe0122bd EPC4 0x00000000 [ 207.976257] sof-audio sof-audio: EPC5 0xbe010d11 EPC6 0x00000000 EPC7 0x00000000 DEPC 0x00000000 [ 207.976261] sof-audio sof-audio: EPS2 0x00060720 EPS3 0x00060022 EPS4 0x00000000 EPS5 0x00060720 [ 207.976267] sof-audio sof-audio: EPS6 0x00000000 EPS7 0x00000000 INTENABL 0x000101d8 INTERRU 0x00000222 [ 207.976270] sof-audio sof-audio: stack dump from 0xbe07fd50
on CNL with stable-1.2 branch:No such issue.
on APL with stable-1.2 branch: simultaneous playback and capture will cause Input/output error, no error in dmesg and can recover itself.
Test env: kernel: 50792bc sof master: f17abe644c3b583213914b3d98e27f55786ef8c7 sof stable-1.2: f7ea2e61606f78a047a7353cb8042242c93448dc
looks like a regression again between 251e68f and f7ea2e6 on stable-1.2? Can you help to bisect, @ZhendanYang
@ZhendanYang
This is a new issues with DMIC
Test with arecord -D hw:0,6 -f S32_LE tmp.wav -vv -i
and the second run will cause FW panic.
I create a new issuse https://github.com/thesofproject/sof/issues/391
Please do not test with DMIC in all stress test now.
This issue can't reproduce on stable-1.2 branch with latest code base on APL, BYT and CNL. Test env: sof stable-1.2: 7dd4b1d kernel sof-dev: bc51a6f soft master: d77beaf I'll test master branch later to see if we can close this issue.
I'll test master branch later to see if we can close this issue.
@ZhendanYang ping?
The reproducibility is about 30%. How to reproduce:
environment: on CNL sof-dev: 18917b91d619664c4e38b7180ff89bd9c1397758 sof: 5e83da7fc170dd9eccdf97e32a516a5b294a41bf soft: fdf39244b7684c4a17707f9bc9ffc92699230cb2 tplg file name : test-ssp0-mclk-0-I2S-volume-s24le-s24le-48k-24000k-nocodec.tplg log: dmesg.log memorywindow.log rmbox.log