Closed keqiaozhang closed 5 years ago
@Jiangxinx please help to check this issue on other platforms, like UP2 byt etc.
I can also reproduce this issue on WHL, I believe this is a common issue.
I have the same issue on WHL but after boot OS without making any playback, just checked dmesg:
Environment: Platform: WHL RVP - HDA mode Kernel: 2de29fae6 [thesofproject/linux: topic/sof-dev] +d04ccf494 [RanderWang/linux: cml-validation-20190415] +bfb0729dd [RanderWang/linux: cml-validation-20190415] +8a2b13a81 [RanderWang/linux: cml-validation-20190415] +63ceb8f68 [RanderWang/linux: cml-validation-20190415] +db89510d0 [RanderWang/linux: cml-validation-20190415]
Firmware: 00bbb113d [thesofproject/sof: master] Topology: 00bbb113d [thesofproject/sof: master] -> sof-hda-generic.tplg Pulse audio enabled.
@dabekjakub Please help to check from firmware side. Thanks!
@dabekjakub @emilchudzik That is an interesting issue, as both Volume and IPC used the scheduler. I suggest to check if there is any conflict or corruption happen before schedule tasks. https://github.com/thesofproject/sof/blob/e14ab7088bf4673f43396f02490f717f69de0de9/src/audio/volume.c#L343
@keqiaozhang @emilchudzik Could you check whether PR #1269 fixes this issue. In scheduler API refactor (#1089) task priority levels were changed to increasing order (0 - lowest; 1 - highest). I think that driver set pipeline priority level to default "0" value, so now pipeline task has lower priority level than ipc task.
@bkokoszx This issue still can be reproduced after applying PR #1269 .
[ 153.062892] sof-audio-pci 0000:00:0e.0: ipc rx: 0x90020000: GLB_TRACE_MSG
[ 153.062898] sof-audio-pci 0000:00:0e.0: ipc rx done: 0x90020000: GLB_TRACE_MSG
[ 153.066971] sof-audio-pci 0000:00:0e.0: ipc tx: 0x50020000: GLB_COMP_MSG: SET_VALUE: GET_VALUE
[ 153.373058] sof-audio-pci 0000:00:0e.0: error: ipc timed out for 0x50020000 size 108
[ 153.373064] sof-audio-pci 0000:00:0e.0: status: fw entered - code 00000005
[ 153.373102] sof-audio-pci 0000:00:0e.0: error: unexpected fault 0x00000000 trace 0x00004000
[ 153.373104] sof-audio-pci 0000:00:0e.0: error: waking up any trace sleepers
[ 153.373106] sof-audio-pci 0000:00:0e.0: error: set/get ctrl ipc comp 35
[ 153.373121] sof-audio-pci 0000:00:0e.0: ipc tx: 0x50010000: GLB_COMP_MSG: SET_VALUE: SET_VALUE
[ 153.677057] sof-audio-pci 0000:00:0e.0: error: ipc timed out for 0x50010000 size 108
[ 153.677060] sof-audio-pci 0000:00:0e.0: status: fw entered - code 00000005
[ 153.677097] sof-audio-pci 0000:00:0e.0: error: unexpected fault 0x00000000 trace 0x00004000
[ 153.677098] sof-audio-pci 0000:00:0e.0: error: waking up any trace sleepers
[ 153.677101] sof-audio-pci 0000:00:0e.0: error: set/get ctrl ipc comp 35
[ 153.677343] sof-audio-pci 0000:00:0e.0: ipc tx: 0x50020000: GLB_COMP_MSG: SET_VALUE: GET_VALUE
I have no issue with PR #1269 applied on the current sof master branch without reverting any others commits. Without this PR I had an erros even during boot. With this patch loading FW is ok, and changing volume also works good:
Environment: Platform: WHL RVP - HDA mode Kernel: 2de29fae6 [thesofproject/linux: topic/sof-dev] +d04ccf494 [RanderWang/linux: cml-validation-20190415] +bfb0729dd [RanderWang/linux: cml-validation-20190415] +8a2b13a81 [RanderWang/linux: cml-validation-20190415] +63ceb8f68 [RanderWang/linux: cml-validation-20190415] +db89510d0 [RanderWang/linux: cml-validation-20190415]
Firmware: 1b15fad [thesofproject/sof: master] + https://github.com/thesofproject/sof/pull/1269/commits/624c18935589b4f87bc4a058959ef5d53ac26882 Topology: 1b15fad [thesofproject/sof: master] -> sof-hda-generic.tplg
Pulse audio enabled or disabled - also works.
@emilchudzik Please see my comments in #1269 I have some concern about scheduler priority
@emilchudzik i've tried #1269 with sof-master: 00bbb113d0 and it didnt help. I still see the timeouts happening with the volume stress test script.
@lgirdwood I wanted to point out one thing here. As crazy as it sounds, adding a trace message before processing the ipc command reduces the rate of occurence of the ipc timeout. This seems to suggest a race in the IRQ handler to me.
diff --git a/src/drivers/intel/cavs/ipc.c b/src/drivers/intel/cavs/ipc.c
index 5ca723f5..68d0ee08 100644
--- a/src/drivers/intel/cavs/ipc.c
+++ b/src/drivers/intel/cavs/ipc.c
@@ -143,6 +143,8 @@ void ipc_platform_do_cmd(struct ipc *ipc)
struct sof_ipc_reply reply;
int32_t err;
+ trace_ipc("in ipc_platform_do_cmd()\n");
+
/* perform command and return any error */
err = ipc_cmd();
The volume stress test passes with the below change along with PR #1269
diff --git a/src/drivers/intel/cavs/ipc.c b/src/drivers/intel/cavs/ipc.c
index 5ca723f5..f7facb82 100644
--- a/src/drivers/intel/cavs/ipc.c
+++ b/src/drivers/intel/cavs/ipc.c
@@ -109,6 +109,7 @@ static void ipc_irq_handler(void *arg)
_ipc->host_pending = 1;
ipc_schedule_process(_ipc);
}
+ return;
}
/* reply message(done) from host */
@@ -143,6 +144,8 @@ void ipc_platform_do_cmd(struct ipc *ipc)
struct sof_ipc_reply reply;
int32_t err;
+ trace_ipc("do_cmd()\n");
+
/* perform command and return any error */
err = ipc_cmd();
@ranj063 good find, can you confirm if the trace_ipc("do_cmd()\n"); is still needed ? The change here means we don't respond to DONE notification from the host until the next IRQ.
One other thing to check is that volume.c will be scheduling the ramp a lot and it's possible it's being scheduled more than once whilst it's still ramping (i.e. scheduled twice). We probably want to cancel it and rerun.
@lgirdwood I tried the volume cancel and schedule task already and.it didn't help.
The trace message is still needed. Without it, the ipc still randomly times out. Btw, I found this problem a.week ago with the fuzzer and I assumed the race was qemu related haha. Seems like it isn't.
@ranj063 ok, probably worth attaching GDB via sof-fuzzer to check state. To rule out locking it may also be worth while replacing your trace with a loop (you can vary the length - make sure it's not optimized out by the compiler, use a volatile loop counter if needed).
@lgirdwood I will try adding to the fuzzer today but the fuzzer has the drawback of needing a 5ms between IPC's which isnt the case on the real hardware. I will update soon
@lgirdwood i'm able to reproduce the timeout with the fuzzer and here're the registers after the timeout. Does it tell you anything? @mmaka1 could you please also take a look? This is on BYT btw. (qemu) info registers
PC=ff2d1ebb
LBEG=ff2d2ee2 LEND=ff2d2ef8 LCOUNT=00000000 SAR=0000000e
BR=00000000 SCOMPARE1=00000000 ACCLO=00000000 ACCHI=00000000
MR0=00000000 MR1=00000000 MR2=00000000 MR3=00000000
PREFCTL=00000000 WINDOW_BASE=00000005 WINDOW_START=00000020 IBREAKENABLE=00000000
MEMCTL=00000000 ATOMCTL=00000015 IBREAKA0=00000000 IBREAKA1=00000000
DBREAKA0=00000000 DBREAKA1=00000000 DBREAKC0=00000000 DBREAKC1=00000000
CONFIGID0=c2b3dbfe EPC1=ff2d030c EPC2=00000000 EPC3=ff2d2e9a
EPC4=ff2d1ebb EPC5=ff2d1ebb EPC6=00000000 EPC7=00000000
DEPC=00000000 EPS2=00000000 EPS3=00060720 EPS4=00060720
EPS5=00060720 EPS6=00000000 EPS7=00000000 CONFIGID1=1c85483e
EXCSAVE1=00000000 EXCSAVE2=ff2d2820 EXCSAVE3=ff2d28f4 EXCSAVE4=ff2d29c8
EXCSAVE5=ff2d2a9c EXCSAVE6=00000000 EXCSAVE7=00000000 CPENABLE=00000000
INTSET=00000400 INTCLEAR=00000000 INTENABLE=00078f48 PS=00060720
VECBASE=ff2c0400 EXCCAUSE=00000000 DEBUGCAUSE=00000000 CCOUNT=1f0db828
PRID=00000000 ICOUNT=00000000 ICOUNTLEVEL=00000000 EXCVADDR=00000000
CCOMPARE0=00000000 CCOMPARE1=00000000 CCOMPARE2=00000000
THREADPTR=00000000
A00=bf2cc830 A01=ff327fa0 A02=ff303274 A03=00000047
A04=0dead00a A05=ff301e8c A06=00000000 A07=00000000
A08=00000000 A09=ff327f80 A10=ff3069cc A11=ff301fb0
A12=ff306800 A13=00000000 A14=00000000 A15=ff306928
AR00=ff306800 AR01=00000000 AR02=00000000 AR03=ff306928
AR04=bf2d037b AR05=ff327f60 AR06=0f86a442 AR07=00000000
AR08=00060720 AR09=00060023 AR10=00000000 AR11=00000000
AR12=bf2cb3f5 AR13=ff327f50 AR14=00000000 AR15=ff30693c
AR16=00060720 AR17=00000f00 AR18=0f86a92f AR19=0000000a
AR20=bf2cc830 AR21=ff327fa0 AR22=ff303274 AR23=00000047
AR24=0dead00a AR25=ff301e8c AR26=00000000 AR27=00000000
AR28=00000000 AR29=ff327f80 AR30=ff3069cc AR31=ff301fb0
And gdb bt shows: (gdb) bt
at /home/ranjani/work/sof/sof/src/tasks/audio.c:76
at /home/ranjani/work/sof/sof/src/init/init.c:86
at /home/ranjani/work/sof/sof/src/init/init.c:102
@tlauda could you please help take a look at the register info above too?
Today, I run the simultaneous playback/recording test on GLK-ubuntuOS with latest code base and got another type of IPC times out error: Trace:
CORE LEVEL COMP_ID TIMESTAMP DELTA FILE_NAME CONTENT
0 1 DMA 54143363.802083 54143364.000000 intel/cavs/hda-dma.c:279 hda-dmac: 5 wait for buffer full timeout
0 1 HOST 54143370.104167 6.302083 src/audio/host.c:792 host_copy() error: dma_copy() failed, ret = 4294967234
0 1 PIPE 54143376.041667 5.937500 src/audio/pipeline.c:681 pipeline_copy() error: ret = -62, start->comp.id = 12, dir = 1
0 1 PIPE 2.13 54143382.968750 6.927083 src/audio/pipeline.c:789 pipeline_xrun_recover()
0 1 COMP 54143395.468750 12.500000 rc/audio/component.c:201 comp_set_state() error: wrong state = 5, COMP_TRIGGER_PREPARE
0 1 PIPE 54143400.937500 5.468750 src/audio/pipeline.c:365 pipeline_prepare() error: ret = -22,dev->comp.id = 8
0 1 PIPE 2.13 54143406.770833 5.833333 src/audio/pipeline.c:796 pipeline_xrun_recover() error: pipeline_prepare() failed, ret = -22
0 1 PIPE 2.13 54143412.083333 5.312500 src/audio/pipeline.c:865 pipeline_task(): xrun recover failed! pipeline will be stopped!
logger-t:
0 2 DAI 2.12 279782845.781250 3.697917 src/audio/dai.c:539 dai_comp_trigger(), START
0 1 DMA 279783073.437500 227.656250 intel/cavs/hda-dma.c:279 hda-dmac: 5 wait for buffer full timeout
0 1 HOST 279783079.635417 6.197917 src/audio/host.c:792 host_copy() error: dma_copy() failed, ret = 4294967234
0 1 PIPE 279783085.364583 5.729167 src/audio/pipeline.c:681 pipeline_copy() error: ret = -62, start->comp.id = 12, dir = 1
0 1 PIPE 2.13 279783092.239583 6.875000 src/audio/pipeline.c:789 pipeline_xrun_recover()
0 2 PIPE 2.13 279783096.979167 4.739583 src/audio/pipeline.c:358 pipeline_prepare()
0 2 HOST 279783100.937500 3.958333 src/audio/host.c:635 host_prepare()
0 1 COMP 279783104.739583 3.802083 rc/audio/component.c:201 comp_set_state() error: wrong state = 5, COMP_TRIGGER_PREPARE
0 1 PIPE 279783110.260417 5.520833 src/audio/pipeline.c:365 pipeline_prepare() error: ret = -22,dev->comp.id = 8
0 1 PIPE 2.13 279783116.250000 5.989583 src/audio/pipeline.c:796 pipeline_xrun_recover() error: pipeline_prepare() failed, ret = -22
0 1 PIPE 2.13 279783121.510417 5.260417 src/audio/pipeline.c:865 pipeline_task(): xrun recover failed! pipeline will be stopped!
dmesg:
[ 525.718085] sof-audio-pci 0000:00:0e.0: ipc tx: 0x60010000: GLB_STREAM_MSG: PCM_PARAMS
[ 526.021046] sof-audio-pci 0000:00:0e.0: error: ipc timed out for 0x60010000 size 108
[ 526.021052] sof-audio-pci 0000:00:0e.0: status: fw entered - code 00000005
[ 526.021091] sof-audio-pci 0000:00:0e.0: error: unexpected fault 0x00000000 trace 0x00004000
[ 526.021092] sof-audio-pci 0000:00:0e.0: error: waking up any trace sleepers
[ 526.021094] sof-audio-pci 0000:00:0e.0: error: hw params ipc failed for stream 2
[ 526.021097] sof-audio-pci 0000:00:0e.0: ASoC: 0000:00:0e.0 hw params failed: -110
[ 526.021100] Headset: ASoC: hw_params FE failed -110
dmesg.log logger.log logger-t.log
Test recipe: sof/tplg: master/50bd96d kernel: sof-dev/9086043
@tlauda @ranj063 looks like if we get a spurious IPC interrupt on BYT (not checked others yet) we dont clear it. The INTSET=00000400 INTCLEAR=00000000 INTENABLE=00078f48
shows the IPC IRQ is still set.....and this can race with driver setting/clearing BUSY/DONE bits.
static void irq_handler(void *arg)
{
uint32_t isr;
uint32_t imrd;
/* Interrupt arrived, check src */
isr = shim_read(SHIM_ISRD);
imrd = shim_read(SHIM_IMRD);
tracev_ipc("ipc: irq isr 0x%x", isr);
/* reply message(done) from host */
if (isr & SHIM_ISRD_DONE &&
!(imrd & SHIM_IMRD_DONE)) {
/* Mask Done interrupt before return */
shim_write(SHIM_IMRD, shim_read(SHIM_IMRD) | SHIM_IMRD_DONE);
interrupt_clear(PLATFORM_IPC_INTERRUPT);
do_notify();
}
/* new message from host */
if (isr & SHIM_ISRD_BUSY &&
!(imrd & SHIM_IMRD_BUSY)) {
/* Mask Busy interrupt before return */
shim_write(SHIM_IMRD, shim_read(SHIM_IMRD) | SHIM_IMRD_BUSY);
interrupt_clear(PLATFORM_IPC_INTERRUPT);
/* TODO: place message in Q and process later */
/* It's not Q ATM, may overwrite */
if (_ipc->host_pending) {
trace_ipc_error("ipc: dropping msg");
trace_ipc_error(" isr 0x%x imrd 0x%x ipcxh 0x%x",
isr, shim_read(SHIM_IMRD),
shim_read(SHIM_IPCXH));
} else {
_ipc->host_pending = 1;
ipc_schedule_process(_ipc);
}
}
}
@lgirdwood how should we deal with this?
@keqiaozhang Please keep GLK IPC related issues separated from BYT reports since there is different IPC HW in use. It is great to look at all of them and try to find some common root causes if there are any but most of the issues are very close to the platform specific HW (IPC HW, INT HW integration) and different piece of HW driver SOF code might be involved. [edit] I'm sorry @keqiaozhang , I missed this was BYT mixed into the thread initially created for GLK, not the other way around, so this comment should target BYT reports, not your GLK ones, in general still applies ;)
@mmaka1 fwiw, I'm beginning to suspect scheduling here (since it's appearing on GLK + CML). IPC and Volume ramps (which reschedule themselves) do schedule asynchronous tasks and we may have a race as enabling a trace item seemed to help.
@lgirdwood Above timeout reported by @keqiaozhang is waiting for buffer full timeout, so this is not typical IPC timeout. It should be put into separated issue. That particular error has two potential rootcauses:
@tlauda thanks I heard from @mmaka1 you have been deep diving the HW side. :) @mmaka1 @ranj063 and I had a quick call and @ranj063 will double check FW side to rule out locking, list manipulation etc in scheduler and IRQ logic.
@ranj063 this should change LL scheduler to IRQ level 1, EDF to level 4 and IPC IRQ at level 3
diff --git a/src/platform/intel/cavs/platform.c b/src/platform/intel/cavs/platform.c
index cab93797..dc1a455d 100644
--- a/src/platform/intel/cavs/platform.c
+++ b/src/platform/intel/cavs/platform.c
@@ -153,14 +153,14 @@ static const struct sof_ipc_window sram_window = {
struct timesource_data platform_generic_queue[] = {
{
.timer = {
- .id = TIMER3, /* external timer */
- .irq = IRQ_EXT_TSTAMP0_LVL2(0),
+ .id = TIMER1, /* external timer */
+ .irq = IRQ_NUM_TIMER1,
},
.clk = CLK_SSP,
.notifier = NOTIFIER_ID_SSP_FREQ,
- .timer_set = platform_timer_set,
- .timer_clear = platform_timer_clear,
- .timer_get = platform_timer_get,
+ .timer_set = timer_set,
+ .timer_clear = timer_clear,
+ .timer_get = timer_get_system,
},
{
.timer = {
@mmaka1 @ranj063 @tlauda looks like our IRQ handler will miss iicoming IRQs at the same level when it's processing.
static inline void irq_lvl2_handler(void *data, int level, uint32_t ilxsd,
uint32_t ilxmsd, uint32_t ilxmcd)
{
struct irq_desc *parent = (struct irq_desc *)data;
struct irq_desc *child = NULL;
struct list_item *clist;
uint32_t status;
uint32_t i = 0;
uint32_t unmask = 0;
/* mask the parent IRQ */
arch_interrupt_disable_mask(1 << level);
/* mask all child interrupts */
status = irq_read(ilxsd);
irq_write(ilxmsd, status);
/* handle each child */
while (status) { <<<<<<<<<<<<<<<<<<<<<<<<<<< status is now stale after 1st loop iteration.
/* any IRQ for this child bit ? */
if ((status & 0x1) == 0)
goto next;
/* get child if any and run handler */
list_for_item(clist, &parent->child[i]) {
child = container_of(clist, struct irq_desc, irq_list);
if (child && child->handler) {
child->handler(child->handler_arg);
unmask = child->unmask;
} else {
/* nobody cared ? */
trace_irq_error("irq_lvl2_handler() error: "
"nbc");
}
}
/* unmask this bit i interrupt */
if (unmask)
irq_write(ilxmcd, 0x1 << i);
next:
status >>= 1;
i++;
}
/* clear parent and unmask */
arch_interrupt_clear(level);
arch_interrupt_enable_mask(1 << level);
}
@lgirdwood testing now. Thanks
@lgirdwood with your patch in https://github.com/thesofproject/sof/issues/1263#issuecomment-484941979, the sound card fails to register. So maybe this change is too aggressive?
@lgirdwood @mmaka1 I did a quick test that @xiulipan and I discussed. I removed the call to schedule the vol_work in volume_ctrl_set_cmd() and I hit the same IPC timeout and this time a lot sooner. So I doubt this is a problem with interrupts getting missed but rather something more basic than that.
@ranj063 Can you try to do counting of handled volume IPCs? You could increment FW status register in memory window right before clearing BUSY bit. I would like to see if it's in fact FW lack of response or maybe host misses it.
I have a theory but Im not sure if it is right but let me try anyway.
We had the same volume test pass in the past and back then the host was scheduling a work for every new IPC that was sent. But recently, we've removed the work queue and we send the IPC's as soon as the DSP sends a DONE interrupt for the previous IPC. So I think that the context switching between IPC's between successive IPC's was helping. With that not happening anymore, maybe the DSP isnt able to handle the high rate of interrupts? @lgirdwood @mmaka1 is that a possibility?
@lgirdwood it looks like our theory might be correct.With you PR https://github.com/thesofproject/sof/pull/1290, I run into the same IPC time out when I flood the fw with test ipc;s(within 2000 iterations) error: ipc timed out for 0xb0010000 size 8 [ 91.679614] sof-audio-pci 0000:00:0e.0: status: fw entered - code 00000005 [ 91.679823] sof-audio-pci 0000:00:0e.0: error: unexpected fault 0x00000000 trace 0x00004000 [ 91.679826] sof-audio-pci 0000:00:0e.0: error: waking up any trace sleepers
Increasing the ipc task deadline to 200 (from 100) in the FW improves the test result but still fails within 10K iterations. It seems to suggest a bug in the EDF scheduler to me.
Actually after a bit of testing, I realised that the deadline doesn't seem to have much effect. The flood test just randomly fails within 10K iterations
@lgirdwood @ranj063 @tlauda @xiulipan @keyonjie @singalsu
Keyon found the one reason and discussed with me. I think it should be a root cause. User would send IPC cmd to FW to change volume and FW schedules a task vol_work to change the volume slowly and continuously.
A simple model for volume changing: change volume from 20 to 80. vol_work would be scheduled multiple times to change volume with a small step each time.
vol_work()
{
step = 4;
if volume < target volume
volume += step;
return 1ms to make it rescheduled after 1ms
else
return 0 to stop reschedule
}
For this case, the vol_work would be scheduled 15 times to change volume from 20 to 80 and spend 15ms. So If you send too much IPCs for volume in less than 15ms, all these IPCs could not be processed and finally timeout.
We could refine the work flow of volume adjustment. If the last volume task is still working, just change the value of target volume, return ok and the vol_work would dynamic check the target volume value then change the volume.
@RanderWang it's irrelevant how long the IPC work takes. IPC will only signal DONE when ready to accept new messages. IPC can schedule work and that work is run in a different context. Flood IPC must 10k iterations.
@ranj063 DSP HW can easily cope with high IPC rate, but in light of recent driver updates we have exposed a race in FW between IPC and scheduling. Can you craete a PR for your kernel flood test to allow the others to test too.
I am thinking out something improvement like this, can anybody working on this issue give a try if it take any effect:
diff --git a/src/audio/volume.c b/src/audio/volume.c
index 87d0c6a8..3ad8ee91 100644
--- a/src/audio/volume.c
+++ b/src/audio/volume.c
@@ -139,8 +139,10 @@ static uint64_t vol_work(void *data)
/* do we need to continue ramping */
if (again)
return VOL_RAMP_US;
- else
+ else {
+ cd->in_ramping = 0;
return 0;
+ }
}
/**
@@ -320,7 +322,10 @@ static int volume_ctrl_set_cmd(struct comp_dev *dev,
"invalid i = %u", i);
}
}
- schedule_task(&cd->volwork, VOL_RAMP_US, 0, 0);
+ if (!cd->in_ramping) {
+ cd->in_ramping = 1;
+ schedule_task(&cd->volwork, VOL_RAMP_US, 0, 0);
+ }
break;
case SOF_CTRL_CMD_SWITCH:
diff --git a/src/audio/volume.h b/src/audio/volume.h
index e27d1efb..633095ac 100644
--- a/src/audio/volume.h
+++ b/src/audio/volume.h
@@ -110,6 +110,7 @@ struct comp_data {
int32_t volume[SOF_IPC_MAX_CHANNELS]; /**< current volume */
int32_t tvolume[SOF_IPC_MAX_CHANNELS]; /**< target volume */
int32_t mvolume[SOF_IPC_MAX_CHANNELS]; /**< mute volume */
+ unt32_t in_ramping; /**< in volume stage */
/**< volume processing function */
void (*scale_vol)(struct comp_dev *dev, struct comp_buffer *sink,
struct comp_buffer *source, uint32_t frames);
@keyonjie this has nothing to do with volume, please read comment from @ranj063 about flood IPC.
@keyonjie I still can reproduce this issue on both GLK and WHL with your changes.
@ranj063 @lgirdwood IPC timeout is too generic and we are circling around. We need to identify what has actually happened:
Also we could try to disable ll scheduler during the test. Timer is on the same irq level as ipc handler, so maybe we still have some kind of bug/race condition there.
@tlauda the flood IPC test carried out by @ranj063 sends an IPC that the FW will schedule and return 0; It does nothing. See PR https://github.com/thesofproject/sof/pull/1290. This means there is no volume or LL scheduling, it's simple a race between EDF and IPC.
@lgirdwood Agent is scheduled on LL, so we always have tick of timer every 1 ms.
@tlauda SA should tick every 750ms on APL, but it could be easily commented out. Trace will also be every 500ms or shorter.
@lgirdwood Timer is configured to trigger every 1 ms, when there is at least one task in the queue. It's like that in order to support stream scheduling across different cores.
@tlauda I've found a race
while (1) {
/* sleep until next IPC or DMA */
sa_enter_idle(sof);
wait_for_interrupt(0) <<<<<< 1) LL IRQ wakes us here.
/* now process any IPC messages to host */
ipc_process_msg_queue();
/* schedule any idle tasks */
schedule();
<<<<<<< 2) IPC IRQ happens here.
}
IRQ 2 is never serviced until next LL IRQ which could timeout.
@tlauda can you point me at the code where we do the 1ms tick when there is a task, I assume this is using LL ?
@lgirdwood
@tlauda can you point me at the code where we do the 1ms tick when there is a task, I assume this is using LL ?
ll_chedule.c, line 116: ticks = queue_calc_next_timeout(queue, ll_get_timer(queue))
. This is the place, where we check the total number of tasks and start the timer if any is detected.
@tlauda I've found a race
I don't think it's a race. IPC IRQ will still be processed, since we are on passive level.
@lgirdwood @tlauda https://github.com/thesofproject/linux/pull/852 is the kernel PR.
I did quite a bit of testing last night and after I switched to the xcc compiled version of the FW, this issue happens far less frequently (after about 200K IPC's). Moreover, it is not reproduced on the Up2 but seen on the GLK chromebook.
I am going to modify the test a little bit further to add some payload to the test ipc as well apart from just having the header.
@lgirdwood @tlauda I was wrong about the UP2. With the attached script, I'm able to reproduce the timeout even on the UP2 quite quickly. test_ipc_flood.txt
@keqiaozhang could you please give it a try and see if you can reproduce it too?
Describe the bug This issue occurs during the volume change stress test, we can send IPC interrupts continuously by changing the PGA* volume.
To Reproduce The test script is attached. sof-test-volume.txt
Expected behavior No ipc errors when sending ipc interrupts continuously.
Impact IPC timed out
Environment
Screenshots or console output
Kmsg and trace dmesg.log logger.log