thesofproject / sof

Sound Open Firmware
Other
560 stars 318 forks source link

[BUG] IPC times out when changing volume repeatedly #1263

Closed keqiaozhang closed 5 years ago

keqiaozhang commented 5 years ago

Describe the bug This issue occurs during the volume change stress test, we can send IPC interrupts continuously by changing the PGA* volume.

To Reproduce The test script is attached. sof-test-volume.txt

Expected behavior No ipc errors when sending ipc interrupts continuously.

Impact IPC timed out

Environment

  1. Kernel: origin/topic/sof-dev-2de29f Firmware: origin/sof-master-00bbb1
  2. tplg name:sof-glk-da7219.tplg
  3. GLK Chromebook.
  4. Reproducibility Rate:100%

Screenshots or console output

[  483.621778] sof-audio-pci 0000:00:0e.0: ipc rx: 0x90020000: GLB_TRACE_MSG
[  483.621784] sof-audio-pci 0000:00:0e.0: ipc rx done: 0x90020000: GLB_TRACE_MSG
[  483.626384] sof-audio-pci 0000:00:0e.0: ipc tx: 0x50020000: GLB_COMP_MSG: SET_VALUE: GET_VALUE
[  483.933022] sof-audio-pci 0000:00:0e.0: error: ipc timed out for 0x50020000 size 108
[  483.933027] sof-audio-pci 0000:00:0e.0: status: fw entered - code 00000005
[  483.933065] sof-audio-pci 0000:00:0e.0: error: unexpected fault 0x00000000 trace 0x00004000
[  483.933067] sof-audio-pci 0000:00:0e.0: error: waking up any trace sleepers
[  483.933069] sof-audio-pci 0000:00:0e.0: error: set/get ctrl ipc comp 35
[  483.933083] sof-audio-pci 0000:00:0e.0: ipc tx: 0x50010000: GLB_COMP_MSG: SET_VALUE: SET_VALUE
[  484.237031] sof-audio-pci 0000:00:0e.0: error: ipc timed out for 0x50010000 size 108
[  484.237035] sof-audio-pci 0000:00:0e.0: status: fw entered - code 00000005
[  484.237072] sof-audio-pci 0000:00:0e.0: error: unexpected fault 0x00000000 trace 0x00004000
[  484.237073] sof-audio-pci 0000:00:0e.0: error: waking up any trace sleepers
[  484.237075] sof-audio-pci 0000:00:0e.0: error: set/get ctrl ipc comp 35

Kmsg and trace dmesg.log logger.log

keqiaozhang commented 5 years ago

@Jiangxinx please help to check this issue on other platforms, like UP2 byt etc.

keqiaozhang commented 5 years ago

I can also reproduce this issue on WHL, I believe this is a common issue.

emilchudzik commented 5 years ago

I have the same issue on WHL but after boot OS without making any playback, just checked dmesg: image

Environment: Platform: WHL RVP - HDA mode Kernel:  2de29fae6 [thesofproject/linux: topic/sof-dev] +d04ccf494 [RanderWang/linux: cml-validation-20190415] +bfb0729dd [RanderWang/linux: cml-validation-20190415] +8a2b13a81 [RanderWang/linux: cml-validation-20190415] +63ceb8f68 [RanderWang/linux: cml-validation-20190415] +db89510d0 [RanderWang/linux: cml-validation-20190415]

Firmware: 00bbb113d [thesofproject/sof: master] Topology: 00bbb113d [thesofproject/sof: master] -> sof-hda-generic.tplg Pulse audio enabled.

dmesg.txt logger_sof-cnl_00bbb113.txt logger.txt

mengdonglin commented 5 years ago

@dabekjakub Please help to check from firmware side. Thanks!

xiulipan commented 5 years ago

@dabekjakub @emilchudzik That is an interesting issue, as both Volume and IPC used the scheduler. I suggest to check if there is any conflict or corruption happen before schedule tasks. https://github.com/thesofproject/sof/blob/e14ab7088bf4673f43396f02490f717f69de0de9/src/audio/volume.c#L343

https://github.com/thesofproject/sof/blob/e14ab7088bf4673f43396f02490f717f69de0de9/src/ipc/handler.c#L1280-L1283

bkokoszx commented 5 years ago

@keqiaozhang @emilchudzik Could you check whether PR #1269 fixes this issue. In scheduler API refactor (#1089) task priority levels were changed to increasing order (0 - lowest; 1 - highest). I think that driver set pipeline priority level to default "0" value, so now pipeline task has lower priority level than ipc task.

keqiaozhang commented 5 years ago

@bkokoszx This issue still can be reproduced after applying PR #1269 .


[  153.062892] sof-audio-pci 0000:00:0e.0: ipc rx: 0x90020000: GLB_TRACE_MSG
[  153.062898] sof-audio-pci 0000:00:0e.0: ipc rx done: 0x90020000: GLB_TRACE_MSG
[  153.066971] sof-audio-pci 0000:00:0e.0: ipc tx: 0x50020000: GLB_COMP_MSG: SET_VALUE: GET_VALUE
[  153.373058] sof-audio-pci 0000:00:0e.0: error: ipc timed out for 0x50020000 size 108
[  153.373064] sof-audio-pci 0000:00:0e.0: status: fw entered - code 00000005
[  153.373102] sof-audio-pci 0000:00:0e.0: error: unexpected fault 0x00000000 trace 0x00004000
[  153.373104] sof-audio-pci 0000:00:0e.0: error: waking up any trace sleepers
[  153.373106] sof-audio-pci 0000:00:0e.0: error: set/get ctrl ipc comp 35
[  153.373121] sof-audio-pci 0000:00:0e.0: ipc tx: 0x50010000: GLB_COMP_MSG: SET_VALUE: SET_VALUE
[  153.677057] sof-audio-pci 0000:00:0e.0: error: ipc timed out for 0x50010000 size 108
[  153.677060] sof-audio-pci 0000:00:0e.0: status: fw entered - code 00000005
[  153.677097] sof-audio-pci 0000:00:0e.0: error: unexpected fault 0x00000000 trace 0x00004000
[  153.677098] sof-audio-pci 0000:00:0e.0: error: waking up any trace sleepers
[  153.677101] sof-audio-pci 0000:00:0e.0: error: set/get ctrl ipc comp 35
[  153.677343] sof-audio-pci 0000:00:0e.0: ipc tx: 0x50020000: GLB_COMP_MSG: SET_VALUE: GET_VALUE
emilchudzik commented 5 years ago

I have no issue with PR #1269 applied on the current sof master branch without reverting any others commits. Without this PR I had an erros even during boot. With this patch loading FW is ok, and changing volume also works good: image

Environment: Platform: WHL RVP - HDA mode Kernel: 2de29fae6 [thesofproject/linux: topic/sof-dev] +d04ccf494 [RanderWang/linux: cml-validation-20190415] +bfb0729dd [RanderWang/linux: cml-validation-20190415] +8a2b13a81 [RanderWang/linux: cml-validation-20190415] +63ceb8f68 [RanderWang/linux: cml-validation-20190415] +db89510d0 [RanderWang/linux: cml-validation-20190415]

Firmware: 1b15fad [thesofproject/sof: master] + https://github.com/thesofproject/sof/pull/1269/commits/624c18935589b4f87bc4a058959ef5d53ac26882 Topology: 1b15fad [thesofproject/sof: master] -> sof-hda-generic.tplg

Pulse audio enabled or disabled - also works.

xiulipan commented 5 years ago

@emilchudzik Please see my comments in #1269 I have some concern about scheduler priority

ranj063 commented 5 years ago

@emilchudzik i've tried #1269 with sof-master: 00bbb113d0 and it didnt help. I still see the timeouts happening with the volume stress test script.

ranj063 commented 5 years ago

@lgirdwood I wanted to point out one thing here. As crazy as it sounds, adding a trace message before processing the ipc command reduces the rate of occurence of the ipc timeout. This seems to suggest a race in the IRQ handler to me.

diff --git a/src/drivers/intel/cavs/ipc.c b/src/drivers/intel/cavs/ipc.c
index 5ca723f5..68d0ee08 100644
--- a/src/drivers/intel/cavs/ipc.c
+++ b/src/drivers/intel/cavs/ipc.c
@@ -143,6 +143,8 @@ void ipc_platform_do_cmd(struct ipc *ipc)
        struct sof_ipc_reply reply;
        int32_t err;

+       trace_ipc("in ipc_platform_do_cmd()\n");
+
        /* perform command and return any error */
        err = ipc_cmd();
ranj063 commented 5 years ago

The volume stress test passes with the below change along with PR #1269

diff --git a/src/drivers/intel/cavs/ipc.c b/src/drivers/intel/cavs/ipc.c
index 5ca723f5..f7facb82 100644
--- a/src/drivers/intel/cavs/ipc.c
+++ b/src/drivers/intel/cavs/ipc.c
@@ -109,6 +109,7 @@ static void ipc_irq_handler(void *arg)
                        _ipc->host_pending = 1;
                        ipc_schedule_process(_ipc);
                }
+               return;
        }

        /* reply message(done) from host */
@@ -143,6 +144,8 @@ void ipc_platform_do_cmd(struct ipc *ipc)
        struct sof_ipc_reply reply;
        int32_t err;

+       trace_ipc("do_cmd()\n");
+
        /* perform command and return any error */
        err = ipc_cmd();
lgirdwood commented 5 years ago

@ranj063 good find, can you confirm if the trace_ipc("do_cmd()\n"); is still needed ? The change here means we don't respond to DONE notification from the host until the next IRQ.

One other thing to check is that volume.c will be scheduling the ramp a lot and it's possible it's being scheduled more than once whilst it's still ramping (i.e. scheduled twice). We probably want to cancel it and rerun.

ranj063 commented 5 years ago

@lgirdwood I tried the volume cancel and schedule task already and.it didn't help.

The trace message is still needed. Without it, the ipc still randomly times out. Btw, I found this problem a.week ago with the fuzzer and I assumed the race was qemu related haha. Seems like it isn't.

lgirdwood commented 5 years ago

@ranj063 ok, probably worth attaching GDB via sof-fuzzer to check state. To rule out locking it may also be worth while replacing your trace with a loop (you can vary the length - make sure it's not optimized out by the compiler, use a volatile loop counter if needed).

ranj063 commented 5 years ago

@lgirdwood I will try adding to the fuzzer today but the fuzzer has the drawback of needing a 5ms between IPC's which isnt the case on the real hardware. I will update soon

ranj063 commented 5 years ago

@lgirdwood i'm able to reproduce the timeout with the fuzzer and here're the registers after the timeout. Does it tell you anything? @mmaka1 could you please also take a look? This is on BYT btw. (qemu) info registers

PC=ff2d1ebb

        LBEG=ff2d2ee2         LEND=ff2d2ef8       LCOUNT=00000000          SAR=0000000e
          BR=00000000    SCOMPARE1=00000000        ACCLO=00000000        ACCHI=00000000
         MR0=00000000          MR1=00000000          MR2=00000000          MR3=00000000
     PREFCTL=00000000  WINDOW_BASE=00000005 WINDOW_START=00000020 IBREAKENABLE=00000000
      MEMCTL=00000000      ATOMCTL=00000015     IBREAKA0=00000000     IBREAKA1=00000000
    DBREAKA0=00000000     DBREAKA1=00000000     DBREAKC0=00000000     DBREAKC1=00000000
   CONFIGID0=c2b3dbfe         EPC1=ff2d030c         EPC2=00000000         EPC3=ff2d2e9a
        EPC4=ff2d1ebb         EPC5=ff2d1ebb         EPC6=00000000         EPC7=00000000
        DEPC=00000000         EPS2=00000000         EPS3=00060720         EPS4=00060720
        EPS5=00060720         EPS6=00000000         EPS7=00000000    CONFIGID1=1c85483e
    EXCSAVE1=00000000     EXCSAVE2=ff2d2820     EXCSAVE3=ff2d28f4     EXCSAVE4=ff2d29c8
    EXCSAVE5=ff2d2a9c     EXCSAVE6=00000000     EXCSAVE7=00000000     CPENABLE=00000000
      INTSET=00000400     INTCLEAR=00000000    INTENABLE=00078f48           PS=00060720
     VECBASE=ff2c0400     EXCCAUSE=00000000   DEBUGCAUSE=00000000       CCOUNT=1f0db828
        PRID=00000000       ICOUNT=00000000  ICOUNTLEVEL=00000000     EXCVADDR=00000000
   CCOMPARE0=00000000    CCOMPARE1=00000000    CCOMPARE2=00000000 

THREADPTR=00000000 

 A00=bf2cc830  A01=ff327fa0  A02=ff303274  A03=00000047
 A04=0dead00a  A05=ff301e8c  A06=00000000  A07=00000000
 A08=00000000  A09=ff327f80  A10=ff3069cc  A11=ff301fb0
 A12=ff306800  A13=00000000  A14=00000000  A15=ff306928

AR00=ff306800 AR01=00000000 AR02=00000000 AR03=ff306928
AR04=bf2d037b AR05=ff327f60 AR06=0f86a442 AR07=00000000
AR08=00060720 AR09=00060023 AR10=00000000 AR11=00000000
AR12=bf2cb3f5 AR13=ff327f50 AR14=00000000 AR15=ff30693c
AR16=00060720 AR17=00000f00 AR18=0f86a92f AR19=0000000a
AR20=bf2cc830 AR21=ff327fa0 AR22=ff303274 AR23=00000047
AR24=0dead00a AR25=ff301e8c AR26=00000000 AR27=00000000
AR28=00000000 AR29=ff327f80 AR30=ff3069cc AR31=ff301fb0
ranj063 commented 5 years ago

And gdb bt shows: (gdb) bt

0 do_task_master_core (sof=0xff303274 )

at /home/ranjani/work/sof/sof/src/tasks/audio.c:76

1 0xff2cc830 in master_core_init (sof=)

at /home/ranjani/work/sof/sof/src/init/init.c:86

2 0xff2cc85c in main (argc=, argv=0xff301fb0)

at /home/ranjani/work/sof/sof/src/init/init.c:102
ranj063 commented 5 years ago

@tlauda could you please help take a look at the register info above too?

keqiaozhang commented 5 years ago

Today, I run the simultaneous playback/recording test on GLK-ubuntuOS with latest code base and got another type of IPC times out error: Trace:

 CORE  LEVEL      COMP_ID                TIMESTAMP            DELTA                FILE_NAME    CONTENT
    0      1          DMA          54143363.802083  54143364.000000 intel/cavs/hda-dma.c:279    hda-dmac: 5 wait for buffer full timeout
    0      1         HOST          54143370.104167         6.302083     src/audio/host.c:792    host_copy() error: dma_copy() failed, ret = 4294967234
    0      1         PIPE          54143376.041667         5.937500 src/audio/pipeline.c:681    pipeline_copy() error: ret = -62, start->comp.id = 12, dir = 1
    0      1         PIPE 2.13     54143382.968750         6.927083 src/audio/pipeline.c:789    pipeline_xrun_recover()
    0      1         COMP          54143395.468750        12.500000 rc/audio/component.c:201    comp_set_state() error: wrong state = 5, COMP_TRIGGER_PREPARE
    0      1         PIPE          54143400.937500         5.468750 src/audio/pipeline.c:365    pipeline_prepare() error: ret = -22,dev->comp.id = 8
    0      1         PIPE 2.13     54143406.770833         5.833333 src/audio/pipeline.c:796    pipeline_xrun_recover() error: pipeline_prepare() failed, ret = -22
    0      1         PIPE 2.13     54143412.083333         5.312500 src/audio/pipeline.c:865    pipeline_task(): xrun recover failed! pipeline will be stopped!

logger-t:

    0      2          DAI 2.12    279782845.781250         3.697917      src/audio/dai.c:539    dai_comp_trigger(), START
    0      1          DMA         279783073.437500       227.656250 intel/cavs/hda-dma.c:279    hda-dmac: 5 wait for buffer full timeout
    0      1         HOST         279783079.635417         6.197917     src/audio/host.c:792    host_copy() error: dma_copy() failed, ret = 4294967234
    0      1         PIPE         279783085.364583         5.729167 src/audio/pipeline.c:681    pipeline_copy() error: ret = -62, start->comp.id = 12, dir = 1
    0      1         PIPE 2.13    279783092.239583         6.875000 src/audio/pipeline.c:789    pipeline_xrun_recover()
    0      2         PIPE 2.13    279783096.979167         4.739583 src/audio/pipeline.c:358    pipeline_prepare()
    0      2         HOST         279783100.937500         3.958333     src/audio/host.c:635    host_prepare()
    0      1         COMP         279783104.739583         3.802083 rc/audio/component.c:201    comp_set_state() error: wrong state = 5, COMP_TRIGGER_PREPARE
    0      1         PIPE         279783110.260417         5.520833 src/audio/pipeline.c:365    pipeline_prepare() error: ret = -22,dev->comp.id = 8
    0      1         PIPE 2.13    279783116.250000         5.989583 src/audio/pipeline.c:796    pipeline_xrun_recover() error: pipeline_prepare() failed, ret = -22
    0      1         PIPE 2.13    279783121.510417         5.260417 src/audio/pipeline.c:865    pipeline_task(): xrun recover failed! pipeline will be stopped!

dmesg:

[  525.718085] sof-audio-pci 0000:00:0e.0: ipc tx: 0x60010000: GLB_STREAM_MSG: PCM_PARAMS
[  526.021046] sof-audio-pci 0000:00:0e.0: error: ipc timed out for 0x60010000 size 108
[  526.021052] sof-audio-pci 0000:00:0e.0: status: fw entered - code 00000005
[  526.021091] sof-audio-pci 0000:00:0e.0: error: unexpected fault 0x00000000 trace 0x00004000
[  526.021092] sof-audio-pci 0000:00:0e.0: error: waking up any trace sleepers
[  526.021094] sof-audio-pci 0000:00:0e.0: error: hw params ipc failed for stream 2
[  526.021097] sof-audio-pci 0000:00:0e.0: ASoC: 0000:00:0e.0 hw params failed: -110
[  526.021100]  Headset: ASoC: hw_params FE failed -110

dmesg.log logger.log logger-t.log

Test recipe: sof/tplg: master/50bd96d kernel: sof-dev/9086043

lgirdwood commented 5 years ago

@tlauda @ranj063 looks like if we get a spurious IPC interrupt on BYT (not checked others yet) we dont clear it. The INTSET=00000400 INTCLEAR=00000000 INTENABLE=00078f48 shows the IPC IRQ is still set.....and this can race with driver setting/clearing BUSY/DONE bits.

static void irq_handler(void *arg)
{
    uint32_t isr;
    uint32_t imrd;

    /* Interrupt arrived, check src */
    isr = shim_read(SHIM_ISRD);
    imrd = shim_read(SHIM_IMRD);

    tracev_ipc("ipc: irq isr 0x%x", isr);

    /* reply message(done) from host */
    if (isr & SHIM_ISRD_DONE &&
        !(imrd & SHIM_IMRD_DONE)) {

        /* Mask Done interrupt before return */
        shim_write(SHIM_IMRD, shim_read(SHIM_IMRD) | SHIM_IMRD_DONE);
        interrupt_clear(PLATFORM_IPC_INTERRUPT);
        do_notify();
    }

    /* new message from host */
    if (isr & SHIM_ISRD_BUSY &&
        !(imrd & SHIM_IMRD_BUSY)) {

        /* Mask Busy interrupt before return */
        shim_write(SHIM_IMRD, shim_read(SHIM_IMRD) | SHIM_IMRD_BUSY);
        interrupt_clear(PLATFORM_IPC_INTERRUPT);

        /* TODO: place message in Q and process later */
        /* It's not Q ATM, may overwrite */
        if (_ipc->host_pending) {
            trace_ipc_error("ipc: dropping msg");
            trace_ipc_error(" isr 0x%x imrd 0x%x ipcxh 0x%x",
                    isr, shim_read(SHIM_IMRD),
                    shim_read(SHIM_IPCXH));
        } else {
            _ipc->host_pending = 1;
            ipc_schedule_process(_ipc);
        }
    }
}
ranj063 commented 5 years ago

@lgirdwood how should we deal with this?

mmaka1 commented 5 years ago

@keqiaozhang Please keep GLK IPC related issues separated from BYT reports since there is different IPC HW in use. It is great to look at all of them and try to find some common root causes if there are any but most of the issues are very close to the platform specific HW (IPC HW, INT HW integration) and different piece of HW driver SOF code might be involved. [edit] I'm sorry @keqiaozhang , I missed this was BYT mixed into the thread initially created for GLK, not the other way around, so this comment should target BYT reports, not your GLK ones, in general still applies ;)

lgirdwood commented 5 years ago

@mmaka1 fwiw, I'm beginning to suspect scheduling here (since it's appearing on GLK + CML). IPC and Volume ramps (which reschedule themselves) do schedule asynchronous tasks and we may have a race as enabling a trace item seemed to help.

tlauda commented 5 years ago

@lgirdwood Above timeout reported by @keqiaozhang is waiting for buffer full timeout, so this is not typical IPC timeout. It should be put into separated issue. That particular error has two potential rootcauses:

  1. Driver doesn't program SDxFMT register after clearing PROCEN (which I believe has been already fixed).
  2. This preload loop should be executed as idle task, but if there is any other interrupt coming between trigger start IPC task and going into passive level it will be scheduled too early. It should be fixed by default with #1124.
lgirdwood commented 5 years ago

@tlauda thanks I heard from @mmaka1 you have been deep diving the HW side. :) @mmaka1 @ranj063 and I had a quick call and @ranj063 will double check FW side to rule out locking, list manipulation etc in scheduler and IRQ logic.

lgirdwood commented 5 years ago

@ranj063 this should change LL scheduler to IRQ level 1, EDF to level 4 and IPC IRQ at level 3

diff --git a/src/platform/intel/cavs/platform.c b/src/platform/intel/cavs/platform.c
index cab93797..dc1a455d 100644
--- a/src/platform/intel/cavs/platform.c
+++ b/src/platform/intel/cavs/platform.c
@@ -153,14 +153,14 @@ static const struct sof_ipc_window sram_window = {
 struct timesource_data platform_generic_queue[] = {
 {
        .timer   = {
-               .id = TIMER3, /* external timer */
-               .irq = IRQ_EXT_TSTAMP0_LVL2(0),
+               .id = TIMER1, /* external timer */
+               .irq = IRQ_NUM_TIMER1,
        },
        .clk            = CLK_SSP,
        .notifier       = NOTIFIER_ID_SSP_FREQ,
-       .timer_set      = platform_timer_set,
-       .timer_clear    = platform_timer_clear,
-       .timer_get      = platform_timer_get,
+       .timer_set      = timer_set,
+       .timer_clear    = timer_clear,
+       .timer_get      = timer_get_system,
 },
 {
        .timer   = {
lgirdwood commented 5 years ago

@mmaka1 @ranj063 @tlauda looks like our IRQ handler will miss iicoming IRQs at the same level when it's processing.

static inline void irq_lvl2_handler(void *data, int level, uint32_t ilxsd,
                    uint32_t ilxmsd, uint32_t ilxmcd)
{
    struct irq_desc *parent = (struct irq_desc *)data;
    struct irq_desc *child = NULL;
    struct list_item *clist;
    uint32_t status;
    uint32_t i = 0;
    uint32_t unmask = 0;

    /* mask the parent IRQ */
    arch_interrupt_disable_mask(1 << level);

    /* mask all child interrupts */
    status = irq_read(ilxsd);
    irq_write(ilxmsd, status);

    /* handle each child */
    while (status) { <<<<<<<<<<<<<<<<<<<<<<<<<<< status is now stale after 1st loop iteration.
        /* any IRQ for this child bit ? */
        if ((status & 0x1) == 0)
            goto next;

        /* get child if any and run handler */
        list_for_item(clist, &parent->child[i]) {
            child = container_of(clist, struct irq_desc, irq_list);

            if (child && child->handler) {
                child->handler(child->handler_arg);
                unmask = child->unmask;
            } else {
                /* nobody cared ? */
                trace_irq_error("irq_lvl2_handler() error: "
                        "nbc");
            }
        }

        /* unmask this bit i interrupt */
        if (unmask)
            irq_write(ilxmcd, 0x1 << i);

next:
        status >>= 1;
        i++;
    }

    /* clear parent and unmask */
    arch_interrupt_clear(level);
    arch_interrupt_enable_mask(1 << level);
}
ranj063 commented 5 years ago

@lgirdwood testing now. Thanks

ranj063 commented 5 years ago

@lgirdwood with your patch in https://github.com/thesofproject/sof/issues/1263#issuecomment-484941979, the sound card fails to register. So maybe this change is too aggressive?

ranj063 commented 5 years ago

@lgirdwood @mmaka1 I did a quick test that @xiulipan and I discussed. I removed the call to schedule the vol_work in volume_ctrl_set_cmd() and I hit the same IPC timeout and this time a lot sooner. So I doubt this is a problem with interrupts getting missed but rather something more basic than that.

tlauda commented 5 years ago

@ranj063 Can you try to do counting of handled volume IPCs? You could increment FW status register in memory window right before clearing BUSY bit. I would like to see if it's in fact FW lack of response or maybe host misses it.

ranj063 commented 5 years ago

I have a theory but Im not sure if it is right but let me try anyway.

We had the same volume test pass in the past and back then the host was scheduling a work for every new IPC that was sent. But recently, we've removed the work queue and we send the IPC's as soon as the DSP sends a DONE interrupt for the previous IPC. So I think that the context switching between IPC's between successive IPC's was helping. With that not happening anymore, maybe the DSP isnt able to handle the high rate of interrupts? @lgirdwood @mmaka1 is that a possibility?

ranj063 commented 5 years ago

@lgirdwood it looks like our theory might be correct.With you PR https://github.com/thesofproject/sof/pull/1290, I run into the same IPC time out when I flood the fw with test ipc;s(within 2000 iterations) error: ipc timed out for 0xb0010000 size 8 [ 91.679614] sof-audio-pci 0000:00:0e.0: status: fw entered - code 00000005 [ 91.679823] sof-audio-pci 0000:00:0e.0: error: unexpected fault 0x00000000 trace 0x00004000 [ 91.679826] sof-audio-pci 0000:00:0e.0: error: waking up any trace sleepers

Increasing the ipc task deadline to 200 (from 100) in the FW improves the test result but still fails within 10K iterations. It seems to suggest a bug in the EDF scheduler to me.

Actually after a bit of testing, I realised that the deadline doesn't seem to have much effect. The flood test just randomly fails within 10K iterations

RanderWang commented 5 years ago

@lgirdwood @ranj063 @tlauda @xiulipan @keyonjie @singalsu

Keyon found the one reason and discussed with me. I think it should be a root cause. User would send IPC cmd to FW to change volume and FW schedules a task vol_work to change the volume slowly and continuously.

A simple model for volume changing: change volume from 20 to 80. vol_work would be scheduled multiple times to change volume with a small step each time.

vol_work()
{
    step = 4;
     if  volume < target volume
           volume += step;
           return 1ms to make it rescheduled after 1ms
      else
           return 0 to stop reschedule
}

For this case, the vol_work would be scheduled 15 times to change volume from 20 to 80 and spend 15ms. So If you send too much IPCs for volume in less than 15ms, all these IPCs could not be processed and finally timeout.

RanderWang commented 5 years ago

We could refine the work flow of volume adjustment. If the last volume task is still working, just change the value of target volume, return ok and the vol_work would dynamic check the target volume value then change the volume.

lgirdwood commented 5 years ago

@RanderWang it's irrelevant how long the IPC work takes. IPC will only signal DONE when ready to accept new messages. IPC can schedule work and that work is run in a different context. Flood IPC must 10k iterations.

@ranj063 DSP HW can easily cope with high IPC rate, but in light of recent driver updates we have exposed a race in FW between IPC and scheduling. Can you craete a PR for your kernel flood test to allow the others to test too.

keyonjie commented 5 years ago

I am thinking out something improvement like this, can anybody working on this issue give a try if it take any effect:

diff --git a/src/audio/volume.c b/src/audio/volume.c
index 87d0c6a8..3ad8ee91 100644
--- a/src/audio/volume.c
+++ b/src/audio/volume.c
@@ -139,8 +139,10 @@ static uint64_t vol_work(void *data)
        /* do we need to continue ramping */
        if (again)
                return VOL_RAMP_US;
-       else
+       else {
+               cd->in_ramping = 0;
                return 0;
+       }
 }

 /**
@@ -320,7 +322,10 @@ static int volume_ctrl_set_cmd(struct comp_dev *dev,
                                                   "invalid i = %u", i);
                        }
                }
-               schedule_task(&cd->volwork, VOL_RAMP_US, 0, 0);
+               if (!cd->in_ramping) {
+                       cd->in_ramping = 1;
+                       schedule_task(&cd->volwork, VOL_RAMP_US, 0, 0);
+               }
                break;

        case SOF_CTRL_CMD_SWITCH:
diff --git a/src/audio/volume.h b/src/audio/volume.h
index e27d1efb..633095ac 100644
--- a/src/audio/volume.h
+++ b/src/audio/volume.h
@@ -110,6 +110,7 @@ struct comp_data {
        int32_t volume[SOF_IPC_MAX_CHANNELS];   /**< current volume */
        int32_t tvolume[SOF_IPC_MAX_CHANNELS];  /**< target volume */
        int32_t mvolume[SOF_IPC_MAX_CHANNELS];  /**< mute volume */
+       unt32_t in_ramping; /**< in volume stage */
        /**< volume processing function */
        void (*scale_vol)(struct comp_dev *dev, struct comp_buffer *sink,
                struct comp_buffer *source, uint32_t frames);
lgirdwood commented 5 years ago

@keyonjie this has nothing to do with volume, please read comment from @ranj063 about flood IPC.

keqiaozhang commented 5 years ago

@keyonjie I still can reproduce this issue on both GLK and WHL with your changes.

tlauda commented 5 years ago

@ranj063 @lgirdwood IPC timeout is too generic and we are circling around. We need to identify what has actually happened:

  1. FW missed the IPC?
  2. FW got the IPC but didn't respond?
  3. FW got the IPC, responded properly, driver missed the response?

Also we could try to disable ll scheduler during the test. Timer is on the same irq level as ipc handler, so maybe we still have some kind of bug/race condition there.

lgirdwood commented 5 years ago

@tlauda the flood IPC test carried out by @ranj063 sends an IPC that the FW will schedule and return 0; It does nothing. See PR https://github.com/thesofproject/sof/pull/1290. This means there is no volume or LL scheduling, it's simple a race between EDF and IPC.

tlauda commented 5 years ago

@lgirdwood Agent is scheduled on LL, so we always have tick of timer every 1 ms.

lgirdwood commented 5 years ago

@tlauda SA should tick every 750ms on APL, but it could be easily commented out. Trace will also be every 500ms or shorter.

tlauda commented 5 years ago

@lgirdwood Timer is configured to trigger every 1 ms, when there is at least one task in the queue. It's like that in order to support stream scheduling across different cores.

lgirdwood commented 5 years ago

@tlauda I've found a race

while (1) {
        /* sleep until next IPC or DMA */
        sa_enter_idle(sof);
        wait_for_interrupt(0)   <<<<<< 1) LL IRQ wakes us here.

        /* now process any IPC messages to host */
        ipc_process_msg_queue();

        /* schedule any idle tasks */
        schedule();
                                                     <<<<<<< 2) IPC IRQ happens here.
    }

IRQ 2 is never serviced until next LL IRQ which could timeout.

lgirdwood commented 5 years ago

@tlauda can you point me at the code where we do the 1ms tick when there is a task, I assume this is using LL ?

tlauda commented 5 years ago

@lgirdwood

@tlauda can you point me at the code where we do the 1ms tick when there is a task, I assume this is using LL ?

ll_chedule.c, line 116: ticks = queue_calc_next_timeout(queue, ll_get_timer(queue)). This is the place, where we check the total number of tasks and start the timer if any is detected.

@tlauda I've found a race

I don't think it's a race. IPC IRQ will still be processed, since we are on passive level.

ranj063 commented 5 years ago

@lgirdwood @tlauda https://github.com/thesofproject/linux/pull/852 is the kernel PR.

I did quite a bit of testing last night and after I switched to the xcc compiled version of the FW, this issue happens far less frequently (after about 200K IPC's). Moreover, it is not reproduced on the Up2 but seen on the GLK chromebook.

I am going to modify the test a little bit further to add some payload to the test ipc as well apart from just having the header.

ranj063 commented 5 years ago

@lgirdwood @tlauda I was wrong about the UP2. With the attached script, I'm able to reproduce the timeout even on the UP2 quite quickly. test_ipc_flood.txt

@keqiaozhang could you please give it a try and see if you can reproduce it too?