[BUG][BYT-NOCODEC][BDW_WSB_RT286]Input/output error when simultaneous-playback-capture / multiple-pipeline-playback

Liviali155 commented 4 years ago

Describe the bug Input/output error when do simultaneous-playback-capture test After error occured all the pipeline can work

To Reproduce 1."sudo reboot" to reboot system 2.cd sof-test 3.cd test-case 4.export TPLG=sof-byt-nocodec.tplg 5../simultaneous-playback-capture.sh -l 100

Reproduction Rate 1 round: failed at 13/100

Expected behavior No error occured

Impact Input/output error when do simultaneous-playback-capture test of aplay(0,0) and arecord (0,0)

ubuntu@jf-byt-mb-nocodec-1:~$ aplay -l
**** List of PLAYBACK Hardware Devices ****
card 0: sofnocodec [sof-nocodec], device 0: PCM (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: sofnocodec [sof-nocodec], device 1: PCM Deep Buffer (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0

ubuntu@jf-byt-mb-nocodec-1:~$ arecord -l
**** List of CAPTURE Hardware Devices ****
card 0: sofnocodec [sof-nocodec], device 0: PCM (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0

No error dmesg,no error sof error trace

Environment Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver). Kernel: {sof-dev fa7850de} SOF: {master:318dc9f7} Name of the topology file Topology: {sof-byt-nocodec.tplg } Name of the platform(s) on which the bug is observed. Platform: {BYT MB with nocodec}

dmesg0713.log sof-logger0713.log

Liviali155 commented 4 years ago

On BSW with onboard codec MAX98090 in I2S mode also has this issue with sof-dev(9eb3d58)+master(5564a902)

Failed at 39/50

ubuntu@jf-bsw-cyn-max98090-3:~/sof-test/test-case$ aplay -l
**** List of PLAYBACK Hardware Devices ****
card 0: max98090 [sof-bytcht max98090], device 0: PCM (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: max98090 [sof-bytcht max98090], device 1: PCM Deep Buffer (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 1: Device [USB Audio Device], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

ubuntu@jf-bsw-cyn-max98090-3:~/sof-test/test-case$ arecord -l
**** List of CAPTURE Hardware Devices ****
card 0: max98090 [sof-bytcht max98090], device 0: PCM (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 1: Device [USB Audio Device], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

dmesg0714.log sof-logger0714.log

mengdonglin commented 4 years ago

@slawblauciak sof-byt-nocodec.tplg is created from "sof-cht-nocodec.m4", updated in https://github.com/thesofproject/sof/pull/3080 to remove SRC for Baytrail and CherryTrail by @plbossart

plbossart commented 4 years ago

@Liviali155 can you retry with the SOF PR https://github.com/thesofproject/sof/pull/3245

The symptoms of no dmesg error, no trace and the -EIO error seem completely aligned with my findings on those platforms.

Liviali155 commented 4 years ago

After applied #3245,issue still can be reproduced on BSW with onboard codec MAX98090 in I2S mode and BYT MB with nocodec

dmesg0729.log sof-logger0729.log

plbossart commented 4 years ago

@Liviali155 can you retry with both #3245 and https://github.com/thesofproject/sof/pull/3257

Somehow I have the feeling we have the same problem of not having enough memory resulting in some sort of underflow error. It's not clear e.g. why we have these repeated sof-logger messages in both issues

[241094039.427083] (241094032.000000) c0 buffer       3.18        src/audio/buffer.c:211  comp_update_buffer_consume(), no bytes to consume, source->comp.id = 16, source->comp.type = 5, sink->comp.id = 1, sink->comp.type = 6
[241094042.916667] (        3.489583) c0 buffer       1.5         src/audio/buffer.c:172  comp_update_buffer_produce(), no bytes to produce, source->comp.id = 1, source->comp.type = 6, sink->comp.id = 2, sink->comp.type = 5
[241095038.958333] (      996.041687) c0 buffer       3.18        src/audio/buffer.c:211  comp_update_buffer_consume(), no bytes to consume, source->comp.id = 16, source->comp.type = 5, sink->comp.id = 1, sink->comp.type = 6
[241095043.697917] (        4.739583) c0 buffer       1.5         src/audio/buffer.c:172  comp_update_buffer_produce(), no bytes to produce, source->comp.id = 1, source->comp.type = 6, sink->comp.id = 2, sink->comp.type = 5

Seems to me like the firmware gets lost with bad pointers and can't recover in both https://github.com/thesofproject/sof/issues/3170 and https://github.com/thesofproject/sof/issues/3171

@mmaka1 @lgirdwood FYI

plbossart commented 4 years ago

@Liviali155 Also wondering if 61a2c75bf8335f93d0b223c85ed2c79d406d968b (' dma: dw: fix locking and calculations in dw_dma_get_data_size') has an impact on multiple pipelines. This changes the behavior for dmic, I am trying to see if reverting it can help. Can you also try on your side.

plbossart commented 4 years ago

@Liviali155 Adding PR#3245, @3257 and reverting 61a2c75bf8335f93d0b223c85ed2c79d406d968b seems to solve the issue for 100 iterations. branch here: https://github.com/plbossart/sof/tree/fix/multi-pipelines I will launch a longer test on my side.

Liviali155 commented 4 years ago

@plbossart Used https://github.com/plbossart/sof/tree/fix/multi-pipelines(commit:d9c6405a) to test,issue still can be reproduced on byt-nocodec , failed at 287/1000, seems the reproduce rate is lower than before

sof-logger0730.log dmesg0730.log

plbossart commented 4 years ago

Ack @Liviali155, same on my side. My first run worked for 100 iteration, but on the second a failure happened on iteration 9/1000.

We need to figure this one out, something's not right with scheduling/concurrency. @lgirdwood FYI

lgirdwood commented 4 years ago

and will result in the above warnings. @plbossart can you turn on mixer debug and see if this aligns, I'm also wondering if mixers is used on the other platforms.

I don't know how to turn on mixer debug?

Mixers are used in all platforms, but here indeed the use of the mixer is different: it's the first element in a 'DAI' pipeline

lgirdwood commented 4 years ago

and will result in the above warnings. @plbossart can you turn on mixer debug and see if this aligns, I'm also wondering if mixers is used on the other platforms.

I don't know how to turn on mixer debug?

Oh, I think individual debug is still blocked on UUID PR, Maybe just be easier to change trace_dbg to trace_err in mixer_copy().

Mixers are used in all platforms, but here indeed the use of the mixer is different: it's the first element in a 'DAI' pipeline

Yep, but I no longer see a sof-byt-nocodec.m4 topology in master, so could your topology binary be stale here and potentially causing an issue ?

I do suspect that mixer is not correctly getting the state of it's sources/sinks and this results in 0 bytes to copy, but then I've no idea why does not complain about over/under runs....

lyakh commented 4 years ago

Yep, but I no longer see a sof-byt-nocodec.m4 topology in master, so could your topology binary be stale here and potentially causing an issue ?

@lgirdwood the topology is there, it's generated from the cht m4.

I do suspect that mixer is not correctly getting the state of it's sources/sinks and this results in 0 bytes to copy, but then I've no idea why does not complain about over/under runs....

I see the PGA component reporting 0 available frames when this bug occurs...

lgirdwood commented 4 years ago

I do suspect that mixer is not correctly getting the state of it's sources/sinks and this results in 0 bytes to copy, but then I've no idea why does not complain about over/under runs....

I see the PGA component reporting 0 available frames when this bug occurs...

Please let me know if you see the DMA complaining about uder/overruns. If we dont see this then the data could be getting stuck in the PGA or mixer. Btw, there may be some PCM converter between DMA and PGA that could block too.

lyakh commented 4 years ago

Please let me know if you see the DMA complaining about uder/overruns. If we dont see this then the data could be getting stuck in the PGA or mixer. Btw, there may be some PCM converter between DMA and PGA that could block too.

@lgirdwood no, I don't see any of those

lyakh commented 4 years ago

I'm wondering why the generic nocodec topology sof-cht-nocodec.m4 uses DMA schedulers?

plbossart commented 4 years ago

I'm wondering why the generic nocodec topology sof-cht-nocodec.m4 uses DMA schedulers?

all byt/cht topologies use DMA schedulers, it's not limited to the nocodec case.

lyakh commented 4 years ago

I'm curious about scheduling domains. In some cases (like in UP2 case) they seem to be freely replaceable - you can use one or another. In other cases (like BYT nocodec) only one works. What are the conditions for each domain to be applicable? And it does look like the DMA scheduling domain has got some issues.

lgirdwood commented 4 years ago

@lyakh in principle they should do the same thing, that is schedule timely pipeline work, but there are some differences in implementation, synchronisation and maybe IRQ runlevel. i.e. they are both triggered on IRQs, the timer domain schedules all work in order, whereas the DMA domain probably schedules on DMA IRQ and this may be asynchronous to other pipelines (and could block on other work finishing).

plbossart commented 4 years ago

The DMA scheduling doesn't work for HDaudio link DMAs, not interrupts are generated so you HAVE to use the timer-based scheduling for all HDAudio pipelines. For SSP and DMIC, I think the two cases are equivalent, but the timer might be more efficient since there's only one tick and you can take care of all pipelines. I have never seen any data showing that 1 ms interrupt is actually a problem though.

plbossart commented 4 years ago

And to build on this, even for Baytrail in master mode, the legacy closed-source firmware did not use DMA interrupts but also a 1ms external timer ticks, so I will assert that for the SSP using the timer or the DMA interrupt is essentially the same. I think the choice was more a case of not having to validate baytrail/cherrytrail, initially all scheduling was DMA based for early platforms and it stayed that way due to code inertia.

lyakh commented 4 years ago

@lgirdwood @plbossart thanks! I've tried blatantly replacing the DMA domain with the timer domain in the BYT nocodec topology and it isn't even loading now - DW DMA errors out with some missing configuration. Investigating. EDIT: it is loading, it's failing later when trying to configure the first pipeline.

paulstelian97 commented 4 years ago

Hm so DMA scheduling is looked at more of as a legacy that is supported rather than the recommended way?

plbossart commented 4 years ago

Hm so DMA scheduling is looked at more of as a legacy that is supported rather than the recommended way?

If the interface is slave to an external device, the DMA scheduling is required. When the interface is master and synchronous with the timer tick, switching the two is a revalidation effort but I don't see how the performance might differ on paper. But as @lyakh shows above, in practice there might be implementation issues.

Edit: to be clear, for Intel only the SSP can be slave to an external clock, the HDaudio, DMIC and SoundWire interfaces are all clock masters and the clocks are synchronous with the timers.

lyakh commented 4 years ago

To recap: I've found out that all "legacy" platforms (BYT, CHT, BDW, etc.) use DMA scheduling. An attempt to switch byt-nocodec to timer scheduling failed with firmware errors, which I since then have tried to debug and fix.

I've found the reason why this doesn't work: the firmware DW DMA driver fails to set configuration in dw_dma_set_config() with an error:

ERROR dw_dma_set_config(): dma 1 channel 1 not enough elems for config with irq disabled 1

This doesn't fail with DMA scheduling because then the .irq_disabled flag isn't set and the driver then doesn't even check how many elements the configuration specifies.

This doesn't fail on non-legacy platforms, because they don't specify CONFIG_HOST_PTABLE, then the COMP_ATTR_HOST_BUFFER host / PCM attribute isn't set, so in create_local_elems() the hd->config SG array is allocated with buffer_count (5) elements and not with 1.

I tried fixing the above problem by allocating the necessary minimum (3) of SG elements in create_local_elems() and then also by changing DW_DMA_BUFFER_PERIOD_COUNT for !CONFIG_HW_LLI case to 3 too. That fixed two instances of DMA configuration failure but then the firmware failed later anyway.

So, it looks like "legacy platforms" have multiple problems with timer-driven scheduling. It might be our best option ATM to make this a hard rule somewhere and try to fix DMA scheduling which we need anyway.

lgirdwood commented 4 years ago

I've found the reason why this doesn't work: the firmware DW DMA driver fails to set configuration in dw_dma_set_config() with an error:

ERROR dw_dma_set_config(): dma 1 channel 1 not enough elems for config with irq disabled 1

This doesn't fail with DMA scheduling because then the .irq_disabled flag isn't set and the driver

This is a rule for when DMA uses HW LLI mode (since there is a race between writing back LL descriptors and resetting them when 2 periods are used). This rule should not apply for SW LLI on BYT.

lyakh commented 4 years ago

In the BYT case the playback topology comprises of 3 pipelines: DAI+volume+mixer, and 2 PCM+volume pipelines, connecting to that mixer. Usually during the test first the playback starts, I see the scheduler scheduling the two playback pipelines one after another. Then a bit later the capture process starts too, from which point now 3 tasks are always scheduled one after another.

But sometimes when the capture starts, I see that the scheduler only tries to schedule the capture and the playback DAI tasks, and the PCM pipeline task is stuck in the RESCHEDULE state and not rescheduled any more.

plbossart commented 4 years ago

Impressive set of findings @lyakh, I am so glad we have you dig into all this!

keqiaozhang commented 3 years ago

CI observed this issue again on BSW_CYN_MAX98090 and BYT_MB_NOCODEC. http://sof-ci.sh.intel.com/#/result/planresultdetail/1441?model=BSW_CYN_MAX98090&testcase=simultaneous-playback-capture-50
http://sof-ci.sh.intel.com/#/result/planresultdetail/1441?model=BYT_MB_NOCODEC&testcase=simultaneous-playback-capture-50

plbossart commented 3 years ago

@keqiaozhang is there a way we can bisect to see when the problem re-appeared?

plbossart commented 3 years ago

@lgirdwood This issue remains visible in recent Intel daily tests, it's still a problem.

XiaoyunWu6666 commented 3 years ago

[241094039.427083] (241094032.000000) c0 buffer       3.18        src/audio/buffer.c:211  comp_update_buffer_consume(), no bytes to consume, source->comp.id = 16, source->comp.type = 5, sink->comp.id = 1, sink->comp.type = 6
[241094042.916667] (        3.489583) c0 buffer       1.5         src/audio/buffer.c:172  comp_update_buffer_produce(), no bytes to produce, source->comp.id = 1, source->comp.type = 6, sink->comp.id = 2, sink->comp.type = 5
[241095038.958333] (      996.041687) c0 buffer       3.18        src/audio/buffer.c:211  comp_update_buffer_consume(), no bytes to consume, source->comp.id = 16, source->comp.type = 5, sink->comp.id = 1, sink->comp.type = 6
[241095043.697917] (        4.739583) c0 buffer       1.5         src/audio/buffer.c:172  comp_update_buffer_produce(), no bytes to produce, source->comp.id = 1, source->comp.type = 6, sink->comp.id = 2, sink->comp.type = 5

buffer warning 'no bytes to produce' and ' no bytes to consume' don't appear these days . but we can still get ' WARN dai_copy(): nothing to copy' from dai

example : inner dailytest 4705;model=BDW_WSB_RT286;testcase=multiple-pipeline-playback-50

keyonjie commented 2 years ago

We are still seeing this in recent daily report.

XiaoyunWu6666 commented 2 years ago

in inner daily 9751 and 9715, when check-playback/check-capture on BDW_WSR_RT286 , IO error happen in the first play

lgirdwood commented 2 years ago

@XiaoyunWu6666 I'm suspicious we have over budget MCPS given that both are HiFi2 and will use the generic C processing with the frag API. @singalsu fyi - lets retest this again after all the frag APIs users have been fixed.

plbossart commented 2 years ago

FWIW we seem to have an interrupt issue on Broadwell https://github.com/thesofproject/linux/issues/3400

marc-hb commented 2 years ago

Still happening in daily 10146?model=BDW_WSB_RT286&testcase=multiple-pipeline-capture-50

Start Time: 2022-02-11 22:27:26 UTC Kernel Branch: topic/sof-dev Kernel Commit: 98119478 SOF Branch: main SOF Commit: b8954754f055

mengdonglin commented 2 years ago

As Broadwell (BDW) is a very old platform, we lower its priority and will not fix issues with multi-pipeline test cases on BDW.

marc-hb commented 2 years ago

The test still failed yesterday in 10402?model=BDW_WSB_RT286&testcase=multiple-pipeline-capture-50, let's close this when https://github.com/thesofproject/sof-test/pull/863 is merged so we stop testing this every day like someone is assigned to it.

We had only 6 distinct failures in 10402 and this was one of them.

XiaoyunWu6666 commented 2 years ago

@marc-hb I think we can close this again since https://github.com/intel-innersource/drivers.audio.ci.sof-framework/pull/185 got merged and also see current test 10444 [console]

test case multiple-pipeline-capture-50.sh is SKIP!
Catch ignore field of test-case: won't fix https://github.com/thesofproject/sof/issues/3170!

thesofproject / sof

[BUG][BYT-NOCODEC][BDW_WSB_RT286]Input/output error when simultaneous-playback-capture / multiple-pipeline-playback #3170