Closed Liviali155 closed 2 years ago
On BSW with onboard codec MAX98090 in I2S mode also has this issue with sof-dev(9eb3d58)+master(5564a902)
Failed at 39/50
ubuntu@jf-bsw-cyn-max98090-3:~/sof-test/test-case$ aplay -l
**** List of PLAYBACK Hardware Devices ****
card 0: max98090 [sof-bytcht max98090], device 0: PCM (*) []
Subdevices: 1/1
Subdevice #0: subdevice #0
card 0: max98090 [sof-bytcht max98090], device 1: PCM Deep Buffer (*) []
Subdevices: 1/1
Subdevice #0: subdevice #0
card 1: Device [USB Audio Device], device 0: USB Audio [USB Audio]
Subdevices: 1/1
Subdevice #0: subdevice #0
ubuntu@jf-bsw-cyn-max98090-3:~/sof-test/test-case$ arecord -l
**** List of CAPTURE Hardware Devices ****
card 0: max98090 [sof-bytcht max98090], device 0: PCM (*) []
Subdevices: 1/1
Subdevice #0: subdevice #0
card 1: Device [USB Audio Device], device 0: USB Audio [USB Audio]
Subdevices: 1/1
Subdevice #0: subdevice #0
@slawblauciak sof-byt-nocodec.tplg is created from "sof-cht-nocodec.m4", updated in https://github.com/thesofproject/sof/pull/3080 to remove SRC for Baytrail and CherryTrail by @plbossart
@Liviali155 can you retry with the SOF PR https://github.com/thesofproject/sof/pull/3245
The symptoms of no dmesg error, no trace and the -EIO error seem completely aligned with my findings on those platforms.
After applied #3245,issue still can be reproduced on BSW with onboard codec MAX98090 in I2S mode and BYT MB with nocodec
@Liviali155 can you retry with both #3245 and https://github.com/thesofproject/sof/pull/3257
Somehow I have the feeling we have the same problem of not having enough memory resulting in some sort of underflow error. It's not clear e.g. why we have these repeated sof-logger messages in both issues
[241094039.427083] (241094032.000000) c0 buffer 3.18 src/audio/buffer.c:211 comp_update_buffer_consume(), no bytes to consume, source->comp.id = 16, source->comp.type = 5, sink->comp.id = 1, sink->comp.type = 6
[241094042.916667] ( 3.489583) c0 buffer 1.5 src/audio/buffer.c:172 comp_update_buffer_produce(), no bytes to produce, source->comp.id = 1, source->comp.type = 6, sink->comp.id = 2, sink->comp.type = 5
[241095038.958333] ( 996.041687) c0 buffer 3.18 src/audio/buffer.c:211 comp_update_buffer_consume(), no bytes to consume, source->comp.id = 16, source->comp.type = 5, sink->comp.id = 1, sink->comp.type = 6
[241095043.697917] ( 4.739583) c0 buffer 1.5 src/audio/buffer.c:172 comp_update_buffer_produce(), no bytes to produce, source->comp.id = 1, source->comp.type = 6, sink->comp.id = 2, sink->comp.type = 5
Seems to me like the firmware gets lost with bad pointers and can't recover in both https://github.com/thesofproject/sof/issues/3170 and https://github.com/thesofproject/sof/issues/3171
@mmaka1 @lgirdwood FYI
@Liviali155 Also wondering if 61a2c75bf8335f93d0b223c85ed2c79d406d968b (' dma: dw: fix locking and calculations in dw_dma_get_data_size') has an impact on multiple pipelines. This changes the behavior for dmic, I am trying to see if reverting it can help. Can you also try on your side.
@Liviali155 Adding PR#3245, @3257 and reverting 61a2c75bf8335f93d0b223c85ed2c79d406d968b seems to solve the issue for 100 iterations. branch here: https://github.com/plbossart/sof/tree/fix/multi-pipelines I will launch a longer test on my side.
@plbossart Used https://github.com/plbossart/sof/tree/fix/multi-pipelines(commit:d9c6405a) to test,issue still can be reproduced on byt-nocodec , failed at 287/1000, seems the reproduce rate is lower than before
Ack @Liviali155, same on my side. My first run worked for 100 iteration, but on the second a failure happened on iteration 9/1000.
We need to figure this one out, something's not right with scheduling/concurrency. @lgirdwood FYI
and will result in the above warnings. @plbossart can you turn on mixer debug and see if this aligns, I'm also wondering if mixers is used on the other platforms.
I don't know how to turn on mixer debug?
Mixers are used in all platforms, but here indeed the use of the mixer is different: it's the first element in a 'DAI' pipeline
and will result in the above warnings. @plbossart can you turn on mixer debug and see if this aligns, I'm also wondering if mixers is used on the other platforms.
I don't know how to turn on mixer debug?
Oh, I think individual debug is still blocked on UUID PR, Maybe just be easier to change trace_dbg to trace_err in mixer_copy().
Mixers are used in all platforms, but here indeed the use of the mixer is different: it's the first element in a 'DAI' pipeline
Yep, but I no longer see a sof-byt-nocodec.m4 topology in master, so could your topology binary be stale here and potentially causing an issue ?
I do suspect that mixer is not correctly getting the state of it's sources/sinks and this results in 0 bytes to copy, but then I've no idea why does not complain about over/under runs....
Yep, but I no longer see a sof-byt-nocodec.m4 topology in master, so could your topology binary be stale here and potentially causing an issue ?
@lgirdwood the topology is there, it's generated from the cht m4.
I do suspect that mixer is not correctly getting the state of it's sources/sinks and this results in 0 bytes to copy, but then I've no idea why does not complain about over/under runs....
I see the PGA component reporting 0 available frames when this bug occurs...
I do suspect that mixer is not correctly getting the state of it's sources/sinks and this results in 0 bytes to copy, but then I've no idea why does not complain about over/under runs....
I see the PGA component reporting 0 available frames when this bug occurs...
Please let me know if you see the DMA complaining about uder/overruns. If we dont see this then the data could be getting stuck in the PGA or mixer. Btw, there may be some PCM converter between DMA and PGA that could block too.
Please let me know if you see the DMA complaining about uder/overruns. If we dont see this then the data could be getting stuck in the PGA or mixer. Btw, there may be some PCM converter between DMA and PGA that could block too.
@lgirdwood no, I don't see any of those
I'm wondering why the generic nocodec topology sof-cht-nocodec.m4 uses DMA schedulers?
I'm wondering why the generic nocodec topology sof-cht-nocodec.m4 uses DMA schedulers?
all byt/cht topologies use DMA schedulers, it's not limited to the nocodec case.
I'm curious about scheduling domains. In some cases (like in UP2 case) they seem to be freely replaceable - you can use one or another. In other cases (like BYT nocodec) only one works. What are the conditions for each domain to be applicable? And it does look like the DMA scheduling domain has got some issues.
@lyakh in principle they should do the same thing, that is schedule timely pipeline work, but there are some differences in implementation, synchronisation and maybe IRQ runlevel. i.e. they are both triggered on IRQs, the timer domain schedules all work in order, whereas the DMA domain probably schedules on DMA IRQ and this may be asynchronous to other pipelines (and could block on other work finishing).
The DMA scheduling doesn't work for HDaudio link DMAs, not interrupts are generated so you HAVE to use the timer-based scheduling for all HDAudio pipelines. For SSP and DMIC, I think the two cases are equivalent, but the timer might be more efficient since there's only one tick and you can take care of all pipelines. I have never seen any data showing that 1 ms interrupt is actually a problem though.
And to build on this, even for Baytrail in master mode, the legacy closed-source firmware did not use DMA interrupts but also a 1ms external timer ticks, so I will assert that for the SSP using the timer or the DMA interrupt is essentially the same. I think the choice was more a case of not having to validate baytrail/cherrytrail, initially all scheduling was DMA based for early platforms and it stayed that way due to code inertia.
@lgirdwood @plbossart thanks! I've tried blatantly replacing the DMA domain with the timer domain in the BYT nocodec topology and it isn't even loading now - DW DMA errors out with some missing configuration. Investigating. EDIT: it is loading, it's failing later when trying to configure the first pipeline.
Hm so DMA scheduling is looked at more of as a legacy that is supported rather than the recommended way?
Hm so DMA scheduling is looked at more of as a legacy that is supported rather than the recommended way?
If the interface is slave to an external device, the DMA scheduling is required. When the interface is master and synchronous with the timer tick, switching the two is a revalidation effort but I don't see how the performance might differ on paper. But as @lyakh shows above, in practice there might be implementation issues.
Edit: to be clear, for Intel only the SSP can be slave to an external clock, the HDaudio, DMIC and SoundWire interfaces are all clock masters and the clocks are synchronous with the timers.
To recap: I've found out that all "legacy" platforms (BYT, CHT, BDW, etc.) use DMA scheduling. An attempt to switch byt-nocodec to timer scheduling failed with firmware errors, which I since then have tried to debug and fix.
I've found the reason why this doesn't work: the firmware DW DMA driver fails to set configuration in dw_dma_set_config()
with an error:
ERROR dw_dma_set_config(): dma 1 channel 1 not enough elems for config with irq disabled 1
This doesn't fail with DMA scheduling because then the .irq_disabled
flag isn't set and the driver then doesn't even check how many elements the configuration specifies.
This doesn't fail on non-legacy platforms, because they don't specify CONFIG_HOST_PTABLE
, then the COMP_ATTR_HOST_BUFFER
host / PCM attribute isn't set, so in create_local_elems()
the hd->config
SG array is allocated with buffer_count
(5) elements and not with 1.
I tried fixing the above problem by allocating the necessary minimum (3) of SG elements in create_local_elems()
and then also by changing DW_DMA_BUFFER_PERIOD_COUNT
for !CONFIG_HW_LLI
case to 3 too. That fixed two instances of DMA configuration failure but then the firmware failed later anyway.
So, it looks like "legacy platforms" have multiple problems with timer-driven scheduling. It might be our best option ATM to make this a hard rule somewhere and try to fix DMA scheduling which we need anyway.
I've found the reason why this doesn't work: the firmware DW DMA driver fails to set configuration in
dw_dma_set_config()
with an error:ERROR dw_dma_set_config(): dma 1 channel 1 not enough elems for config with irq disabled 1
This doesn't fail with DMA scheduling because then the
.irq_disabled
flag isn't set and the driver
This is a rule for when DMA uses HW LLI mode (since there is a race between writing back LL descriptors and resetting them when 2 periods are used). This rule should not apply for SW LLI on BYT.
In the BYT case the playback topology comprises of 3 pipelines: DAI+volume+mixer, and 2 PCM+volume pipelines, connecting to that mixer. Usually during the test first the playback starts, I see the scheduler scheduling the two playback pipelines one after another. Then a bit later the capture process starts too, from which point now 3 tasks are always scheduled one after another.
But sometimes when the capture starts, I see that the scheduler only tries to schedule the capture and the playback DAI tasks, and the PCM pipeline task is stuck in the RESCHEDULE state and not rescheduled any more.
Impressive set of findings @lyakh, I am so glad we have you dig into all this!
CI observed this issue again on BSW_CYN_MAX98090 and BYT_MB_NOCODEC.
http://sof-ci.sh.intel.com/#/result/planresultdetail/1441?model=BSW_CYN_MAX98090&testcase=simultaneous-playback-capture-50
http://sof-ci.sh.intel.com/#/result/planresultdetail/1441?model=BYT_MB_NOCODEC&testcase=simultaneous-playback-capture-50
@keqiaozhang is there a way we can bisect to see when the problem re-appeared?
@lgirdwood This issue remains visible in recent Intel daily tests, it's still a problem.
[241094039.427083] (241094032.000000) c0 buffer 3.18 src/audio/buffer.c:211 comp_update_buffer_consume(), no bytes to consume, source->comp.id = 16, source->comp.type = 5, sink->comp.id = 1, sink->comp.type = 6 [241094042.916667] ( 3.489583) c0 buffer 1.5 src/audio/buffer.c:172 comp_update_buffer_produce(), no bytes to produce, source->comp.id = 1, source->comp.type = 6, sink->comp.id = 2, sink->comp.type = 5 [241095038.958333] ( 996.041687) c0 buffer 3.18 src/audio/buffer.c:211 comp_update_buffer_consume(), no bytes to consume, source->comp.id = 16, source->comp.type = 5, sink->comp.id = 1, sink->comp.type = 6 [241095043.697917] ( 4.739583) c0 buffer 1.5 src/audio/buffer.c:172 comp_update_buffer_produce(), no bytes to produce, source->comp.id = 1, source->comp.type = 6, sink->comp.id = 2, sink->comp.type = 5
buffer warning 'no bytes to produce' and ' no bytes to consume' don't appear these days . but we can still get ' WARN dai_copy(): nothing to copy' from dai
example : inner dailytest 4705;model=BDW_WSB_RT286;testcase=multiple-pipeline-playback-50
We are still seeing this in recent daily report.
in inner daily 9751 and 9715, when check-playback/check-capture on BDW_WSR_RT286 , IO error happen in the first play
@XiaoyunWu6666 I'm suspicious we have over budget MCPS given that both are HiFi2 and will use the generic C processing with the frag API. @singalsu fyi - lets retest this again after all the frag APIs users have been fixed.
FWIW we seem to have an interrupt issue on Broadwell https://github.com/thesofproject/linux/issues/3400
Still happening in daily 10146?model=BDW_WSB_RT286&testcase=multiple-pipeline-capture-50
Start Time: 2022-02-11 22:27:26 UTC Kernel Branch: topic/sof-dev Kernel Commit: 98119478 SOF Branch: main SOF Commit: b8954754f055
As Broadwell (BDW) is a very old platform, we lower its priority and will not fix issues with multi-pipeline test cases on BDW.
The test still failed yesterday in 10402?model=BDW_WSB_RT286&testcase=multiple-pipeline-capture-50, let's close this when https://github.com/thesofproject/sof-test/pull/863 is merged so we stop testing this every day like someone is assigned to it.
We had only 6 distinct failures in 10402 and this was one of them.
@marc-hb I think we can close this again since https://github.com/intel-innersource/drivers.audio.ci.sof-framework/pull/185 got merged and also see current test 10444 [console]
test case multiple-pipeline-capture-50.sh is SKIP!
Catch ignore field of test-case: won't fix https://github.com/thesofproject/sof/issues/3170!
Describe the bug Input/output error when do simultaneous-playback-capture test After error occured all the pipeline can work
To Reproduce 1."sudo reboot" to reboot system 2.cd sof-test 3.cd test-case 4.export TPLG=sof-byt-nocodec.tplg 5../simultaneous-playback-capture.sh -l 100
Reproduction Rate 1 round: failed at 13/100
Expected behavior No error occured
Impact Input/output error when do simultaneous-playback-capture test of aplay(0,0) and arecord (0,0)
No error dmesg,no error sof error trace
Environment Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver). Kernel: {sof-dev fa7850de} SOF: {master:318dc9f7} Name of the topology file Topology: {sof-byt-nocodec.tplg } Name of the platform(s) on which the bug is observed. Platform: {BYT MB with nocodec}
dmesg0713.log sof-logger0713.log