thesofproject / sof

Sound Open Firmware
Other
570 stars 322 forks source link

[BUG] [TGL] rate error > 200 by alsa_conformance_test for PCM99 capture #4617

Open johnylin76 opened 3 years ago

johnylin76 commented 3 years ago

Describe the bug The rate error > 200 for capture device hw:0,99 measured by alsa_conformance_test for firmware w/ IGO

To Reproduce Run the following command and check the rate error:

alsa_conformance_test -Chw:0,99 -c4 -fS32_LE

Reproduction Rate How often does the issue happen ? always

Expected behavior The rate error should be low (e.g. < 1 measured on other TGL board w/o IGO)

Impact The performance of audio would drop

Environment tgl-13 branch with IGO patches & IGO binary Issue still reproduced with PR #4575

Screenshots or console output ---------PRINT PARAMS--------- PCM name: hw:0,99 card: sofrt5682 [sof-rt5682] device: DMIC (*) [] stream: CAPTURE merge_threshold_t: 0.000100 merge_threshold_sz: 240 access type: MMAP_INTERLEAVED format: S32_LE channels: 4 rate: 48000 fps period time: 5333 us period size: 256 frames buffer time: 85333 us buffer size: 4096 frames ---------TIMER RESULT--------- Total_time(s) Counts Averages(s) snd_pcm_open 0.199339453 2 0.099670 snd_pcm_hw_params 0.004043824 2 0.002022 snd_pcm_hw_params_any 0.000015971 2 0.000008 snd_pcm_sw_params 0.000001639 2 0.000001 snd_pcm_prepare 0.000029138 2 0.000015 snd_pcm_start 0.000916965 2 0.000458 snd_pcm_avail 2.023577778 77075 0.000026 precision: 0.000000001 ----------RUN RESULT---------- number of recorders: 1 number of points: 153 step average: 316.394737 step min: 8 step max: 768 step median: 284 step standard deviation: 195.473385 rate: 47906.655167 rate error: 285.501755 zero channels: 0 0 0 0 number of underrun: 0 number of overrun: 0

aiChaoSONG commented 3 years ago

@johnylin76 Could you please describe the test procedures, and how the rate is calculated? it will help us to understand more about your test case.

kv2019i commented 3 years ago

FYI @keyonjie @lgirdwood I think this matces with the theory that some processing is blocked (on other cores) when IGO is running. We are getting very uneven updates for dmic capture but only with IGO image, and this is picked by alsa_conformance_test.

@aiChaoSONG This is the standard alsa-conformance test in Chrome https://chromium.googlesource.com/chromiumos/platform/audiotest/+/refs/heads/master/alsa_conformance_test/

lgirdwood commented 3 years ago

@kv2019i ack - blocking IDC is not good. @aiChaoSONG are you able to reproduce with 98e8016de02ceb8637857148a348ab4ab42bfe7e locally applied ?

aiChaoSONG commented 3 years ago

@lgirdwood I don't have a device to test, let me contact Keyon for more info.

1994lwz commented 3 years ago

This issue can be reproduce on the TGL-CHROME with IGO patches & IGO binary. Check other pipeline, and only reproduce on the hw:0,99.

mengdonglin commented 3 years ago

@1994lwz Thank you for confirmation! Can you please share the sof-log output while running alsa_conformance_test on this pipeline? The log will be used to analyze performance.

1994lwz commented 3 years ago

@1994lwz Thank you for confirmation! Can you please share the sof-log output while running alsa_conformance_test on this pipeline? The log will be used to analyze performance.

add the sof-logger and dmesg file as below. sof-logger.txt dmesg.txt

singalsu commented 3 years ago

@johnylin76 The proper way of testing sample rate accuracy is to play a sine tone e.g. 997 Hz with a calibrated signal generator and then analyze the captured audio file mathematically (e.g. RMS error minize fit an ideal sine for phase, frequency, amplitude).

Pure software test running on DUT without external equipment is very challenging due to lack of trusted time base in DUT and jitter of ALSA events. ALSA jitter can be compensated with a very long test (e.g. minute?), but the lack of absolute time base still remains an issue. Services like NTP address only very long time system clock drift.

Edit: Wanted to add that the PDM microphones clock is generated with 1/N divider (not a M/N) from a crystal oscillator that should be accurate. If there are no xruns, glitches seen in captured audio and the processing (IGO) does not "stretch" the audio signal the audio rate should not drift vs. the platform crystal frequency.

lgirdwood commented 3 years ago

@singalsu dont we have WOV buffer in the PCM99 pipeline ? Could the position readback be jittery in this case as the DMIC would underrun if the rate was too slow ??

cujomalainey commented 2 years ago

I think I might have a guess about what might be part of the issue here. Our internal scheduling period is completely disconnected from what the pipeline advertises as the period to userspace. Therefore our tools cannot make informed decisions about the actual scheduling rate because the capabilities are just not even close to what is happening inside the DSP.

lgirdwood commented 2 years ago

I think I might have a guess about what might be part of the issue here. Our internal scheduling period is completely disconnected from what the pipeline advertises as the period to userspace. Therefore our tools cannot make informed decisions about the actual scheduling rate because the capabilities are just not even close to what is happening inside the DSP.

Sounds like this information should be taken from topology for each PCM and made available to userspace ? Not sure if the ALSA PCM info APIs can do this today ? @ujfalusi I assume its easy enough to get the PCM scheduling period from topology but also assume no mechanism for userspace to get it today ?

cujomalainey commented 2 years ago

@lgirdwood both pieces are in the topology, just one is tied to the pipeline PCM which is fed to the DSP and the other is part of the PCM CAPS which is a different section entirely which is used to populate ALSA APIs.

kv2019i commented 2 years ago

@cujomalainey @lgirdwood This reminds me of an older issue where we hit grey areas of ALSA PCM API semantics with the conformance test: https://github.com/thesofproject/linux/pull/2381

We have avail/delay reporting in the ALSA PCM interface, but the static description of buffering is fairly limited (there's "fifo_size" on top of period/buffer_size but that's about it).

OTOH, not sure how much is gained by a very elaborate description of the internal buffering characteristics. With sufficiently long test run, the rate of data should still fall into the expected no matter how bursty DSP is. And adding an algorithm to the pipe should not change this. And for really accurate rate test, it needs to be done like @singalsu describes above.

cujomalainey commented 1 year ago

@kv2019i its not great for our test not being able to check what is the expected DMA period. It definitely breaks alsa_conformance rate error

lgirdwood commented 1 year ago

@cujomalainey could the tools make a guess based on some bulk reading of LPIB vs system time during stream playback ? e.g. it could make a guess based on average DMA consumption/production in step size (of period) over time @plbossart any thoughts ?

cujomalainey commented 1 year ago

@poloo5582 ping as you know more about this test and can keep me honest

@lgirdwood the test attempts to use what the PCM advertises as a target and its measuring the consumption over time to make sure its honest.

poloo5582 commented 1 year ago

This tool calls snd_pcm_avail frequently to see how fast that driver consumes samples (based on system time). It's the same as what CRAS(ChromeOS audio server) does so it can help to make sure the audio is stable on ChromeOS.

In addition, we just have done some update on the alsa_conformance_test to deal with cases that a driver consumes samples several times in a short period. I think it may help with this issue. Maybe you can try to run this test again to see whether it can be passed. If not, please attach the result with "--debug" params here. Thanks!

lgirdwood commented 1 year ago

@cujomalainey IIUC, avail() will read the HW DMA position (which should be updated by DSP DMAC within the last 1ms). @plbossart can you confirm, we should not be using jittery IPC position here.

plbossart commented 1 year ago

@cujomalainey @lgirdwood we only use DPIB registers to report the DMA position changes. But the twist is that the DMA position changes happen at the request of the firmware, and hence the granularity of DMA position changes will depend on the firmware tick. With the recent change to use 10ms, we just shot ourselves in the foot in that the position will change by steps of 10ms. I told you not to do it :-)

cujomalainey commented 1 year ago

@plbossart I fail to understand how we shot ourselves in the foot if we fix the fact that the kernel is not advertising the correct period.