Closed ujfalusi closed 1 month ago
Yes, with this PR the PAUSE is going to be mostly disabled for Intel platforms regardless of IPC version.
I am not aligned with the approach, sorry.
I agree that PAUSE is not a release blocker. but in practice it's the only way to trigger quick transitions and find bugs we wouldn't see otherwise.
It's also incorrect to say it's not been tested. What has not been tested is suspend while paused which is a second-order corner case.
The net effect of this topology token would be to COMPLETELY disable PAUSE, including in our CI tests. That's different to disabling it for releases, and making an educated decision to avoid spending cycles on problematic issues with no ROI.
The topology approach has zero flexibiliity, it's all or nothing.
I wouldn't mind if this topology-based approach was contingent on a 'default n' kernel Kconfig, which would only be used by developers and CI tests. I am not ok with completely disabling this PAUSE capability for tests.
Typoe in title "enabeld"
Patch looks good but I wonder if this is ok to change the behaviour for existing systems -- this will go all the way back to CML/CFL systems (and not just Intel).
Right now the suspend while paused is causing the system to hard freeze. I have no idea when this regression happened, but it shows the extent to testing and use of pause/resume. Sure, this is a corner case, but I have a faint recollection that it worked. The only solution so far for the freeze is my https://github.com/thesofproject/linux/pull/5040, for both IPC3 and IPC4.
It's possible somebody is happily using SOF on a system with apps that use pause, and this kernel PR could break their use-case when they upgrade the kernel.
Yes, everything is a possibility, true.
Staged introduction (or disable_pause token in tplg) would be more incremental.
Initially I wanted to have opt-out for pause/resume but https://github.com/thesofproject/linux/issues/5035 shows that this feature has not been tested by CI and users to gave any confidence that it is working.
Changes since v1:
force_pause_support
module parameter (to hda-pcm.c) to ignore the pasue_supported flag from topology and to keep the pause enabled. Iow, if the force_pause_support
is set to true
then there is no change in behavior @fredoh9 @marc-hb @ssavati can we add this to the CI configuration before merging this:
options snd_sof_intel_hda_common force_pause_support=true
Changes since v2:
hda_
prefix for the module parameter name (no idea how it keeps coming back - removed like four times while playing with it...)but https://github.com/thesofproject/linux/issues/5035 shows that this feature has not been tested by CI and users to gave any confidence that it is working.
Changes since v3:
Changes since v4:
Right now the suspend while paused is causing the system to hard freeze. I have no idea when this regression happened, but it shows the extent to testing and use of pause/resume. Sure, this is a corner case, but I have a faint recollection that it worked.
This does not "show" or prove anything: it's only a corner case that regular users are incredibly unlikely to hit - much less likely than regular "pause" which is already unused.
We actually have a test script for this corner case but it's never been working properly:
On the other hand, regular pause/release (without suspend) has been regularly tested since forever and it has been mostly passing. There were some frequent false positives caused by bugs in the test code but I fixed all that a few days ago in rewrite https://github.com/thesofproject/sof-test/pull/1218 and pause/release testing has been rock solid since.
I agree that PAUSE is not a release blocker. but in practice it's the only way to trigger quick transitions and find bugs we wouldn't see otherwise.
Indeed: since my test code rewrite in 1218, only LNL has been failing consistently. Guess what: LNL is the only platform where we ALSO have other, persistent and hard to fix failures - but with much lower reproduction rates. Coincidence? I don't think so.
It's very common (and very useful) to stress test systems past supported use cases to merely increase reproduction rates of issues also happening in regular use cases and this has what pause/release has been successfully doing for a long time - and even better now.
BTW what is the aplay/arecord
user interface expectation after this is merged? I mean for regular users, not for CI or experts with force_pause_support=true
or something.
https://sof-ci.01.org/linuxpr/PR5041/build4054/devicetest/index.html?model=GLK_BOB_DA7219-ipc3&testcase=check-pause-resume-playback-10 and others show this:
t=150 ms: aplay: (0/10) Found volume ### | xx%, recording for 178 ms
t=234 ms: aplay: (0/10) Found === PAUSE === , pausing for 195 ms
t=404 ms: aplay: (1/10) Found volume ### | xx%, recording for 161 ms
t=404 ms: aplay: WARNING: received == PAUSE == while in state recording! Ignoring.
t=575 ms: aplay: (1/10) Found === PAUSE === , pausing for 128 ms
t=746 ms: aplay: (2/10) Found volume ### | xx%, recording for 151 ms
t=746 ms: aplay: WARNING: received == PAUSE == while in state recording! Ignoring.
t=831 ms: aplay: (2/10) Found === PAUSE === , pausing for 186 ms
t=1002 ms: aplay: (3/10) Found volume ### | xx%, recording for 179 ms
t=1002 ms: aplay: WARNING: received == PAUSE == while in state recording! Ignoring.
t=1258 ms: aplay: (3/10) Found === PAUSE === , pausing for 124 ms
t=1343 ms: aplay: (4/10) Found volume ### | xx%, recording for 150 ms
t=1343 ms: aplay: WARNING: received == PAUSE == while in state recording! Ignoring.
t=1514 ms: aplay: (4/10) Found === PAUSE === , pausing for 175 ms
t=1684 ms: aplay: (5/10) Found volume ### | xx%, recording for 111 ms
t=1684 ms: aplay: WARNING: received == PAUSE == while in state recording! Ignoring.
...
This is really weird: why does aplay/arecord
display PAUSE when there is no PAUSE support?
Maybe ALSA userspace should be involved in this too?
@marc-hb, something is not quite right in the expect script in case-lib/apause.exp, not sure what to be honest.
It is pretty interesting what aplay/arecord does. Pause is supported
SPACE
it will print === PAUSED ===
with the cursor moved to column 71:
After releasing the pause the === PAUSED ===
remain, but the cursor appears to be at the start of the line:
=== PAUSED ===
replaces the volume barPause is not supported
SPACE
will result a print and a new line (PAUSE command ignored (no hw support)
)
But it certainly not printing === PAUSED ===
, so the script is to blame for the false report.
@marc-hb, I think I have found a plausible reason...
aplay prints out the === PAUSE ===
unconditionally before trying to pause. If pause is not supported then it will write PAUSE command ignored (no hw support)\n
which will replace the previously printed === PAUSE ===
expect might be catching this.
https://github.com/alsa-project/alsa-utils/blob/master/aplay/aplay.c#L1645
https://github.com/alsa-project/alsa-utils/blob/master/aplay/aplay.c#L1610
So the expect script is OK.
In every cases it is using the same line
Yes, aplay uses \r
instead of \n
to constantly overwrite its previous output and not scroll. The script should be ready for that. If you think it's not please file an new sof-test bug and I'll try to fix ASAP.
Generally speaking, Expect (or anything reading from a pipe) has no concept of "overwriting" or "scrolling"; only terminals do. Other programs reading from a pipe only see a stream of characters, some of them happening to be \r
and \n
.
aplay prints out the
=== PAUSE ===
unconditionally before trying to pause. If pause is not supported then it will writePAUSE command ignored (no hw support)
which will replace the previously printed === PAUSE === expect might be catching this.
Thanks for the investigation!
So the expect script is OK.
The script is unfortunately not OK because it trusts the first, unconditional (and TBH: wrong) === PAUSE ===
message and changes its state machine accordingly. And then it gets very confused.
Not sure how to fix that. Is there some sort of more convenient API to detect the lack of pause support? And then abort/skip the test immediately instead of wasting a lot of time running pointlessly and returning misleading results. I mean more convenient than this very misleading aplay output.
The script is unfortunately not OK because it trusts the first, unconditional (and TBH: wrong)
=== PAUSE ===
message and changes its state machine accordingly. And then it gets very confused.Not sure how to fix that. Is there some sort of more convenient API to detect the lack of pause support? And then abort/skip the test immediately instead of wasting a lot of time running pointlessly and returning misleading results. I mean more convenient than this very misleading aplay output.
I'm not sure about other ways to check runtime info flag of the PCM, but that would be ideal
This might avoid confusing the script (might need to handle the not supported case and abort?): https://github.com/alsa-project/alsa-utils/pull/271
I'm not sure about other ways to check runtime info flag of the PCM, but that would be ideal
Anyone?
This might avoid confusing the script (might need to handle the not supported case and abort?): https://github.com/alsa-project/alsa-utils/pull/271
Very useful but it won't help systems and people who don't update ALSA user space regularly.
The script is unfortunately not OK because it trusts the first, unconditional (and TBH: wrong) === PAUSE === message and changes its state machine accordingly. And then it gets very confused. Not sure how to fix that.
Wait: it shouldn't be too hard to catch PAUSE command ignored
and fail or skip even after entering a "confused" state. It only needs to happen before the confusion leads to another, earlier and misleading failure...
@ujfalusi @marc-hb I am not able to follow what the thread is about. Is our CI able to deal with this or not?
The agreement was that the pause support would be forced on all CI devices, so what is the issue? We don't really care about the case where the pause is NOT supported, because it's not really our problem is it?
What I don't get is whether all CI devices are ready for this PR to be merged, adding a kernel parameter requires some changes in the setup.
We don't really care about the case where the pause is NOT supported, because it's not really our problem is it?
It's not our problem in theory. In practice, we still don't have consistent device configuration.
Is our CI able to deal with this or not?
It will be after small sof-test fix https://github.com/thesofproject/sof-test/pull/1226 is merged (DONE). That fix will let us tell the difference between misconfigured devices versus well configured devices actually failing the test. @ujfalusi please review.
SOFCI TEST
Changes since v3:
@ujfalusi I have added below flag in all CI devices
options snd_sof_intel_hda_common force_pause_support=true
@ssavati, I have changed the parameter from bool to int, so it should be
options snd_sof_intel_hda_common force_pause_support=1
Sorry, it was a result of a popular vote:
force_pause_support=-1
(default) don't force pause support, use what the system woudl use
force_pause_support=0
- disable the pause support
force_pause_support=1
- Enable the pause support when it can be (forced DMI_L1 and Capture streams cannot have pause enabled, but this also needs module parameter to hit)
SOFCI TEST
SOFCI TEST
Hi,
new token set is added by https://github.com/thesofproject/sof/pull/9202 to indicate that the pause operation is supported or not on the given PCM device.
Pause is an optional feature that depends on pipeline, topology and modules running within that pipeline. Lets have a marker to help userspace and kernel know which pipelines should advertise pause.
Use this information in hda-pcm.c to disable the pause support or keep it enabled.