thesofproject / linux

Linux kernel source tree
Other
91 stars 133 forks source link

ASoC: SOF: core/Intel: Follow the pause_supported flag from topology to disable or keep the pause support enabled #5041

Closed ujfalusi closed 1 month ago

ujfalusi commented 5 months ago

Hi,

new token set is added by https://github.com/thesofproject/sof/pull/9202 to indicate that the pause operation is supported or not on the given PCM device.

Pause is an optional feature that depends on pipeline, topology and modules running within that pipeline. Lets have a marker to help userspace and kernel know which pipelines should advertise pause.

Use this information in hda-pcm.c to disable the pause support or keep it enabled.

ujfalusi commented 5 months ago

Yes, with this PR the PAUSE is going to be mostly disabled for Intel platforms regardless of IPC version.

plbossart commented 5 months ago

I am not aligned with the approach, sorry.

I agree that PAUSE is not a release blocker. but in practice it's the only way to trigger quick transitions and find bugs we wouldn't see otherwise.

It's also incorrect to say it's not been tested. What has not been tested is suspend while paused which is a second-order corner case.

The net effect of this topology token would be to COMPLETELY disable PAUSE, including in our CI tests. That's different to disabling it for releases, and making an educated decision to avoid spending cycles on problematic issues with no ROI.

The topology approach has zero flexibiliity, it's all or nothing.

I wouldn't mind if this topology-based approach was contingent on a 'default n' kernel Kconfig, which would only be used by developers and CI tests. I am not ok with completely disabling this PAUSE capability for tests.

ujfalusi commented 5 months ago

Typoe in title "enabeld"

Patch looks good but I wonder if this is ok to change the behaviour for existing systems -- this will go all the way back to CML/CFL systems (and not just Intel).

Right now the suspend while paused is causing the system to hard freeze. I have no idea when this regression happened, but it shows the extent to testing and use of pause/resume. Sure, this is a corner case, but I have a faint recollection that it worked. The only solution so far for the freeze is my https://github.com/thesofproject/linux/pull/5040, for both IPC3 and IPC4.

It's possible somebody is happily using SOF on a system with apps that use pause, and this kernel PR could break their use-case when they upgrade the kernel.

Yes, everything is a possibility, true.

Staged introduction (or disable_pause token in tplg) would be more incremental.

Initially I wanted to have opt-out for pause/resume but https://github.com/thesofproject/linux/issues/5035 shows that this feature has not been tested by CI and users to gave any confidence that it is working.

ujfalusi commented 5 months ago

Changes since v1:

plbossart commented 5 months ago

@fredoh9 @marc-hb @ssavati can we add this to the CI configuration before merging this:

options snd_sof_intel_hda_common force_pause_support=true
ujfalusi commented 5 months ago

Changes since v2:

marc-hb commented 5 months ago

but https://github.com/thesofproject/linux/issues/5035 shows that this feature has not been tested by CI and users to gave any confidence that it is working.

5035 is a pretty specific corner case. Simpler pause/resume cases have been regularly tested for a long time with very low failure rates.

ujfalusi commented 4 months ago

Changes since v3:

ujfalusi commented 4 months ago

Changes since v4:

marc-hb commented 4 months ago

Right now the suspend while paused is causing the system to hard freeze. I have no idea when this regression happened, but it shows the extent to testing and use of pause/resume. Sure, this is a corner case, but I have a faint recollection that it worked.

This does not "show" or prove anything: it's only a corner case that regular users are incredibly unlikely to hit - much less likely than regular "pause" which is already unused.

We actually have a test script for this corner case but it's never been working properly:

On the other hand, regular pause/release (without suspend) has been regularly tested since forever and it has been mostly passing. There were some frequent false positives caused by bugs in the test code but I fixed all that a few days ago in rewrite https://github.com/thesofproject/sof-test/pull/1218 and pause/release testing has been rock solid since.

I agree that PAUSE is not a release blocker. but in practice it's the only way to trigger quick transitions and find bugs we wouldn't see otherwise.

Indeed: since my test code rewrite in 1218, only LNL has been failing consistently. Guess what: LNL is the only platform where we ALSO have other, persistent and hard to fix failures - but with much lower reproduction rates. Coincidence? I don't think so.

It's very common (and very useful) to stress test systems past supported use cases to merely increase reproduction rates of issues also happening in regular use cases and this has what pause/release has been successfully doing for a long time - and even better now.

marc-hb commented 4 months ago

BTW what is the aplay/arecord user interface expectation after this is merged? I mean for regular users, not for CI or experts with force_pause_support=true or something.

https://sof-ci.01.org/linuxpr/PR5041/build4054/devicetest/index.html?model=GLK_BOB_DA7219-ipc3&testcase=check-pause-resume-playback-10 and others show this:

t=150 ms: aplay: (0/10) Found volume ### | xx%, recording for 178 ms
t=234 ms: aplay: (0/10) Found   === PAUSE ===  ,  pausing for 195 ms
t=404 ms: aplay: (1/10) Found volume ### | xx%, recording for 161 ms
t=404 ms: aplay: WARNING: received == PAUSE == while in state recording! Ignoring.
t=575 ms: aplay: (1/10) Found   === PAUSE ===  ,  pausing for 128 ms
t=746 ms: aplay: (2/10) Found volume ### | xx%, recording for 151 ms
t=746 ms: aplay: WARNING: received == PAUSE == while in state recording! Ignoring.
t=831 ms: aplay: (2/10) Found   === PAUSE ===  ,  pausing for 186 ms
t=1002 ms: aplay: (3/10) Found volume ### | xx%, recording for 179 ms
t=1002 ms: aplay: WARNING: received == PAUSE == while in state recording! Ignoring.
t=1258 ms: aplay: (3/10) Found   === PAUSE ===  ,  pausing for 124 ms
t=1343 ms: aplay: (4/10) Found volume ### | xx%, recording for 150 ms
t=1343 ms: aplay: WARNING: received == PAUSE == while in state recording! Ignoring.
t=1514 ms: aplay: (4/10) Found   === PAUSE ===  ,  pausing for 175 ms
t=1684 ms: aplay: (5/10) Found volume ### | xx%, recording for 111 ms
t=1684 ms: aplay: WARNING: received == PAUSE == while in state recording! Ignoring.
...

This is really weird: why does aplay/arecord display PAUSE when there is no PAUSE support?

Maybe ALSA userspace should be involved in this too?

ujfalusi commented 4 months ago

@marc-hb, something is not quite right in the expect script in case-lib/apause.exp, not sure what to be honest.

It is pretty interesting what aplay/arecord does. Pause is supported

Pause is not supported

But it certainly not printing === PAUSED ===, so the script is to blame for the false report.

ujfalusi commented 4 months ago

@marc-hb, I think I have found a plausible reason... aplay prints out the === PAUSE === unconditionally before trying to pause. If pause is not supported then it will write PAUSE command ignored (no hw support)\n which will replace the previously printed === PAUSE === expect might be catching this. https://github.com/alsa-project/alsa-utils/blob/master/aplay/aplay.c#L1645 https://github.com/alsa-project/alsa-utils/blob/master/aplay/aplay.c#L1610

So the expect script is OK.

marc-hb commented 4 months ago

In every cases it is using the same line

Yes, aplay uses \r instead of \n to constantly overwrite its previous output and not scroll. The script should be ready for that. If you think it's not please file an new sof-test bug and I'll try to fix ASAP.

Generally speaking, Expect (or anything reading from a pipe) has no concept of "overwriting" or "scrolling"; only terminals do. Other programs reading from a pipe only see a stream of characters, some of them happening to be \r and \n.

aplay prints out the === PAUSE === unconditionally before trying to pause. If pause is not supported then it will write PAUSE command ignored (no hw support) which will replace the previously printed === PAUSE === expect might be catching this.

Thanks for the investigation!

So the expect script is OK.

The script is unfortunately not OK because it trusts the first, unconditional (and TBH: wrong) === PAUSE === message and changes its state machine accordingly. And then it gets very confused.

Not sure how to fix that. Is there some sort of more convenient API to detect the lack of pause support? And then abort/skip the test immediately instead of wasting a lot of time running pointlessly and returning misleading results. I mean more convenient than this very misleading aplay output.

ujfalusi commented 4 months ago

The script is unfortunately not OK because it trusts the first, unconditional (and TBH: wrong) === PAUSE === message and changes its state machine accordingly. And then it gets very confused.

Not sure how to fix that. Is there some sort of more convenient API to detect the lack of pause support? And then abort/skip the test immediately instead of wasting a lot of time running pointlessly and returning misleading results. I mean more convenient than this very misleading aplay output.

I'm not sure about other ways to check runtime info flag of the PCM, but that would be ideal

This might avoid confusing the script (might need to handle the not supported case and abort?): https://github.com/alsa-project/alsa-utils/pull/271

marc-hb commented 4 months ago

I'm not sure about other ways to check runtime info flag of the PCM, but that would be ideal

Anyone?

This might avoid confusing the script (might need to handle the not supported case and abort?): https://github.com/alsa-project/alsa-utils/pull/271

Very useful but it won't help systems and people who don't update ALSA user space regularly.

The script is unfortunately not OK because it trusts the first, unconditional (and TBH: wrong) === PAUSE === message and changes its state machine accordingly. And then it gets very confused. Not sure how to fix that.

Wait: it shouldn't be too hard to catch PAUSE command ignored and fail or skip even after entering a "confused" state. It only needs to happen before the confusion leads to another, earlier and misleading failure...

plbossart commented 4 months ago

@ujfalusi @marc-hb I am not able to follow what the thread is about. Is our CI able to deal with this or not?

The agreement was that the pause support would be forced on all CI devices, so what is the issue? We don't really care about the case where the pause is NOT supported, because it's not really our problem is it?

What I don't get is whether all CI devices are ready for this PR to be merged, adding a kernel parameter requires some changes in the setup.

marc-hb commented 4 months ago

We don't really care about the case where the pause is NOT supported, because it's not really our problem is it?

It's not our problem in theory. In practice, we still don't have consistent device configuration.

marc-hb commented 4 months ago

Is our CI able to deal with this or not?

It will be after small sof-test fix https://github.com/thesofproject/sof-test/pull/1226 is merged (DONE). That fix will let us tell the difference between misconfigured devices versus well configured devices actually failing the test. @ujfalusi please review.

marc-hb commented 3 months ago

SOFCI TEST

ujfalusi commented 1 month ago

Changes since v3:

ssavati commented 1 month ago

@ujfalusi I have added below flag in all CI devices options snd_sof_intel_hda_common force_pause_support=true

@ssavati, I have changed the parameter from bool to int, so it should be options snd_sof_intel_hda_common force_pause_support=1

Sorry, it was a result of a popular vote: force_pause_support=-1(default) don't force pause support, use what the system woudl use force_pause_support=0 - disable the pause support force_pause_support=1 - Enable the pause support when it can be (forced DMI_L1 and Capture streams cannot have pause enabled, but this also needs module parameter to hit)

ujfalusi commented 1 month ago

SOFCI TEST

lgirdwood commented 1 month ago

SOFCI TEST