thesofproject / linux

Linux kernel source tree
Other
91 stars 131 forks source link

6.10 kernel: i915 driver crash in system suspend due to wakeref from SOF code #5120

Open rfvirgil opened 2 months ago

rfvirgil commented 2 months ago

6.10 kernel UpXtreme TGL + CDB35L56-EIGHT-C (SoundWire) Play audio, then system-suspend while audio is still playing.

We're seeing that with the 6.10 kernel there is a WARN triggered in the i915 driver, intel_runtime_pm_driver_release() because there is a wakeref that has not been released. This happens if audio was playing when the system entered system-suspend. For example:

play audio.wav &
sudo rtcwake -m mem -s 5

I enabled the debugging in the i915 driver so that wakeref requests are tracked, and printed out the references when the WARN is triggered, and I get the reftracking info below (full dmesg attached dmesg.txt)

So, snd_sof has taken a wakeref on the i915 driver during probe, but has not released it when the system went to system-suspend. The i915 driver expects that all wakerefs have been released before system-suspending.

[   72.906688] PM: suspend entry (deep)
[   72.910341] Filesystems sync: 0.003 seconds
[   72.913001] Freezing user space processes
[   72.915545] Freezing user space processes completed (elapsed 0.002 seconds)
[   72.915613] OOM killer disabled.
[   72.915617] Freezing remaining freezable tasks
[   72.917124] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[   72.917333] printk: Suspending console(s) (use no_console_suspend to debug)
[   73.255773] cirrus-reset-owner 1013BAAD:00: Suspend
[   73.546490] ref_tracker: 0000:00:02.0@0000000000296b48 has 1/1 users at
                    __intel_runtime_pm_get+0x72/0xf0 [i915]
                    intel_display_power_get+0x27/0x70 [i915]
                    i915_audio_component_get_power+0x20/0x190 [i915]
                    snd_hdac_display_power+0x85/0x130 [snd_hda_core]
                    hda_codec_i915_init+0x3a/0x50 [snd_sof_intel_hda]
                    hda_dsp_probe_early+0x167/0x240 [snd_sof_intel_hda_generic]
                    snd_sof_device_probe+0x24b/0x370 [snd_sof]
                    sof_pci_probe+0x12f/0x200 [snd_sof_pci]
                    local_pci_probe+0x42/0xa0
                    pci_device_probe+0xc7/0x240
                    really_probe+0xd3/0x380
                    __driver_probe_device+0x78/0x150
                    driver_probe_device+0x1f/0x90
                    __driver_attach+0xd6/0x1d0
                    bus_for_each_dev+0x8c/0xe0
                    bus_add_driver+0x115/0x240

[   73.546596] ------------[ cut here ]------------
[   73.546599] i915 0000:00:02.0: i915 raw-wakerefs=1 wakelocks=1 on cleanup
[   73.546648] WARNING: CPU: 0 PID: 573 at drivers/gpu/drm/i915/intel_runtime_pm.c:447 intel_runtime_pm_driver_release+0x66/0x80 [i915]
rfvirgil commented 2 months ago

@plbossart FYI

plbossart commented 2 months ago

@rfvirgil can you clarify which endpoint was used? I would guess it's one of the HDMI endpoints but the description refers to the CS board.

I am afraid I don't fully get the i915 flows, clearly we don't do anything suspending, hda_codec_i915_display_power() is only called in the resume sequence.

this will most likely have to wait for @kv2019i to look at this...

plbossart commented 2 months ago

@ujfalusi FYI since you are in the middle of HDMI debugging.

rfvirgil commented 2 months ago

@plbossart What do you mean by "endpoint" ? We're playing audio to the CS35L56 on the EIGHT-C board. HDMI is connected to a monitor for console display, but we aren't using it for audio. Also, we aren't using graphics. The UpX doesn't have X or Wayland installed.

As a workaround I've set snd_hda_core.gpu_bind=0

I did start looking at the way the wakeref was implemented in i915 and snd-hda, but it was too complex for me to understand...

I've found hdac_hdmi_runtime_suspend(), which seems to be releasing the wakeref when it runtime-suspends. I'm guessing that the hdac code isn't handling the case where you system-suspend while still playing audio, so you were not runtime-suspended.

ujfalusi commented 2 months ago

I cannot reproduce it on upx (running in HDA mode, nothing fancy connected), neither on SDW device which have HDMI (and panel, which might change things). I'm not sure what can cause this atm

plbossart commented 2 months ago

I've found hdac_hdmi_runtime_suspend(), which seems to be releasing the wakeref when it runtime-suspends. I'm guessing that the hdac code isn't handling the case where you system-suspend while still playing audio, so you were not runtime-suspended.

the system suspend in hdac_hdmi.c seems to rely on pm_runtime_force_suspend(). I am not sure what it does and why this would matter anyways if you are not playing audio on HDMI. the entire i915 audio path should be disabled and powered-down.

Unless there's the silly keep-alive function that keeps HDMI running?

ujfalusi commented 2 months ago

Unless there's the silly keep-alive function that keeps HDMI running?

Hrm, I have the silent scream stream disabled on all of my devices...

ujfalusi commented 2 months ago

I cannot say what is going on...

On UPX-i11 HDA: it is OK under all combination

With SDW device (MTL): gives i915 raw-wakerefs=1 wakelocks=1 on cleanup under any combination, regardless of silent steam or playback

With SDW device (TGL-H): it is OK under all combination

:shrug:

marc-hb commented 2 months ago

Unless there's the silly keep-alive function that keeps HDMI running?

In a totally different i915 issue I recently discovered that DRM is polling for monitor hotplug by default even when everything is off. It's configured by /sys/module/drm_kms_helper/parameters/poll/intel_dp_detect

You can see it with options drm debug=0x2, example:

Jul 17 16:20:59 up2 kernel: i915 0000:00:02.0: [drm:intel_runtime_resume [i915]] Resuming device
Jul 17 16:20:59 up2 kernel: i915 0000:00:02.0: [drm:intel_runtime_resume [i915]] Device resumed
Jul 17 16:21:11 up2 kernel: i915 0000:00:02.0: [drm:intel_runtime_suspend [i915]] Suspending device
Jul 17 16:21:11 up2 kernel: i915 0000:00:02.0: [drm:intel_runtime_suspend [i915]] Device suspended
Jul 17 16:21:21 up2 kernel: i915 0000:00:02.0: [drm:intel_runtime_resume [i915]] Resuming device
Jul 17 16:21:21 up2 kernel: i915 0000:00:02.0: [drm:intel_runtime_resume [i915]] Device resumed

Sorry if off-topic, I thought it was worth sharing.