thesofproject / linux

Linux kernel source tree
Other
91 stars 131 forks source link

[BUG] ADL: deadlock (?) after warm reboot, error sof_probe_work [snd_sof] #4663

Open kovalev0 opened 1 year ago

kovalev0 commented 1 year ago

Describe the bug After a warm boot, the laptop will experience CPU soft lockups:

[ 32.479637] watchdog: BUG: soft lockup - CPU#3 stuck for 27s! [kworker/3:1:123] ... [ 32.479704] raid6_pq [ 32.479706] CPU: 3 PID: 123 Comm: kworker/3:1 Not tainted 6.5.8-std-def-alt1.1 thesofproject/sof#1 [ 32.479707] Hardware name: HP HP 250 15.6 inch G9 Notebook PC/89A0, BIOS F.57 09/19/2022 [ 32.479708] Workqueue: events sof_probe_work [snd_sof] [ 32.479721] RIP: 0010:_raw_spin_unlock_irq+0x25/0x60 [ 32.479725] Code: 90 90 90 90 90 0f 1f 44 00 00 48 83 3d 53 00 f4 00 00 74 42 c6 07 00 0f 1f 00 48 83 3d e3 fe f3 00 00 74 34 fb 0f 1f 44 00 00 <65> ff 0d 1c 47 32 4e 74 0f 31 c0 31 d2 31 c9 31 f6 31 ff c3 cc cc [ 32.479726] RSP: 0018:ffff9b94406d7b48 EFLAGS: 00000282 [ 32.479727] RAX: 0000000000000007 RBX: ffff8f72c3648830 RCX: ffff8f72c3648840 [ 32.479728] RDX: 0000000000000007 RSI: ffff9b9440714000 RDI: ffff8f72c3648cd8 [ 32.479728] RBP: 00000000401f0005 R08: 0000000000000000 R09: 0000000000000000 [ 32.479729] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004 [ 32.479729] R13: ffff8f72c3648cd8 R14: 0000000000000001 R15: ffff8f72c3648830 [ 32.479730] FS: 0000000000000000(0000) GS:ffff8f765f4c0000(0000) knlGS:0000000000000000 [ 32.479731] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 32.479731] CR2: 00007f6ce560d4e2 CR3: 00000002c9c34000 CR4: 0000000000750ee0 [ 32.479732] PKRU: 55555554 [ 32.479733] Call Trace: [ 32.479734] \ [ 32.479737] ? watchdog_timer_fn+0x1ce/0x230 [ 32.479739] ? __pfx_watchdog_timer_fn+0x10/0x10 [ 32.479741] ? __hrtimer_run_queues+0x10f/0x2b0 [ 32.479743] ? hrtimer_interrupt+0xf4/0x240 [ 32.479745] ? __sysvec_apic_timer_interrupt+0x5a/0x130 [ 32.479746] ? sysvec_apic_timer_interrupt+0x69/0x90 [ 32.479748] <\/IRQ> [ 32.479748] \ [ 32.479749] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 [ 32.479751] ? _raw_spin_unlock_irq+0x25/0x60 [ 32.479753] snd_hdac_bus_send_cmd+0xa2/0xc0 [snd_hda_core] [ 32.479762] snd_hdac_bus_exec_verb_unlocked+0x7c/0x1b0 [snd_hda_core] [ 32.479768] snd_hdac_bus_exec_verb+0x37/0x60 [snd_hda_core] [ 32.479774] hda_reg_read+0x1b7/0x250 [snd_hda_core] [ 32.479781] __snd_hdac_regmap_read_raw+0x8c/0x100 [snd_hda_core] [ 32.479788] snd_hdac_device_init+0x20b/0x430 [snd_hda_core] [ 32.479796] snd_hda_codec_device_init+0xad/0x270 [snd_hda_codec] [ 32.479806] hda_codec_probe_bus+0x16e/0x2e0 [snd_sof_intel_hda] [ 32.479810] hda_dsp_probe+0x3de/0x7c0 [snd_sof_intel_hda_common] [ 32.479819] sof_probe_work+0x2c/0x430 [snd_sof] [ 32.479828] process_one_work+0x22b/0x440 [ 32.479830] worker_thread+0x4a/0x3c0 [ 32.479832] ? __pfx_worker_thread+0x10/0x10 [ 32.479833] kthread+0xe1/0x110 [ 32.479835] ? __pfx_kthread+0x10/0x10 [ 32.479837] ret_from_fork+0x2d/0x50 [ 32.479839] ? __pfx_kthread+0x10/0x10 [ 32.479840] ret_from_fork_asm+0x1b/0x30 [ 32.479843] <\/TASK>

See: https://bugzilla.kernel.org/show_bug.cgi?id=217867 https://bugzilla.altlinux.org/48171

When the sound is turned on for the first time [1], it works properly, after a reboot [2] - sound devices are unavailable and the system freezes when the system is turned off.

[1] journalctl-k-b_-1_OK.txt alsa-info_ok.txt

[2] journalctl-k-b_ERR.txt alsa-info_err.txt

firmware and topology files are used the latest: sof-adl.ri sof-hda-generic-2ch.tplg

To Reproduce latest kernels 6.1.57 and 6.5.8

If you manually unload the module every time before restarting, # modprobe -r snd_sof_pci_intel_tgl then from the moment the system is first turned on, the error is reproduced on the 3rd - 5th boot.

If you do not unload the module, then it is stable on each next system boot

Environment 1) Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).

Screenshots or console output dmesg_ERR.txt

sof-dyndbg.conf ``` # ACPI options snd_sof_acpi dyndbg=+pmf options snd_sof_acpi_intel_byt dyndbg=+pmf options snd_sof_acpi_intel_bdw dyndbg=+pmf options snd_sof_intel_byt dyndbg=+pmf options snd_sof_intel_bdw dyndbg=+pmf # PCI options snd_sof_pci dyndbg=+pmf options snd_sof_pci_intel_apl dyndbg=+pmf options snd_sof_pci_intel_cnl dyndbg=+pmf options snd_sof_pci_intel_icl dyndbg=+pmf options snd_sof_pci_intel_tgl dyndbg=+pmf options snd_sof_pci_intel_mtl dyndbg=+pmf options snd_sof_pci_intel_lnl dyndbg=+pmf # DSP selection options snd_intel_dspcfg dyndbg=+pmf options snd_intel_sdw_acpi dyndbg=+pmf # SOF internals options snd_sof_intel_ipc dyndbg=+pmf options snd_sof_intel_hda_common dyndbg=+pmf options snd_sof_intel_hda_mlink dyndbg=+pmf options snd_sof_intel_hda dyndbg=+pmf options snd_sof dyndbg=+pmf options snd_sof_nocodec dyndbg=+pmf # HDA options snd_hda_intel dyndbg=+pmf options snd-hda-codec-realtek dyndbg=+pmf options snd-hda-codec-generic dyndbg=+pmf options snd-hda-codec-hdmi dyndbg=+pmf options snd-hda-codec dyndbg=+pmf # SoundWire core options soundwire_bus dyndbg=+pmf options soundwire_generic_allocation dyndbg=+pmf options soundwire_cadence dyndbg=+pmf options soundwire_intel_init dyndbg=+pmf options soundwire_intel dyndbg=+pmf options snd_sof sof_debug=1 options snd_sof_pci_intel_tgl dyndbg=+pmf ```
plbossart commented 1 year ago
#define PCI_DEVICE_ID_INTEL_HDA_ADL_P   0x51c8

@kv2019i @ujfalusi wasn't there an issue already identified with snd_hdac_bus_send_cmd() ?

kv2019i commented 1 year ago

We have #4443 still open, not sure this is the same.

It seems we get stuck with i915. The display codec is visible on the bus:

dmesg_ERR.txt:

[    5.331766] snd_sof_intel_hda:hda_codec_detect_mask: sof-audio-pci-intel-tgl 0000:00:1f.3: codec_mask = 0x5
...
[    5.333441] snd_sof_intel_hda:hda_codec_probe: sof-audio-pci-intel-tgl 0000:00:1f.3: HDA codec #0 probed OK: response: fe05
...
# stuck in here, doing the codec probe
[   32.479753]  snd_hdac_bus_send_cmd+0xa2/0xc0 [snd_hda_core]

This could be a issue with the HDA link parameters. Kernel should timeout on the read, but the rootcause could be that the HDA link is not configured correctly and audio driver cannot communicate with the display codec. This configuration is supposed to be done by BIOS on boot, so maybe upon warm reboot (on this system), this programming is not done.

To shed light on this, @kovalev0 could you take drm.i915 debug logs of working and failing cases, as documented in https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs ?

I'd like to see following print (emitted from linux/drivers/gpu/drm/i915/display/intel_audio.c):

»       »       drm_dbg_kms(&i915->drm, "use AUD_FREQ_CNTRL of 0x%x (init value 0x%x)\n",
»       »       »           aud_freq, aud_freq_init);

No need to rebuild the kernel, logs can be enabled with module parameters.

kovalev0 commented 1 year ago

To shed light on this, @kovalev0 could you take drm.i915 debug logs of working and failing cases, as documented in https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs ?

working: journalctl-k-b_-1_drm_OK.txt

failing: journalctl-k-b_drm_ERR.txt

kv2019i commented 1 year ago

Thanks @kovalev0 . Hmm, so this actually looks to be ok:

окт 27 15:03:12 mimic7subtle kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
окт 27 15:03:12 mimic7subtle kernel: i915 0000:00:02.0: [drm:intel_power_well_enable [i915]] enabling PW_2
окт 27 15:03:12 mimic7subtle kernel: i915 0000:00:02.0: [drm:i915_audio_component_get_power [i915]] restored AUD_FREQ_CNTRL to 0x8010

So display codec is powered up and programmed with correct settings (0x8010 as in the success case log).

Now when I look at the logs again, I think this is probably the Realtek codec after all. The error comes after

[    5.333441] snd_sof_intel_hda:hda_codec_probe: sof-audio-pci-intel-tgl 0000:00:1f.3: HDA codec #0 probed OK: response: fe05

... but backtrace shows hda_codec_device_init(), so we are probably still initializing codec #0 (i.e. the Realtek external codec). So IOW, this very much looks another case of https://github.com/thesofproject/linux/issues/4443

plbossart commented 8 months ago

@kovalev0 do you still see a problem or can we close this issue as solved?

frostbyte-ninja commented 8 months ago

@plbossart I recently encountered this issue with Linux version 6.7.1, indicating it is not resolved yet. However, I believe it is indeed a duplicate of #4443.

nicktelindert commented 6 months ago

This is indeed the same problem as i reported #4443 , the issue is still there at kernel 6.8. I am very curious if @frostbyte-ninja also has this issue on a HP laptop. I think the bios might be to blame here because i see mentioning of a buggy bios in the logging? But to be fair it does not occur on FreeBSD 14. I know it is a different system but i can imagine that there might be some similarities between the drivers? Since it is also open source it might be a helpful to see how they handle stuff?

frostbyte-ninja commented 6 months ago

I would have assumed that this is not exotic hardware, so I'm a bit surprised that not more people seem to be affected by the issue.

@nicktelindert Yep, it's a "HP 250 G10"

nicktelindert commented 6 months ago

The issue has been there for a long time: https://bugzilla.kernel.org/show_bug.cgi?id=217440 Caused by a commit to the 5.15 kernel, reverting the commit reported in the linked issue will fix it.

nicktelindert commented 6 months ago

I compiled a custom kernel with these changes reverted: https://github.com/torvalds/linux/commit/5aec989

The change was made to fix an issue for ALC236/ALC256 but apparently it introduced a lot of issues at least for ALC236. After removing these changes the deadlock issue is completely gone. It also fixes issues i have headphones not working(jack sense).

I made a repository with a linux kernel that does work without any issues: https://github.com/nicktelindert/linux

I only changed the realtek_patch.c