thesofproject / linux

Linux kernel source tree
Other
89 stars 129 forks source link

Hang on boots with snd_sof_pci #3865

Open basilbasilikum opened 2 years ago

basilbasilikum commented 2 years ago

Very similar to #2066 and probably also #3838

On Fedora 36, with clean install, some boots will hang (will not get to login screen, or even if it does, it hangs shortly after).

If options snd-intel-dspcfg dsp_driver=1 is added to /etc/modprobe.d, it'll boot fine with sound, but no mic.

sudo journalctl -b -1 --no-hostname after a force reboot is here, but probably not of any use because it's truncated and new lines are not added to disk.

Here is a video of the boot without rhgb quiet, there are more lines in here than the journalctl log.

lspci -nn:

00:00.0 Host bridge [0600]: Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers [8086:3ec4] (rev 07)
00:01.0 PCI bridge [0604]: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 07)
00:02.0 VGA compatible controller [0300]: Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630] [8086:3e9b]
00:04.0 Signal processing controller [1180]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem [8086:1903] (rev 07)
00:08.0 System peripheral [0880]: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model [8086:1911]
00:12.0 Signal processing controller [1180]: Intel Corporation Cannon Lake PCH Thermal Controller [8086:a379] (rev 10)
00:13.0 Serial controller [0700]: Intel Corporation Device [8086:a37c] (rev 10)
00:14.0 USB controller [0c03]: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller [8086:a36d] (rev 10)
00:14.2 RAM memory [0500]: Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f] (rev 10)
00:14.3 Network controller [0280]: Intel Corporation Cannon Lake PCH CNVi WiFi [8086:a370] (rev 10)
00:15.0 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 [8086:a368] (rev 10)
00:15.1 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1 [8086:a369] (rev 10)
00:16.0 Communication controller [0780]: Intel Corporation Cannon Lake PCH HECI Controller [8086:a360] (rev 10)
00:17.0 SATA controller [0106]: Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller [8086:a353] (rev 10)
00:1b.0 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #17 [8086:a340] (rev f0)
00:1b.4 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #21 [8086:a32c] (rev f0)
00:1d.0 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 [8086:a330] (rev f0)
00:1f.0 ISA bridge [0601]: Intel Corporation HM470 Chipset LPC/eSPI Controller [8086:a30d] (rev 10)
00:1f.3 Multimedia audio controller [0401]: Intel Corporation Cannon Lake PCH cAVS [8086:a348] (rev 10)
00:1f.4 SMBus [0c05]: Intel Corporation Cannon Lake PCH SMBus Controller [8086:a323] (rev 10)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH SPI Controller [8086:a324] (rev 10)
02:00.0 PCI bridge [0604]: Intel Corporation JHL6340 Thunderbolt 3 Bridge (C step) [Alpine Ridge 2C 2016] [8086:15da] (rev 02)
03:00.0 PCI bridge [0604]: Intel Corporation JHL6340 Thunderbolt 3 Bridge (C step) [Alpine Ridge 2C 2016] [8086:15da] (rev 02)
03:01.0 PCI bridge [0604]: Intel Corporation JHL6340 Thunderbolt 3 Bridge (C step) [Alpine Ridge 2C 2016] [8086:15da] (rev 02)
03:02.0 PCI bridge [0604]: Intel Corporation JHL6340 Thunderbolt 3 Bridge (C step) [Alpine Ridge 2C 2016] [8086:15da] (rev 02)
04:00.0 System peripheral [0880]: Intel Corporation JHL6340 Thunderbolt 3 NHI (C step) [Alpine Ridge 2C 2016] [8086:15d9] (rev 02)
3a:00.0 USB controller [0c03]: Intel Corporation JHL6340 Thunderbolt 3 USB 3.1 Controller (C step) [Alpine Ridge 2C 2016] [8086:15db] (rev 02)
3b:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Blue SN550 NVMe SSD [15b7:5009] (rev 01)
3c:00.0 Non-Volatile memory controller [0108]: SK hynix BC501 NVMe Solid State Drive [1c5c:1327]

rpm -qa alsa\* | sort:

alsa-lib-1.2.7.2-1.fc36.x86_64
alsa-sof-firmware-2.1.1-1.fc36.noarch
alsa-ucm-1.2.7.2-1.fc36.noarch
alsa-utils-1.2.7-1.fc36.x86_64

uname -r:

5.19.8-200.fc36.x86_64

Laptop is Dell Inspiron 15 7590.

Also reported to https://bugzilla.redhat.com/show_bug.cgi?id=2127312

Tried to use v2.2 from https://github.com/thesofproject/sof-bin, but still the same.

plbossart commented 2 years ago

@basilbasilikum can you share the output of 'alsa-info', it's not clear what hardware you are using. Thanks!

Can you also add a 'blacklist snd-sof-pci-intel-cnl', boot and then do a manual 'modprobe snd-sof-pci-intel-cnl' that would help check if there's something blocking

better yet, add dynamic debug by copying this file sof-dyndbg.conf.txt as /etc/modprobe.d/sof-dyndbg.conf

basilbasilikum commented 2 years ago

alsa-info.sh: http://alsa-project.org/db/?f=d613e748b059dc93e8c761432a6be7e93f1d393c

I will try 'blacklist snd-sof-pci-intel-cnl' and sof-dyndbg.conf.txt and report back.

basilbasilikum commented 2 years ago

With blacklist snd-sof-pci-intel-cnl, and sudo modprobe snd-sof-pci-intel-cnl after boot, the command completed and exited without errors but immediately after exit the system hangs.

basilbasilikum commented 2 years ago

I have enabled dynamic debugging, but not sure how to get the logs file when it crash because:

  1. it doesn't always crash immediately
  2. when it crash, the logs are not synced to disk so I can't get it after a force reboot

I will try enabling dynamic debugging and boot without blacklist snd-sof-pci-intel-cnl and try to record the screen as it boots later today when I get home.

plbossart commented 2 years ago

ok, so

00:1f.3 Multimedia audio controller [0401]: Intel Corporation Cannon Lake PCH cAVS [8086:a348] (rev 10)

means a standard platform

 { PCI_DEVICE(0x8086, 0xa348), /* CNL-H */
        .driver_data = (unsigned long)&cfl_desc},

Please make sure you remove the dsp_driver option, that will select snd-hda-intel only.

I am not aware of any specific issues with that platform, quite surprising.

basilbasilikum commented 2 years ago

https://youtu.be/OA8kGVxzN10 here is a video of the boot with dynamic debugging and without dsp_driver nor blacklist snd-sof-pci-intel-cnl

I am not aware of any specific issues with that platform, quite surprising.

Could it be because it's specific to Dell or certain hardware more specific than just the platform? Or it's just a coincidence that #3838 is also on a Dell?

plbossart commented 1 year ago

Try this blacklist snd-sof-pci-intel-cnl reboot dmesg -C dmesg -w > log.txt in another terminal sudo modprobe snd-sof-pci-intel-cnl

if this hangs again, reboot and check what the log says. If there are any issues please attach the file. thanks!

basilbasilikum commented 1 year ago

The log.txt came out empty, probably because the file wasn't written to disk before the system hanged.

However, I tried using another PC to connect to the laptop via SSH to execute the dmesg commands and had some results: https://pastebin.com/KPQrp0Ka

plbossart commented 1 year ago

@basilbasilikum no real smoking gun, but the last thing that happens is to deal with a capture stream:

[   45.642275] snd_sof:sof_pcm_open: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: open stream 0 dir 1

This typically happens when PulseAudio or PipeWire try to start streams on startup. Can you disable both and try just booting the kernel, with no streams in use. Thanks!

basilbasilikum commented 1 year ago

Does running systemctl --user disable pipewire-pulse.socket, systemctl --user disable pipewire-pulse.socket, systemctl --user disable pipewire.socket and systemctl --user disable pipewire.service as non-root user, and boot without blacklist snd-sof-pci-intel-cnl suffice?

basilbasilikum commented 1 year ago

So apparently even if the services are disabled, they'll still start on boot. So I booted with blacklist snd-sof-pci-intel-cnl, and then stopped the services, then only I run sudo modprobe snd-sof-pci-intel-cnl. I have done this for 10+ boots, so far it still hasn't crash yet. But might be too soon to tell. I will loop it overnight later to be sure.

My current /etc/modprobe.d/blacklist_snd.conf:

blacklist snd-sof-pci-intel-cnl

The command I run after every boot: systemctl --user stop pipewire-pulse.service pipewire-pulse.socket pipewire.service pipewire.socket && sleep 9 && sudo modprobe snd-sof-pci-intel-cnl

basilbasilikum commented 1 year ago

I have looped the actions (reboot, stop all pipewire services/sockets, modprobe snd-sof-pci-intel-cnl) every five minutes for 10 hours, and I can confirm that if pipewire is stopped before modprobing, it will indeed not hang. If I don't stop pipewire before modprobing, it will hang.

plbossart commented 1 year ago

@basilbasilikum thanks for the tests, what happens if you restart PipeWire e.g. 10s after the 'sudo modprobe snd-sof-pci-intel-cnl'

FWIW we have a similar sighting on a different platform, so chances are you were the first reporter of a problem we missed in our validation.

basilbasilikum commented 1 year ago

It still hangs after PipeWire is started.

plbossart commented 1 year ago

Thanks @basilbasilikum, this really helps.

plbossart commented 1 year ago

Sorry one more question @basilbasilikum. When you start pipewire, is the entire device stuck, or just the terminal where you issued the systemctl command? In the latter case, can you e.g. open another terminal and see the dmesg log?

basilbasilikum commented 1 year ago

The entire device is stuck, even REISUB doesn't work. So I can't see the dmesg log on that device. I can try using SSH from another device to get the dmesg log like before later.

basilbasilikum commented 1 year ago

Here's the dmesg log: https://pastebin.com/tq6Ych5X

10 seconds after I start PipeWire, it hangs.

basilbasilikum commented 1 year ago

Just to be clear, I connected to the device from another via SSH before I start PipeWire, and ran dmesg -C then dmesg -w before starting PipeWire. SSH does not work after it hangs.

plbossart commented 1 year ago

Thanks @basilbasilikum. I think we'll have to try to reproduce this on our side to see what happens, this is a weird issue indeed.