thesofproject / linux

Linux kernel source tree
Other
90 stars 129 forks source link

[ADL]: DSP fw load failure consistently due to memory alloc failure #3844

Closed jairaj-arava closed 2 years ago

jairaj-arava commented 2 years ago

While running suspend stress test with a Youtube playback, consistently soon after system resume SOF fw load is failing due to memory alloc failure as shown below:

[ 379.480206] sof-audio-pci-intel-tgl 0000:00:1f.3: loading firmware [ 379.480213] sof-audio-pci-intel-tgl 0000:00:1f.3: Debug PCIR: 00000002 at 00000048 [ 379.480218] sof-audio-pci-intel-tgl 0000:00:1f.3: Debug PCIW: 00000000 at 00000048 [ 379.480224] sof-audio-pci-intel-tgl 0000:00:1f.3: Debug PCIR: 00000000 at 00000044 [ 379.480228] sof-audio-pci-intel-tgl 0000:00:1f.3: Debug PCIW: 00000004 at 00000044 [ 379.480229] sof-audio-pci-intel-tgl 0000:00:1f.3: booting DSP firmware [ 379.484803] sof-audio-pci-intel-tgl 0000:00:1f.3: error: memory alloc failed: -12 [ 379.484810] sof-audio-pci-intel-tgl 0000:00:1f.3: error: dma prepare for ICCMAX stream failed [ 379.484812] sof-audio-pci-intel-tgl 0000:00:1f.3: error: failed to reset DSP [ 379.484814] sof-audio-pci-intel-tgl 0000:00:1f.3: error: failed to boot DSP firmware after resume -12 [ 379.484817] sof-audio-pci-intel-tgl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_pm_runtime_get on 0000:00:1f.3: -12 [ 379.484822] SSP1-Codec: soc_pcm_open() failed (-12) [ 379.484824] Speakers: ASoC: dpcm_be_dai_startup() failed at SSP1-Codec (-12) [ 379.484827] Speakers: dpcm_fe_dai_startup() failed (-12) [ 380.547828] sof-audio-pci-intel-tgl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_pm_runtime_get on 0000:00:1f.3: -22 [ 380.547834] SSP1-Codec: soc_pcm_open() failed (-22) [ 380.547837] Speakers: ASoC: dpcm_be_dai_startup() failed at SSP1-Codec (-22) [ 380.547839] Speakers: dpcm_fe_dai_startup() failed (-22)

Below are the steps to repro the issue quickly: Restrict memory to 2G. This can be done by adding the mem=4g command line option to the linux kernel. Here's an example of how to do that: /usr/share/vboot/bin/make_dev_ssd.sh --remove_rootfs_verification --partitions "2 4" && reboot

on next reboot:

rootdev

/dev/mmcblk0p5 Partition 4 holds current kernel. If rootdev had been mmcblk0p3, it would have been partition 2 /usr/share/vboot/bin/make_dev_ssd.sh --edit_config --partitions 4 Add mem=4G to the end of the command line and reboot.

On the next reboot:

cat /proc/cmdline

cros_secure console= loglevel=7 init=/sbin/init cros_secure drm.trace=0x106 root=PARTUUID=ea23107f-815b-1f4e-93a4-ef3e1e14c915/PARTNROFF=1 rootwait rw dm_verity.error_behavior=3 dm_verity.max_bios=-1 dm_verity.dev_wait=0 dm="1 vroot none ro 1,0 5324800 verity payload=ROOT_DEV hashtree=HASH_DEV hashstart=5324800 alg=sha256 root_hexdigest=06fa741b1ba01512f692173dd0a735d52410068b9ee77af537bb654c90057d1d salt=89acb6f5b0d731a492b52f2601378a9e1b5a290e1633337c5262128d241eb9a1" noinitrd cros_debug vt.global_cursor_default=0 kern_guid=ea23107f-815b-1f4e-93a4-ef3e1e14c915 add_efi_memmap boot=local noresume noswap i915.modeset=1 disablevmx=off kvm-intel.vmentry_l1d_flush=always i915.enable_dpcd_backlight=1 i915.enable_dbc=1 i915.enable_guc_loading=1 i915.enable_guc_submission=0 mem=4g

head -1 /proc/meminfo MemTotal: 1948356 kB Notice the mem=4g parameter at the end of the command line. The actual memory available is less than 2G.

Now, log in and run youtube. After a video is playing, you can suspend & resume the system with the following command: /usr/bin/suspend_stress_test -c 1 --suspend_max 1 --suspend_min 1 If the system comes back up successfully, log in and restart the youtube video. Then suspend again, restart again, etc. Within a few minutes the resume of the audio driver will fail with the messages shown above.

Here's the bash loop can be used on the DUT to automate the suspend/resume, login, and resume of the video: while true; do /usr/bin/suspend_stress_test -c 1 --suspend_max 1 --suspend_min 1; sleep 20; /usr/local/sbin/inject-keys.py -s $PASSWORD -k enter; sleep 5; /usr/local/sbin/inject-keys.py -k " "; sleep 45; done

Note: where $PASSWORD is the password for the test account used to login into the device.

Attached the dmesg log with dynamic logs enabled: OOM_dmesg.txt

jairaj-arava commented 2 years ago

The same issue is seen on KBL Eve platform too. However @crojewsk has submitted the fix https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/3792265 which reduced the frequency of occurance of the issue.

@sathya-nujella @sathyap-chrome @cujomalainey @ranj063 @plbossart @kv2019i @Vamshigopal

kv2019i commented 2 years ago

FYI to @ujfalusi as well. It seems a snd_dma_alloc_pages() fails allocating memory that can fit the firmware. One option would of course be to cache the allocation. This would cut the available memory for use-cases, but ensure we have enough at each runtime resume.

kv2019i commented 2 years ago

@jairaj-arava SOF driver is already using the snd_dma_alloc_pages() so that change for SKL doesn't help here. The options are not so great if free memory is low and fragmented. Let us try to reproduce this.

ujfalusi commented 2 years ago

Can you check the sound/core/memalloc.c in case you are missing some patches? There were a regression which manifested like this caused by 2c95b92ecd92e784785b1db8cccc4f0f2bfa850c ALSA: memalloc: Unify x86 SG-buffer handling (take#3)

If I recall right this patch was fixing that issue 925ca893b4a65177394581737b95d03fea2660f2 ALSA: memalloc: Add fallback SG-buffer allocations for x86

and like the following patch is needed: 9882d63bea14c8b3ed2c9360b9ab9f0e2f64ae2b ALSA: memalloc: Drop x86-specific hack for WC allocations

kv2019i commented 2 years ago

I think I found a bug that can explain this. @jairaj-arava , can you try this https://github.com/thesofproject/linux/pull/3853

jairaj-arava commented 2 years ago

I think I found a bug that can explain this. @jairaj-arava , can you try this #3853 I still see the issue with the #3853.

plbossart commented 2 years ago

@jairaj-arava that would mean the platform is not able to allocate 4KB? Can you attach the latest logs with this #3853?

jairaj-arava commented 2 years ago

@jairaj-arava that would mean the platform is not able to allocate 4KB? Can you attach the latest logs with this #3853?

We tried the below scenarios and could still see the issue in all scenarios . For the step 3, attached the dmesg logs and couldn't see fw logs since fw boot is failed (Attached fw logs dump just before we hit the issue).

  1. ADL-004 release
  2. ADL-004 release + IMR
  3. ADL-004 release + IMR + Kai's #3853 patch SOF_DMESG_logs.txt SOF_fw_logs.txt
kv2019i commented 2 years ago

@jairaj-arava I've been trying to reproduce this on a local machine for couple of hours, but with no luck (I'll contact you separately to sync the details of the environment). This is running a bit older Cros (R100), but I thought to start with this. I've monitored /proc/meminfo and I can see "Slab" and "SUnreclaim" grow a bit, and also overall "Committed_AS", but nothing drastic and I've not been able to hit the error. Could you add capture of /proc/meminfo to your test setup and capture the output at each iterator. I'd be interested to see the first and last meminfo dumps and a few captures in between.

We could also enable kmemleak in kernel and see if that catches any leaks in the test.

I also tested similar procedure with plain mainline SOF in a non-Cros system, and I can confirm memory usage is not growing if I do suspend test during playback.

ujfalusi commented 2 years ago

Locally tested workaround: https://github.com/thesofproject/linux/pull/3872

The buffer allocation for ICCMAX will be also skipped in case of IMR booting.

@jairaj-arava, can you test it on your system?

jairaj-arava commented 2 years ago

@jairaj-arava I've been trying to reproduce this on a local machine for couple of hours, but with no luck (I'll contact you separately to sync the details of the environment). This is running a bit older Cros (R100), but I thought to start with this. I've monitored /proc/meminfo and I can see "Slab" and "SUnreclaim" grow a bit, and also overall "Committed_AS", but nothing drastic and I've not been able to hit the error. Could you add capture of /proc/meminfo to your test setup and capture the output at each iterator. I'd be interested to see the first and last meminfo dumps and a few captures in between.

We could also enable kmemleak in kernel and see if that catches any leaks in the test.

I also tested similar procedure with plain mainline SOF in a non-Cros system, and I can confirm memory usage is not growing if I do suspend test during playback.

Hi @kv2019i , Attached is the memory info dump captured. I consistently see there is a good amount of memory available in the dump.

Memory_dump_new.txt

ujfalusi commented 2 years ago

The blueprint of this is really similar to what we had not long ago within 5.17 cycle: https://github.com/thesofproject/linux/issues/3609 , https://github.com/thesofproject/linux/issues/3530 , https://github.com/thesofproject/linux/issues/3584

In all cases we had good amount of memory available, yet the DMA memory allocation failed. I have checked sound/core/memalloc.c, sgbuf.c for clues and patterns, but I don't think there were similar issues reported with 5.10.

kv2019i commented 2 years ago

@jairaj-arava Please try this out: https://github.com/kv2019i/linux/commit/09ce62ecd724c02b6b434979b68fb071094b975d

jairaj-arava commented 2 years ago

@jairaj-arava Please try this out: kv2019i@09ce62e

sure @kv2019i. Will try this and update.

jairaj-arava commented 2 years ago

@jairaj-arava Please try this out: kv2019i@09ce62e

sure @kv2019i. Will try this and update.

So far with this fix no issue is observed in 2 boards with 10+ hours of testing and the other 4 boards running from 6+ hours.

kv2019i commented 2 years ago

Test results are good and patch merged upstream, so closing the bug.