Closed jairaj-arava closed 2 years ago
The same issue is seen on KBL Eve platform too. However @crojewsk has submitted the fix https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/3792265 which reduced the frequency of occurance of the issue.
@sathya-nujella @sathyap-chrome @cujomalainey @ranj063 @plbossart @kv2019i @Vamshigopal
FYI to @ujfalusi as well. It seems a snd_dma_alloc_pages() fails allocating memory that can fit the firmware. One option would of course be to cache the allocation. This would cut the available memory for use-cases, but ensure we have enough at each runtime resume.
@jairaj-arava SOF driver is already using the snd_dma_alloc_pages() so that change for SKL doesn't help here. The options are not so great if free memory is low and fragmented. Let us try to reproduce this.
Can you check the sound/core/memalloc.c
in case you are missing some patches?
There were a regression which manifested like this caused by
2c95b92ecd92e784785b1db8cccc4f0f2bfa850c ALSA: memalloc: Unify x86 SG-buffer handling (take#3)
If I recall right this patch was fixing that issue 925ca893b4a65177394581737b95d03fea2660f2 ALSA: memalloc: Add fallback SG-buffer allocations for x86
and like the following patch is needed: 9882d63bea14c8b3ed2c9360b9ab9f0e2f64ae2b ALSA: memalloc: Drop x86-specific hack for WC allocations
I think I found a bug that can explain this. @jairaj-arava , can you try this https://github.com/thesofproject/linux/pull/3853
I think I found a bug that can explain this. @jairaj-arava , can you try this #3853 I still see the issue with the #3853.
@jairaj-arava that would mean the platform is not able to allocate 4KB? Can you attach the latest logs with this #3853?
@jairaj-arava that would mean the platform is not able to allocate 4KB? Can you attach the latest logs with this #3853?
We tried the below scenarios and could still see the issue in all scenarios . For the step 3, attached the dmesg logs and couldn't see fw logs since fw boot is failed (Attached fw logs dump just before we hit the issue).
@jairaj-arava I've been trying to reproduce this on a local machine for couple of hours, but with no luck (I'll contact you separately to sync the details of the environment). This is running a bit older Cros (R100), but I thought to start with this. I've monitored /proc/meminfo and I can see "Slab" and "SUnreclaim" grow a bit, and also overall "Committed_AS", but nothing drastic and I've not been able to hit the error. Could you add capture of /proc/meminfo to your test setup and capture the output at each iterator. I'd be interested to see the first and last meminfo dumps and a few captures in between.
We could also enable kmemleak in kernel and see if that catches any leaks in the test.
I also tested similar procedure with plain mainline SOF in a non-Cros system, and I can confirm memory usage is not growing if I do suspend test during playback.
Locally tested workaround: https://github.com/thesofproject/linux/pull/3872
The buffer allocation for ICCMAX will be also skipped in case of IMR booting.
@jairaj-arava, can you test it on your system?
@jairaj-arava I've been trying to reproduce this on a local machine for couple of hours, but with no luck (I'll contact you separately to sync the details of the environment). This is running a bit older Cros (R100), but I thought to start with this. I've monitored /proc/meminfo and I can see "Slab" and "SUnreclaim" grow a bit, and also overall "Committed_AS", but nothing drastic and I've not been able to hit the error. Could you add capture of /proc/meminfo to your test setup and capture the output at each iterator. I'd be interested to see the first and last meminfo dumps and a few captures in between.
We could also enable kmemleak in kernel and see if that catches any leaks in the test.
I also tested similar procedure with plain mainline SOF in a non-Cros system, and I can confirm memory usage is not growing if I do suspend test during playback.
Hi @kv2019i , Attached is the memory info dump captured. I consistently see there is a good amount of memory available in the dump.
The blueprint of this is really similar to what we had not long ago within 5.17 cycle: https://github.com/thesofproject/linux/issues/3609 , https://github.com/thesofproject/linux/issues/3530 , https://github.com/thesofproject/linux/issues/3584
In all cases we had good amount of memory available, yet the DMA memory allocation failed. I have checked sound/core/memalloc.c, sgbuf.c for clues and patterns, but I don't think there were similar issues reported with 5.10.
@jairaj-arava Please try this out: https://github.com/kv2019i/linux/commit/09ce62ecd724c02b6b434979b68fb071094b975d
@jairaj-arava Please try this out: kv2019i@09ce62e
sure @kv2019i. Will try this and update.
@jairaj-arava Please try this out: kv2019i@09ce62e
sure @kv2019i. Will try this and update.
So far with this fix no issue is observed in 2 boards with 10+ hours of testing and the other 4 boards running from 6+ hours.
Test results are good and patch merged upstream, so closing the bug.
While running suspend stress test with a Youtube playback, consistently soon after system resume SOF fw load is failing due to memory alloc failure as shown below:
[ 379.480206] sof-audio-pci-intel-tgl 0000:00:1f.3: loading firmware [ 379.480213] sof-audio-pci-intel-tgl 0000:00:1f.3: Debug PCIR: 00000002 at 00000048 [ 379.480218] sof-audio-pci-intel-tgl 0000:00:1f.3: Debug PCIW: 00000000 at 00000048 [ 379.480224] sof-audio-pci-intel-tgl 0000:00:1f.3: Debug PCIR: 00000000 at 00000044 [ 379.480228] sof-audio-pci-intel-tgl 0000:00:1f.3: Debug PCIW: 00000004 at 00000044 [ 379.480229] sof-audio-pci-intel-tgl 0000:00:1f.3: booting DSP firmware [ 379.484803] sof-audio-pci-intel-tgl 0000:00:1f.3: error: memory alloc failed: -12 [ 379.484810] sof-audio-pci-intel-tgl 0000:00:1f.3: error: dma prepare for ICCMAX stream failed [ 379.484812] sof-audio-pci-intel-tgl 0000:00:1f.3: error: failed to reset DSP [ 379.484814] sof-audio-pci-intel-tgl 0000:00:1f.3: error: failed to boot DSP firmware after resume -12 [ 379.484817] sof-audio-pci-intel-tgl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_pm_runtime_get on 0000:00:1f.3: -12 [ 379.484822] SSP1-Codec: soc_pcm_open() failed (-12) [ 379.484824] Speakers: ASoC: dpcm_be_dai_startup() failed at SSP1-Codec (-12) [ 379.484827] Speakers: dpcm_fe_dai_startup() failed (-12) [ 380.547828] sof-audio-pci-intel-tgl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_pm_runtime_get on 0000:00:1f.3: -22 [ 380.547834] SSP1-Codec: soc_pcm_open() failed (-22) [ 380.547837] Speakers: ASoC: dpcm_be_dai_startup() failed at SSP1-Codec (-22) [ 380.547839] Speakers: dpcm_fe_dai_startup() failed (-22)
Below are the steps to repro the issue quickly: Restrict memory to 2G. This can be done by adding the mem=4g command line option to the linux kernel. Here's an example of how to do that: /usr/share/vboot/bin/make_dev_ssd.sh --remove_rootfs_verification --partitions "2 4" && reboot
on next reboot:
rootdev
/dev/mmcblk0p5 Partition 4 holds current kernel. If rootdev had been mmcblk0p3, it would have been partition 2 /usr/share/vboot/bin/make_dev_ssd.sh --edit_config --partitions 4 Add mem=4G to the end of the command line and reboot.
On the next reboot:
cat /proc/cmdline
cros_secure console= loglevel=7 init=/sbin/init cros_secure drm.trace=0x106 root=PARTUUID=ea23107f-815b-1f4e-93a4-ef3e1e14c915/PARTNROFF=1 rootwait rw dm_verity.error_behavior=3 dm_verity.max_bios=-1 dm_verity.dev_wait=0 dm="1 vroot none ro 1,0 5324800 verity payload=ROOT_DEV hashtree=HASH_DEV hashstart=5324800 alg=sha256 root_hexdigest=06fa741b1ba01512f692173dd0a735d52410068b9ee77af537bb654c90057d1d salt=89acb6f5b0d731a492b52f2601378a9e1b5a290e1633337c5262128d241eb9a1" noinitrd cros_debug vt.global_cursor_default=0 kern_guid=ea23107f-815b-1f4e-93a4-ef3e1e14c915 add_efi_memmap boot=local noresume noswap i915.modeset=1 disablevmx=off kvm-intel.vmentry_l1d_flush=always i915.enable_dpcd_backlight=1 i915.enable_dbc=1 i915.enable_guc_loading=1 i915.enable_guc_submission=0 mem=4g
head -1 /proc/meminfo MemTotal: 1948356 kB Notice the mem=4g parameter at the end of the command line. The actual memory available is less than 2G.
Now, log in and run youtube. After a video is playing, you can suspend & resume the system with the following command: /usr/bin/suspend_stress_test -c 1 --suspend_max 1 --suspend_min 1 If the system comes back up successfully, log in and restart the youtube video. Then suspend again, restart again, etc. Within a few minutes the resume of the audio driver will fail with the messages shown above.
Here's the bash loop can be used on the DUT to automate the suspend/resume, login, and resume of the video: while true; do /usr/bin/suspend_stress_test -c 1 --suspend_max 1 --suspend_min 1; sleep 20; /usr/local/sbin/inject-keys.py -s $PASSWORD -k enter; sleep 5; /usr/local/sbin/inject-keys.py -k " "; sleep 45; done
Note: where $PASSWORD is the password for the test account used to login into the device.
Attached the dmesg log with dynamic logs enabled: OOM_dmesg.txt