thesofproject / linux

Linux kernel source tree

[BUG] [CML] [HDA] page allocation failure when system enter low memory #5000

Open Vamshigopal opened 1 month ago

Vamshigopal commented 1 month ago

Describe the bug

On a CML Chromebook with the legacy HDA driver, when the system runs low on memory we see page allocation failures for audio, and audio stops working. We also see a kernel crash after the page allocation failures.

To Reproduce

1. Boot the Chromebook.
2. Restrict the system memory to 4 GB.
3. Run memory-intensive workloads.
4. In parallel, run YouTube audio playback.

Environment

Kernel Branch: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/refs/tags/v5.15.152
Platform: CML

Logs dmesg.log

Screenshots or console output


[  143.995313] snd_hda_codec_realtek hdaudioC0D0: hda_codec_cleanup_stream: NID=0x3
[  143.995319] snd_hda_codec_realtek hdaudioC0D0: hda_codec_cleanup_stream: NID=0x2
**[  217.112589] modprobe: page allocation failure: order:4, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0**
[  217.112605] CPU: 3 PID: 5838 Comm: modprobe Tainted: G        W         5.15.152-22006-g5365727b3992 #1 38b847e2a0b5e38ed928959f12ca771020b542bf
[  217.112611] Hardware name: Dell Inc. Drallion/Drallion, BIOS Google_Drallion.12930.48.0 04/21/2020
[  217.112613] Call Trace:
[  217.112616]  <TASK>
[  217.112619]  dump_stack_lvl+0x69/0x97
[  217.112626]  warn_alloc+0x10c/0x165
[  217.112630]  ? psi_memstall_leave+0x7e/0x98
[  217.112636]  __alloc_pages+0x5dd/0x744
[  217.112640]  kmalloc_order+0x2e/0x86
[  217.112645]  kmalloc_order_trace+0x1e/0x8b
[  217.112649]  module_decompress+0xb2/0x285
[  217.112654]  __se_sys_finit_module+0xbc/0x148
[  217.112660]  do_syscall_64+0x51/0xa1
[  217.112664]  ? exit_to_user_mode_prepare+0x3c/0x84
[  217.112668]  entry_SYSCALL_64_after_hwframe+0x5c/0xc6
[  217.112673] RIP: 0033:0x7d1b93803d09
[  217.112677] Code: 5b 41 5c 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d df 40 0c 00 f7 d8 64 89 01 48
[  217.112681] RSP: 002b:00007ffc8b116a68 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  217.112685] RAX: ffffffffffffffda RBX: 0000555b345db0a0 RCX: 00007d1b93803d09
[  217.112688] RDX: 0000000000000004 RSI: 0000555b32ab9bc3 RDI: 0000000000000000
[  217.112691] RBP: 00007ffc8b116ad0 R08: 0000555b345e6798 R09: 0000000000000000
[  217.112693] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000040000
[  217.112696] R13: 0000000000000000 R14: 0000000000000004 R15: 0000555b32ab9bc3
[  217.112700]  </TASK>
[  217.112702] Mem-Info:
[  217.112704] active_anon:194690 inactive_anon:447172 isolated_anon:0
                active_file:11164 inactive_file:20014 isolated_file:0
                unevictable:81 dirty:6759 writeback:5
                slab_reclaimable:15007 slab_unreclaimable:33187
                mapped:33405 shmem:26152 pagetables:10996 bounce:0
                kernel_misc_reclaimable:0
                free:16483 free_pcp:649 free_cma:0
Vamshigopal commented 1 month ago

I've taken all the recent fixes that went in under sound/core/memalloc.c, but the issue still occurs.

5365727b399276 (HEAD) ALSA: memalloc: Workaround for Xen PV
fe074ccf1d6035 ALSA: memalloc: don't use GFP_COMP for non-coherent dma allocations
ce7ba60e2f4f8f ALSA: memalloc: don't pass bogus GFP_ flags to dma_alloc_*
809ca3aec74894 ALSA: memalloc: Allocate more contiguous pages for fallback case
0165554146733c ALSA: memalloc: Try dma_alloc_noncontiguous() at first
2c69c6c6950659 ALSA: memalloc: Don't fall back for SG-buffer with IOMMU
7c62355c56949d ALSA: memalloc: use __GFP_RETRY_MAYFAIL for DMA mem allocs
5bb7d534b7bec1 ALSA: hda: Once again fix regression of page allocations with IOMMU
594e13a86ff750 ALSA: doc: Drop snd_dma_continuous_data() usages
fb786627247b77 ALSA: memalloc: Drop special handling of GFP for CONTINUOUS allocation
4c6fdc8ad281a0 ASoC: Intel: sst: Switch to standard device pages
a5dd134ee8a619 ALSA: pdaudiocf: Drop superfluous GFP setup
ace17309a5c63d ALSA: vx: Drop superfluous GFP setup
74f65b9821734e ALSA: memalloc: Revive x86-specific WC page allocations again
36e977f79ab0a9 ALSA: memalloc: Fix missing return value comments for kernel docs
c3ec3d3224e6cc ALSA: memalloc: Drop x86-specific hack for WC allocations

Vamshigopal commented 1 month ago

cc: @kv2019i @plbossart @bardliao @sathya-nujella

plbossart commented 1 month ago

Can we try with a non-Chrome kernel to make sure this platform works first, before diving in the backport issues?

plbossart commented 1 month ago

BTW this is issue number FIVE THOUSAND. I don't know if I should cry or laugh.

Vamshigopal commented 1 month ago

> Can we try with a non-Chrome kernel to make sure this platform works first, before diving in the backport issues?

I have used kernel 6.9.0-rc7 from https://chromium.googlesource.com/chromiumos/third_party/kernel/+/refs/heads/merge/continuous/chromeos-kernelupstream-6.9-rc7. It is the same as the upstream kernel, with only a few additional Chrome-specific patches to support Chrome boot. With this kernel I see the issue reproduce with the same signature:

[ 1976.217652] perf: page allocation failure: order:4, mode:0xdc0(GFP_KERNEL|GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
[ 1976.217668] CPU: 3 PID: 11754 Comm: perf Not tainted 6.9.0-rc7-g4157e5c9501e-dirty #1 47154153e3152498d7551147b11cdd2fdbee3ec5
[ 1976.217673] Hardware name: Dell Inc. Drallion/Drallion, BIOS Google_Drallion.12930.48.0 04/21/2020
[ 1976.217675] Call Trace:
[ 1976.217678]
[ 1976.217681] dump_stack_lvl+0x40/0xb0
[ 1976.217687] warn_alloc+0x10e/0x180
[ 1976.217693] alloc_pages_slowpath+0xda0/0xdf0
[ 1976.217697] alloc_pages+0x239/0x2d0
[ 1976.217701] reserve_ds_buffers+0x20a/0x4c0
[ 1976.217706] x86_reserve_hardware+0xd2/0x1c0
[ 1976.217710] x86_pmu_event_init+0x4e/0x320
[ 1976.217715] perf_try_init_event+0x63/0x120
[ 1976.217718] perf_event_alloc+0x4a6/0x760
[ 1976.217722] ksys_perf_event_open+0x2ad/0xa60
[ 1976.217725] ? flush_tlb_func+0xe0/0x1d0
[ 1976.217730] x64_sys_perf_event_open+0x22/0x30
[ 1976.217733] do_syscall_64+0x72/0xf0
[ 1976.217736] ? handle_mm_fault+0x8a6/0x9e0
[ 1976.217742] ? exc_page_fault+0x202/0x6a0
[ 1976.217745] ? clear_bhb_loop+0x45/0xa0
[ 1976.217749] ? clear_bhb_loop+0x45/0xa0
[ 1976.217752] ? clear_bhb_loop+0x45/0xa0
[ 1976.217754] ? clear_bhb_loop+0x45/0xa0
[ 1976.217757] entry_SYSCALL_64_after_hwframe+0x71/0x79
[ 1976.217761] RIP: 0033:0x7e6771f31d09
[ 1976.217764] Code: 5b 41 5c 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d df 40 0c 00 f7 d8 64 89 01 48
[ 1976.217768] RSP: 002b:00007ffd47999668 EFLAGS: 00000246 ORIG_RAX: 000000000000012a
[ 1976.217772] RAX: ffffffffffffffda RBX: 0000585f3eefc2f0 RCX: 00007e6771f31d09
[ 1976.217774] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: 0000585f3eefc300
[ 1976.217777] RBP: 00007ffd47999720 R08: 0000000000000008 R09: 0000000000000008
[ 1976.217780] R10: 00000000ffffffff R11: 0000000000000246 R12: 0000585f3ee63a50
[ 1976.217782] R13: 00000000ffffffff R14: 0000000000000000 R15: 00000000ffffffff
[ 1976.217786]

dmesg-6-9.log

plbossart commented 1 month ago

Thanks @Vamshigopal, this is helpful in that it's obviously not a backport issue, but the trace does not really point to a specific audio driver doing bad things. It's not even the SOF driver used but snd-hda-intel.

It seems to be a problem with memory management, notifying @tiwai @kv2019i since that's changed a lot since initial CML Chromebooks came out.

tiwai commented 1 month ago

Those are memory allocation failures of higher orders (4) by other code, and it implies that the system memory is highly fragmented. The only concern is whether this fragmentation happened by some memory leaks. If so, the leaks have to be fixed.
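To make the "higher order" point concrete, here is a minimal Python sketch that decodes the failure line from the first log. It assumes 4 KiB pages (the usual x86-64 configuration, not stated in the logs), and the helper name `parse_alloc_failure` is made up for illustration. The `free:16483` figure is taken from the Mem-Info dump in the original report.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages, typical for x86-64

def parse_alloc_failure(line):
    """Pull task name, order and GFP mode out of a 'page allocation failure' line."""
    m = re.search(
        r"(\S+): page allocation failure: order:(\d+), mode:(0x[0-9a-f]+)\(([^)]*)\)",
        line,
    )
    if m is None:
        return None
    order = int(m.group(2))
    return {
        "comm": m.group(1),
        "order": order,
        "pages": 1 << order,                 # contiguous pages requested
        "bytes": (1 << order) * PAGE_SIZE,   # size of the contiguous block
        "mode": int(m.group(3), 16),
        "flags": m.group(4).split("|"),
    }

line = ("[  217.112589] modprobe: page allocation failure: order:4, "
        "mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0")
info = parse_alloc_failure(line)

free_pages = 16483  # 'free:' value from the Mem-Info dump in the report
total_free_mib = free_pages * PAGE_SIZE / 2**20

print(info["comm"], "wanted", info["pages"], "contiguous pages =",
      info["bytes"] // 1024, "KiB")
print("total free memory: about", round(total_free_mib, 1), "MiB")
```

Under that page-size assumption, an order-4 request is only 64 KiB, while tens of MiB were free in total, which is consistent with fragmentation rather than outright exhaustion.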

Vamshigopal commented 1 month ago

@tiwai we see this issue on devices in the field, but we are not sure of the exact environment. To reproduce it faster, I'm using https://github.com/stressapptest/stressapptest and https://chromium.googlesource.com/chromiumos/platform/factory/+/HEAD/py/test/pytests/stressapptest.py

Can you suggest any experiments or debug prints to narrow down the issue further?

tiwai commented 1 month ago

It's no bug, per se, if it's really the result of a highly fragmented system. The allocation failure of higher order pages is no fatal error in general.

You can try to check whether there are memory leaks, e.g. by examining the actual free pages. Or try some kernel configs for debugging memory leaks.
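One way to examine the actual free pages per order is `/proc/buddyinfo`, which lists the number of free blocks of each order for every zone. The following is a rough Python sketch, not a definitive tool: the function names are made up for illustration, and the `SAMPLE` text is hypothetical data, not taken from the affected device.

```python
def parse_buddyinfo(text):
    """Return a list of (node, zone, counts), where counts[i] is the number
    of free blocks of order i."""
    zones = []
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "Node 0, zone Normal c0 c1 ... cN"
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        counts = [int(x) for x in parts[4:]]
        zones.append((parts[1].rstrip(","), parts[3], counts))
    return zones

def free_pages_at_or_above(counts, min_order):
    """Free pages held in blocks of at least min_order."""
    return sum(n << order for order, n in enumerate(counts) if order >= min_order)

# Hypothetical sample: plenty of small blocks, nothing at order 4 or above.
SAMPLE = "Node 0, zone   Normal   5000   2000    300     20      0      0      0      0      0      0      0\n"

def report(text):
    for node, zone, counts in parse_buddyinfo(text):
        total = free_pages_at_or_above(counts, 0)
        high = free_pages_at_or_above(counts, 4)
        print(f"node {node} zone {zone}: {total} free pages, "
              f"{high} of them in order>=4 blocks")

if __name__ == "__main__":
    try:
        with open("/proc/buddyinfo") as f:
            report(f.read())
    except OSError:
        report(SAMPLE)  # fall back to the hypothetical sample off-target
```

If the order-4 and higher columns stay at or near zero while the low-order columns remain large, that points to fragmentation. If the total keeps shrinking across repeated runs of the same workload, a leak becomes more plausible.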

Vamshigopal commented 1 month ago

Thanks @tiwai for the suggestions. I've added the kernel config CONFIG_DEBUG_KMEMLEAK=y to check for memory leaks, but I see this warning:

kmemleak: Memory pool empty, consider increasing CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE

However much we increase the pool size, we still get the same warning.

Can you please suggest any other kernel configs for debugging memory leaks, and how we can examine the actual free pages?

While the test is running, top shows 60-80 MB free, but I'm not sure how many free pages we have.

Also, I see that CONFIG_COMPACTION is enabled. This option enables memory compaction in the kernel, which attempts to reduce fragmentation by migrating movable pages so that scattered free pages coalesce into larger contiguous blocks.
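To see whether compaction is actually being attempted and failing during the test, one option is to compare the `compact_stall`, `compact_fail` and `compact_success` counters in `/proc/vmstat` before and after a run. A minimal Python sketch follows. The helper names are made up, and the two snapshots are hypothetical numbers for illustration only. On a real device the text would come from reading `/proc/vmstat` twice.

```python
def parse_vmstat(text):
    """Parse '/proc/vmstat'-style 'name value' lines into a dict."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            stats[fields[0]] = int(fields[1])
    return stats

def compaction_delta(before, after,
                     keys=("compact_stall", "compact_fail", "compact_success")):
    """Difference of the compaction counters between two snapshots."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}

# Hypothetical snapshots taken before and after a stress run.
BEFORE = "compact_stall 100\ncompact_fail 60\ncompact_success 40\n"
AFTER = "compact_stall 350\ncompact_fail 290\ncompact_success 60\n"

delta = compaction_delta(parse_vmstat(BEFORE), parse_vmstat(AFTER))
for name, value in delta.items():
    print(name, value)
```

A large rise in `compact_fail` relative to `compact_success` would suggest that compaction runs but cannot assemble large enough blocks, for example because too many pages are unmovable.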