Runing techsupport with very large coredumps might cause page allocation failures for a brief period

vivekrnv commented 1 year ago

Description

When Techsupport tried to move and compress very large coredumps, we can briefly see page allocation failures for processes when they to try allocate larger memory blocks.

Steps to reproduce the issue:

Reduce counter polling period to lowest possible number
Run show_techsupport/test_auto_techsupport.py::test_max_limit[core] test.
This test creates very huge coredump files Eg: in this test, it created 4*1.5 GB sized dumps i.e. 6 GB.

Describe the results you received:

Page alllocation failures might be seen in syncd during this time.

Apr 13 07:29:59.610841 r-anaconda-15 ERR syncd#SDK: [SAI_OBJECT.ERR] mlnx_sai_object.c[710]- mlnx_allocate_sx_bulk_buffer: Failed to create buffer: Driver's Return Status is Non-Zero.
Apr 13 07:29:59.610906 r-anaconda-15 ERR syncd#SDK: [SAI_BUFFER.ERR] mlnx_sai_buffer.c[5829]- mlnx_sai_bulk_ingress_priority_group_stats_get: Failed to prepare bulk counter for pg occupancy stats: -1.
Apr 13 07:29:59.610932 r-anaconda-15 ERR syncd#SDK: [SAI_OBJECT.ERR] mlnx_sai_object.c[728]- mlnx_deallocate_sx_bulk_buffer: Failed to destroy buffer: Driver's Return Status is Non-Zero.
Apr 13 07:29:59.610954 r-anaconda-15 ERR syncd#SDK: [SAI_BUFFER.ERR] mlnx_sai_buffer.c[5937]- mlnx_sai_bulk_ingress_priority_group_stats_get: Failed to deallocate SDK occupancy buffer: -1.
Apr 13 07:29:59.610978 r-anaconda-15 WARNING syncd#SDK: :- bulkCollectData: Failed to bulk get stats for Priority Group Counter: 4294967295
Apr 13 07:29:59.618714 r-anaconda-15 WARNING kernel: [ 5578.086015] syncd: page allocation failure: order:7, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=5429967a2a8affb9ee29a9acf730a8f0ed91211ee3b8b6822ceb6d92e86fb1fc,mems_allowed=0
Apr 13 07:29:59.618755 r-anaconda-15 WARNING kernel: [ 5578.086026] CPU: 1 PID: 136508 Comm: syncd Tainted: G        W  OE     5.10.0-18-2-amd64 #1 Debian 5.10.140-1
Apr 13 07:29:59.618760 r-anaconda-15 WARNING kernel: [ 5578.086028] Hardware name: Mellanox Technologies Ltd. MSN3700C/VMOD0005, BIOS 5.11 07/12/2021
Apr 13 07:29:59.618764 r-anaconda-15 WARNING kernel: [ 5578.086029] Call Trace:
Apr 13 07:29:59.618766 r-anaconda-15 WARNING kernel: [ 5578.086039]  dump_stack+0x6b/0x83
Apr 13 07:29:59.618799 r-anaconda-15 WARNING kernel: [ 5578.086043]  warn_alloc+0x138/0x160
Apr 13 07:29:59.618802 r-anaconda-15 WARNING kernel: [ 5578.086046]  ? __alloc_pages_direct_compact+0x167/0x170
Apr 13 07:29:59.618805 r-anaconda-15 WARNING kernel: [ 5578.086049]  __alloc_pages_slowpath.constprop.0+0xc2d/0xc60
Apr 13 07:29:59.618807 r-anaconda-15 WARNING kernel: [ 5578.086053]  ? try_charge+0x6a6/0x7e0
Apr 13 07:29:59.618810 r-anaconda-15 WARNING kernel: [ 5578.086057]  __alloc_pages_nodemask+0x2de/0x310
Apr 13 07:29:59.618813 r-anaconda-15 WARNING kernel: [ 5578.086062]  kmalloc_order+0x28/0x100
Apr 13 07:29:59.618816 r-anaconda-15 WARNING kernel: [ 5578.086064]  kmalloc_order_trace+0x19/0x90
Apr 13 07:29:59.618818 r-anaconda-15 WARNING kernel: [ 5578.086229]  sx_core_mmap+0x6c/0x1c0 [sx_core]
Apr 13 07:29:59.618834 r-anaconda-15 WARNING kernel: [ 5578.086236]  mmap_region+0x484/0x700
Apr 13 07:29:59.618837 r-anaconda-15 WARNING kernel: [ 5578.086239]  do_mmap+0x446/0x5a0
Apr 13 07:29:59.618839 r-anaconda-15 WARNING kernel: [ 5578.086244]  ? security_mmap_file+0x81/0xb0
Apr 13 07:29:59.618841 r-anaconda-15 WARNING kernel: [ 5578.086248]  vm_mmap_pgoff+0xd7/0x120
Apr 13 07:29:59.618843 r-anaconda-15 WARNING kernel: [ 5578.086252]  ksys_mmap_pgoff+0x1ec/0x260
Apr 13 07:29:59.618846 r-anaconda-15 WARNING kernel: [ 5578.086257]  do_syscall_64+0x33/0x80
Apr 13 07:29:59.618854 r-anaconda-15 WARNING kernel: [ 5578.086261]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
Apr 13 07:29:59.618857 r-anaconda-15 WARNING kernel: [ 5578.086264] RIP: 0033:0x7fd8fad25492
Apr 13 07:29:59.618860 r-anaconda-15 WARNING kernel: [ 5578.086268] Code: e4 e8 a2 60 01 00 66 90 41 f7 c1 ff 0f 00 00 75 27 55 48 89 fd 53 89 cb 48 85 ff 74 3b 41 89 da 48 89 ef b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 66 5b 5d c3 0f 1f 00 48 8b 05 c9 89 0d 00 64
Apr 13 07:29:59.618863 r-anaconda-15 WARNING kernel: [ 5578.086270] RSP: 002b:00007fd8cbff8d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
Apr 13 07:29:59.618866 r-anaconda-15 WARNING kernel: [ 5578.086272] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fd8fad25492
Apr 13 07:29:59.618869 r-anaconda-15 WARNING kernel: [ 5578.086274] RDX: 0000000000000003 RSI: 000000000005c950 RDI: 0000000000000000
Apr 13 07:29:59.618877 r-anaconda-15 WARNING kernel: [ 5578.086275] RBP: 0000000000000000 R08: 0000000000000024 R09: 0000000000000000
Apr 13 07:29:59.618880 r-anaconda-15 WARNING kernel: [ 5578.086277] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
Apr 13 07:29:59.618883 r-anaconda-15 WARNING kernel: [ 5578.086278] R13: 000000000005c950 R14: 00007fd8cbffa410 R15: 00007fd8cbff92f0
Apr 13 07:29:59.618885 r-anaconda-15 WARNING kernel: [ 5578.086281] Mem-Info:
Apr 13 07:29:59.618888 r-anaconda-15 WARNING kernel: [ 5578.086286] active_anon:35070 inactive_anon:644445 isolated_anon:0
Apr 13 07:29:59.618891 r-anaconda-15 WARNING kernel: [ 5578.086286]  active_file:156104 inactive_file:810759 isolated_file:0
Apr 13 07:29:59.618893 r-anaconda-15 WARNING kernel: [ 5578.086286]  unevictable:0 dirty:10935 writeback:0
Apr 13 07:29:59.618902 r-anaconda-15 WARNING kernel: [ 5578.086286]  slab_reclaimable:33306 slab_unreclaimable:261065
Apr 13 07:29:59.618904 r-anaconda-15 WARNING kernel: [ 5578.086286]  mapped:98550 shmem:48352 pagetables:5249 bounce:0
Apr 13 07:29:59.618907 r-anaconda-15 WARNING kernel: [ 5578.086286]  free:37732 free_pcp:326 free_cma:0
Apr 13 07:29:59.618910 r-anaconda-15 WARNING kernel: [ 5578.086291] Node 0 active_anon:140280kB inactive_anon:2577780kB active_file:624416kB inactive_file:3243036kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:394200kB dirty:43740kB writeback:0kB shmem:193408kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 876544kB writeback_tmp:0kB kernel_stack:12032kB all_unreclaimable? no
Apr 13 07:29:59.618914 r-anaconda-15 WARNING kernel: [ 5578.086292] Node 0 DMA free:15884kB min:132kB low:164kB high:196kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15968kB managed:15884kB mlocked:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Apr 13 07:29:59.618922 r-anaconda-15 WARNING kernel: [ 5578.086297] lowmem_reserve[]: 0 1831 7833 7833 7833
Apr 13 07:29:59.618925 r-anaconda-15 WARNING kernel: [ 5578.086302] Node 0 DMA32 free:49364kB min:15536kB low:19420kB high:23304kB reserved_highatomic:0KB active_anon:8kB inactive_anon:107328kB active_file:122636kB inactive_file:1543060kB unevictable:0kB writepending:39292kB present:1974420kB managed:1875184kB mlocked:0kB pagetables:40kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Apr 13 07:29:59.618928 r-anaconda-15 WARNING kernel: [ 5578.086307] lowmem_reserve[]: 0 0 6002 6002 6002
Apr 13 07:29:59.618931 r-anaconda-15 WARNING kernel: [ 5578.086312] Node 0 Normal free:85680kB min:51912kB low:64888kB high:77864kB reserved_highatomic:0KB active_anon:140272kB inactive_anon:2470452kB active_file:501780kB inactive_file:1699968kB unevictable:0kB writepending:4448kB present:6291456kB managed:6146536kB mlocked:0kB pagetables:20956kB bounce:0kB free_pcp:1304kB local_pcp:384kB free_cma:0kB
Apr 13 07:29:59.618933 r-anaconda-15 WARNING kernel: [ 5578.086316] lowmem_reserve[]: 0 0 0 0 0
Apr 13 07:29:59.618941 r-anaconda-15 WARNING kernel: [ 5578.086321] Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15884kB
Apr 13 07:29:59.618944 r-anaconda-15 WARNING kernel: [ 5578.086340] Node 0 DMA32: 462*4kB (UME) 379*8kB (UME) 471*16kB (UME) 467*32kB (UME) 213*64kB (UME) 59*128kB (UME) 6*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 50080kB
Apr 13 07:29:59.618946 r-anaconda-15 WARNING kernel: [ 5578.086359] Node 0 Normal: 3476*4kB (UME) 1148*8kB (UME) 1707*16kB (UME) 579*32kB (UME) 193*64kB (UME) 39*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86272kB
Apr 13 07:29:59.618948 r-anaconda-15 INFO kernel: [ 5578.086376] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Apr 13 07:29:59.618950 r-anaconda-15 INFO kernel: [ 5578.086378] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Apr 13 07:29:59.618952 r-anaconda-15 WARNING kernel: [ 5578.086379] 1013010 total pagecache pages
Apr 13 07:29:59.618960 r-anaconda-15 WARNING kernel: [ 5578.086380] 0 pages in swap cache
Apr 13 07:29:59.618962 r-anaconda-15 WARNING kernel: [ 5578.086382] Swap cache stats: add 0, delete 0, find 0/0
Apr 13 07:29:59.618964 r-anaconda-15 WARNING kernel: [ 5578.086383] Free swap  = 0kB
Apr 13 07:29:59.618966 r-anaconda-15 WARNING kernel: [ 5578.086383] Total swap = 0kB
Apr 13 07:29:59.618969 r-anaconda-15 WARNING kernel: [ 5578.086384] 2070461 pages RAM
Apr 13 07:29:59.618971 r-anaconda-15 WARNING kernel: [ 5578.086385] 0 pages HighMem/MovableOnly
Apr 13 07:29:59.618986 r-anaconda-15 WARNING kernel: [ 5578.086386] 61060 pages reserved
Apr 13 07:29:59.618988 r-anaconda-15 WARNING kernel: [ 5578.086387] 0 pages hwpoisoned

Specifically check the line, when the sdk tried to allocate bulk buffer, there weren't any pages of sizes (256kb+) available because the techsupport process had to copy and compress these huge coredump files.

Apr 13 07:29:59.618946 r-anaconda-15 WARNING kernel: [ 5578.086359] Node 0 Normal: 3476*4kB (UME) 1148*8kB (UME) 1707*16kB (UME) 579*32kB (UME) 193*64kB (UME) 39*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86272kB

Triage

This is not a common event since core-dumps doesn't tend to be this big. However, the ulimit is set to unlimited in sonic and thus there is no restriction on how large the dumps can be and the page allocation failure might happen to any process which is using huge pages.

Also, since techsupport can be trigged in the background by auto-techsupport and not just manually, I think the coredumps size has to be restricted to a reasonable value

Describe the results you expected:

No Problems because of techsupport running

Output of `show version`:

(paste your output here)

Output of `show techsupport`:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

sysdump_sonic_dump_r-anaconda-15_20230413_073825.tar.gz

arlakshm commented 1 year ago

@vivekrnv to confirm is this problem is during the zip in techsupport generation.

SavchukRomanLv commented 1 year ago

Hi @arlakshm any plans to fix this? Thank you!

bingwang-ms commented 10 months ago

Saw similar error logs in the techsupport test. Please check if it's safe to ignore.

Dec  3 16:22:31.639554 str-msn2700-06 ERR syncd#SDK: [SAI_OBJECT.ERR] ./src/mlnx_sai_object.c[787]- mlnx_wait_for_bulk_read_event: Failed to wait for an event: Connection timed out. 
Dec  3 16:22:31.639871 str-msn2700-06 ERR syncd#SDK: [SAI_PORT.ERR] ./src/mlnx_sai_port.c[6831]- mlnx_get_port_stats_ext: Failed to prepare bulk counter for port stats. 
Dec  3 16:22:31.640102 str-msn2700-06 ERR syncd#SDK: :- collectData: Failed to get stats of Port Counter 0x1003900000001: -1

sonic-net / sonic-buildimage