Open vivekrnv opened 1 year ago
@vivekrnv, please confirm whether this problem occurs during the zip step of techsupport generation.
Hi @arlakshm any plans to fix this? Thank you!
Saw similar error logs in the techsupport test. Please check whether they are safe to ignore.
Dec 3 16:22:31.639554 str-msn2700-06 ERR syncd#SDK: [SAI_OBJECT.ERR] ./src/mlnx_sai_object.c[787]- mlnx_wait_for_bulk_read_event: Failed to wait for an event: Connection timed out.
Dec 3 16:22:31.639871 str-msn2700-06 ERR syncd#SDK: [SAI_PORT.ERR] ./src/mlnx_sai_port.c[6831]- mlnx_get_port_stats_ext: Failed to prepare bulk counter for port stats.
Dec 3 16:22:31.640102 str-msn2700-06 ERR syncd#SDK: :- collectData: Failed to get stats of Port Counter 0x1003900000001: -1
Description
When techsupport moves and compresses very large coredumps, processes can briefly hit page allocation failures when they try to allocate larger memory blocks.
Steps to reproduce the issue:
Describe the results you received:
Page allocation failures might be seen in syncd during this time.
Specifically, check the line where the SDK tried to allocate a bulk buffer: no pages of the required size (256 kB+) were available because the techsupport process had to copy and compress these huge coredump files.
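One way to check whether large contiguous blocks are actually running out is to look at the kernel's buddy allocator statistics. The sketch below is only an illustration, not something from the original report: it parses text in the /proc/buddyinfo format and counts free blocks large enough for a 256 kB request. The helper names and the 4 kB base page size are assumptions.

```python
import math

PAGE_SIZE_KB = 4  # assumption: 4 kB base pages on the target platform


def min_order_for(size_kb, page_kb=PAGE_SIZE_KB):
    """Smallest buddy order whose block size is at least size_kb."""
    pages = max(1, math.ceil(size_kb / page_kb))
    return (pages - 1).bit_length()


def free_blocks(buddyinfo_text, min_order):
    """Map (node, zone) to the number of free blocks of order >= min_order."""
    result = {}
    for line in buddyinfo_text.splitlines():
        parts = line.replace(",", " ").split()
        # Expected shape: Node <n> zone <name> <order-0 count> <order-1 count> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        counts = [int(c) for c in parts[4:]]
        result[(int(parts[1]), parts[3])] = sum(counts[min_order:])
    return result


# Hypothetical sample in /proc/buddyinfo format (not taken from the affected switch).
SAMPLE = (
    "Node 0, zone      DMA      1      1      0      0      2      1      1      0      1      1      3\n"
    "Node 0, zone   Normal   4381   2509    100     20      5      2      0      0      0      0      0\n"
)

order = min_order_for(256)         # 256 kB / 4 kB = 64 pages, i.e. order 6
print(free_blocks(SAMPLE, order))  # the Normal zone has no block of 256 kB or more
```

On a live device, passing the contents of /proc/buddyinfo captured while techsupport runs would show whether the higher orders drain to zero during the copy and compress phase.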
Triage
This is not a common event, since coredumps don't tend to be this big. However, the ulimit is set to unlimited in SONiC, so there is no restriction on how large the dumps can be, and a page allocation failure might happen to any process that is using huge pages.
Also, since techsupport can be triggered in the background by auto-techsupport and not just manually, I think the coredump size has to be restricted to a reasonable value.
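To make the ulimit point concrete, here is a minimal sketch of inspecting and lowering the core file size limit (RLIMIT_CORE) with Python's standard resource module. It only affects the calling process and its children, and it is not a proposal for how SONiC should enforce the cap. The 1 GiB value and the helper names are hypothetical; the issue does not name a specific limit.

```python
import resource

HYPOTHETICAL_CAP_BYTES = 1 << 30  # assumption: 1 GiB, chosen only for illustration


def capped_soft_limit(soft, hard, max_bytes):
    """Return the soft limit to apply: at most max_bytes, never above hard."""
    inf = resource.RLIM_INFINITY
    candidates = [max_bytes]
    if soft != inf:
        candidates.append(soft)  # never raise an already lower soft limit
    if hard != inf:
        candidates.append(hard)  # the soft limit may not exceed the hard limit
    return min(candidates)


def cap_core_limit(max_bytes=HYPOTHETICAL_CAP_BYTES):
    """Lower this process's core size soft limit and return (soft, hard)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    new_soft = capped_soft_limit(soft, hard, max_bytes)
    resource.setrlimit(resource.RLIMIT_CORE, (new_soft, hard))
    return new_soft, hard
```

Calling cap_core_limit() early in a process leaves the hard limit untouched, so lowering the soft limit this way does not need extra privileges.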
Describe the results you expected:
No problems caused by techsupport running.
Output of show version:
Output of show techsupport:
Additional information you deem important (e.g. issue happens only occasionally):
sysdump_sonic_dump_r-anaconda-15_20230413_073825.tar.gz