sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

redis omem leaking issue on T2 supervisor #20680

Open sdszhang opened 1 week ago

sdszhang commented 1 week ago

Description

We are seeing a memory leak on the T2 Supervisor when running the nightly test: redis memory keeps increasing until it fails the sanity_check in sonic-mgmt.

Following is one of the logs where memory exceeded the sanity_check threshold.

06/10/2024 05:26:43 checks._check_dbmemory_on_dut            L0303 INFO   | asic0 db memory over the threshold 
06/10/2024 05:26:43 checks._check_dbmemory_on_dut            L0304 INFO   | asic0 db memory omem non-zero output: 
id=1367 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=262 name= age=36559 idle=20085 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=29 omem=594616 tot-mem=597464 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1368 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=263 name= age=36559 idle=19157 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=15 omem=307560 tot-mem=310408 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1369 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=264 name= age=36559 idle=20717 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=57 omem=1168728 tot-mem=1171576 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1370 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=265 name= age=36559 idle=21043 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=464 omem=9513856 tot-mem=9516704 events=rw cmd=psubscribe user=default redir=-1 resp=2
06/10/2024 05:26:43 checks._check_dbmemory_on_dut            L0307 INFO   | Done checking database memory on svcstr2-8800-sup-1

06/10/2024 05:26:45 parallel.parallel_run                    L0221 INFO   | Completed running processes for target "_check_dbmemory_on_dut" in 0:00:02.809825 seconds
06/10/2024 05:26:45 __init__.do_checks                       L0120 DEBUG  | check results of each item [{'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc1-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc2-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc3-1', 'total_omem': 0}, {'failed': True, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-sup-1', 'total_omem': 11584760}]
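For context, the total_omem reported above is effectively the sum of the omem= field across all entries in redis's CLIENT LIST output. A minimal sketch of that computation (a hypothetical helper for illustration, not the actual sonic-mgmt check code):

```python
def total_omem(client_list_output: str) -> int:
    """Sum the omem= field across all client entries.

    `client_list_output` is the raw text of `redis-cli client list`,
    one client entry per line, fields formatted as key=value.
    """
    total = 0
    for line in client_list_output.splitlines():
        for field in line.split():
            if field.startswith("omem="):
                total += int(field.split("=", 1)[1])
    return total
```

Applied to the four entries in the log above (594616 + 307560 + 1168728 + 9513856), this yields 11584760, matching the reported total_omem for svcstr2-8800-sup-1.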

The memory leak is seen after running any one of the following 3 modules. Once total_omem becomes non-zero, it keeps increasing until it exceeds the threshold.

system_health/test_system_health.py
platform_tests/api/test_fan_drawer_fans.py
platform_tests/test_platform_info.py

Steps to reproduce the issue:

  1. Run full nightly test on a T2 testbed.

Describe the results you received:

The testbed fails the sanity check because omem is over the threshold after running the nightly test on a T2 testbed.

Describe the results you expected:

redis omem should be released once the client output buffer is drained; it should not keep increasing.

Output of show version:

admin@svcstr2-8800-sup-1:~$ show version

SONiC Software Version: SONiC.jianquan.cicso.202405.08
SONiC OS Version: 12
Distribution: Debian 12.6
Kernel: 6.1.0-22-2-amd64
Build commit: b60548f2f6
Build date: Fri Nov  1 11:20:02 UTC 2024
Built by: azureuser@00df58e3c000000

Platform: x86_64-8800_rp-r0
HwSKU: Cisco-8800-RP
ASIC: cisco-8000
ASIC Count: 10
Serial Number: FOC2545N2CA
Model Number: 8800-RP
Hardware Revision: 1.0
Uptime: 00:50:48 up 14:22,  3 users,  load average: 13.28, 12.07, 11.65
Date: Mon 04 Nov 2024 00:50:48

Output of show techsupport:

When running the system_health/test_system_health.py test, at the beginning of the test:

05/10/2024 23:23:32 __init__.do_checks                       L0120 DEBUG  | check results of each item [{'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc1-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc2-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc3-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-sup-1', 'total_omem': 0}]

At the end of the test:

06/10/2024 00:02:03 __init__.do_checks                       L0120 DEBUG  | check results of each item [{'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc1-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc2-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc3-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-sup-1', 'total_omem': 861168}]

This symptom is observed for all 3 test cases so far:

system_health/test_system_health.py
platform_tests/api/test_fan_drawer_fans.py
platform_tests/test_platform_info.py
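Worth noting: every non-zero omem entry in the failing log above is a psubscribe client with a large idle time, which suggests a pub/sub subscriber that has stopped draining its channel, so redis keeps buffering pending messages for it. A hedged sketch for picking out such clients (hypothetical helper; the input shape follows redis-py's client_list(), a list of dicts with string field values):

```python
def leaking_pubsub_clients(clients, threshold=0):
    """Return pub/sub clients whose output buffer (omem) exceeds threshold.

    `clients` mimics redis-py's client_list() return value: a list of
    dicts whose values are strings, e.g. {"cmd": "psubscribe",
    "omem": "594616", ...}.
    """
    return [
        c for c in clients
        if c.get("cmd") == "psubscribe" and int(c.get("omem", 0)) > threshold
    ]
```

With a live connection this could be fed from redis.Redis(unix_socket_path="/var/run/redis/redis.sock").client_list() to identify which daemon's subscriber connection is accumulating the buffer.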

Additional information you deem important (e.g. issue happens only occasionally):

arlakshm commented 6 days ago

@anamehra, @abdosi, can you please help triage this issue?