sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

Running show techsupport on devices with large core files might crash device when /tmp is on tmpfs #20950

Open assrinivasan opened 1 week ago

assrinivasan commented 1 week ago

Description

The show_techsupport/test_auto_techsupport.py::TestAutoTechSupport::test_max_limit[core] test creates very large core files. When /tmp is on tmpfs and available memory is low, this crashes the device.

Steps to reproduce the issue:

  1. Mount /tmp as tmpfs
  2. Run tests/show_techsupport/test_auto_techsupport.py::TestAutoTechSupport::test_max_limit[core] on KVM
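
To confirm the precondition in step 1, a check along the following lines can tell whether /tmp is backed by tmpfs. This is a minimal sketch that parses text in /proc/mounts format. The helper name fs_type_for and the sample mount table are hypothetical (the "overlay" type for / is an assumption), and none of this is part of sonic-mgmt.

```python
# Illustrative mount table in /proc/mounts format.
# The "overlay" type for / is an assumption, not taken from the issue.
SAMPLE_MOUNTS = """\
root-overlay / overlay rw 0 0
tmpfs /run tmpfs rw,nosuid 0 0
tmpfs /tmp tmpfs rw 0 0
"""


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the deepest mount point containing path."""
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed lines
        mount_point, fs_type = fields[1], fields[2]
        # Compare with a trailing slash so "/tmp" does not match "/tmpfoo".
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or (path.rstrip("/") + "/").startswith(prefix):
            # ">=" lets a later mount on the same point override an earlier one.
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type


print(fs_type_for("/tmp", SAMPLE_MOUNTS))      # tmpfs
print(fs_type_for("/var/tmp", SAMPLE_MOUNTS))  # overlay
```

On a real device the same function could be fed the contents of /proc/mounts instead of the sample text.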

Describe the results you received:

Filesystem information and free memory while the test was running:

Filesystem      Size  Used Avail Use% Mounted on
udev            1.9G     0  1.9G   0% /dev
tmpfs           385M   17M  369M   5% /run
root-overlay     16G  9.0G  6.5G  59% /
/dev/vda3        16G  9.0G  6.5G  59% /host
tmpfs           1.9G  1.3G  638M  67% /tmp
/dev/loop1      3.9G  5.0M  3.7G   1% /var/log
tmpfs           1.9G   16K  1.9G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           4.0M     0  4.0M   0% /sys/fs/cgroup
               total        used        free      shared  buff/cache   available
Mem:            3845        3750         107        1306        1531          95
Swap:              0           0           0

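Because tmpfs is backed by RAM and the device has no swap, the large core files written under /tmp consume memory directly: the 1.3G used on /tmp shows up as 1306 in the shared column of the memory output, and only 95 remains available. Once memory is exhausted the system crashes, which leaves the DUT unreachable: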

27/11/2024 06:24:39 __init__._fixture_generator_decorator    L0099 ERROR  | Host unreachable in the inventory
Traceback (most recent call last):
  File "/var/src/sonic-mgmt/tests/common/plugins/log_section_start/__init__.py", line 95, in _fixture_generator_decorator
    next(it)
  File "/var/src/sonic-mgmt/tests/show_techsupport/test_auto_techsupport.py", line 125, in global_rate_limit_zero
    set_auto_techsupport_global(self.duthost, rate_limit=DEFAULT_RATE_LIMIT_GLOBAL)
  File "/var/src/sonic-mgmt/tests/show_techsupport/test_auto_techsupport.py", line 564, in set_auto_techsupport_global
    duthost.shell(cmd)
  File "/var/src/sonic-mgmt/tests/common/devices/multi_asic.py", line 135, in _run_on_asics
    return getattr(self.sonichost, self.multi_asic_attr)(*module_args, **complex_args)
  File "/var/src/sonic-mgmt/tests/common/devices/base.py", line 105, in _run
    res = self.module(*module_args, **complex_args)[self.hostname]
  File "/usr/local/lib/python3.8/dist-packages/pytest_ansible/module_dispatcher/v213.py", line 232, in _run
    raise AnsibleConnectionFailure(
pytest_ansible.errors.AnsibleConnectionFailure: Host unreachable in the inventory

Describe the results you expected:

show techsupport completes without crashing the device, and the test passes.

Additional information you deem important (e.g. issue happens only occasionally):

Related issue: https://github.com/sonic-net/sonic-buildimage/issues/15051

prabhataravind commented 11 hours ago

We need a way for "show tech" to be aware of the resources available on the system.
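
One possible reading of "aware of the resources": before writing a dump, compare its estimated size with what the target can actually hold. On tmpfs the filesystem's own free space is not enough, since the output above shows /tmp with 638M available while only 95 of memory remained. The following is a minimal sketch, not how show techsupport works today. The helpers mem_available_bytes and safe_to_write are hypothetical, and the example figures treat the memory values above as MiB, which is an assumption.

```python
MIB = 1024 * 1024


def mem_available_bytes(meminfo_text):
    """Parse MemAvailable (reported in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) * 1024
    return None  # field not present


def safe_to_write(needed, fs_free, on_tmpfs, mem_available, reserve=0):
    """Return True if `needed` bytes fit while keeping `reserve` bytes spare.

    On tmpfs, every byte written also consumes RAM, so the budget is
    capped by available memory as well as by the filesystem's free space.
    """
    budget = min(fs_free, mem_available) if on_tmpfs else fs_free
    return needed + reserve <= budget


# A hypothetical 200 MiB dump: /tmp on tmpfs versus a disk-backed directory.
print(safe_to_write(200 * MIB, 638 * MIB, True, 95 * MIB))    # False
print(safe_to_write(200 * MIB, 6500 * MIB, False, 95 * MIB))  # True
```

The reserve parameter is there so a caller can keep some headroom for the rest of the system; a suitable value would need to be chosen per platform.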

assrinivasan commented 10 hours ago

show techsupport could generate SONiC dumps in /var/tmp, which is on disk, instead of /tmp, which may be tmpfs. This would resolve the issue. @prabhataravind @prgeor @saiarcot895
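
A variant of this proposal would keep /tmp where it is disk-backed and fall back to /var/tmp only when /tmp is RAM-backed. A rough sketch of that selection follows; pick_dump_dir and the filesystem-type mapping passed to it are hypothetical and not existing SONiC code.

```python
RAM_BACKED = {"tmpfs", "ramfs"}


def pick_dump_dir(fs_types, candidates=("/tmp", "/var/tmp")):
    """Return the first candidate whose filesystem type is known and disk-backed.

    fs_types maps a directory to its filesystem type, e.g. {"/tmp": "tmpfs"}.
    Returns None if no candidate qualifies.
    """
    for directory in candidates:
        fs_type = fs_types.get(directory)
        if fs_type is not None and fs_type not in RAM_BACKED:
            return directory
    return None


print(pick_dump_dir({"/tmp": "tmpfs", "/var/tmp": "overlay"}))  # /var/tmp
print(pick_dump_dir({"/tmp": "ext4", "/var/tmp": "ext4"}))      # /tmp
```

Always writing to /var/tmp, as suggested above, is the simpler option; the fallback form only matters if some platforms prefer to keep dumps in /tmp.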