sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
737 stars 1.43k forks source link

[chassis] Too many open files error and unable to connect to redis socket error #10870

Closed SuvarnaMeenakshi closed 2 years ago

SuvarnaMeenakshi commented 2 years ago

Description

Steps to reproduce the issue:

  1. Boot up chassis and leave SNMP service running for a few days (seeing issue after 5 days of service being up)
  2. Observe the below logs in syslog in both Supervisor and Linecards.
    • Unable to connect to redis
    • Too many files open
  3. No functionality issue seen

Describe the results you received:

Error logs seen

 snmp#snmp-subagent [ax_interface] ERROR: MIBUpdater.start() caught an unexpected exception during update_data()#012Traceback (most recent call last):#012  File "/usr/local/lib/python3.7/dist-packages/ax_interface/mib.py", line 37, in start#012    self.reinit_data()#012  File "/usr/local/lib/python3.7/dist-packages/sonic_ax_impl/mibs/ietf/rfc4363.py", line 51, in reinit_data#012    mibs.init_sync_d_interface_tables, self.db_conn#012  File "/usr/local/lib/python3.7/dist-packages/sonic_ax_impl/mibs/__init__.py", line 653, in get_sync_d_from_all_namespace#012    ns_tuple = per_namespace_func(db_conn)#012  File "/usr/local/lib/python3.7/dist-packages/sonic_ax_impl/mibs/__init__.py", line 274, in init_sync_d_interface_tables#012    if_name_map_util, if_id_map_util = port_util.get_interface_oid_map(db_conn, blocking=False)#012  File "/usr/local/lib/python3.7/dist-packages/swsssdk/port_util.py", line 68, in get_interface_oid_map#012    db.connect('COUNTERS_DB')#012  File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 1672, in connect#012    return _swsscommon.SonicV2Connector_Native_connect(self, db_name, retry_on)#012RuntimeError: Unable to connect to redis (unix-socket): Cannot assign requested address

 snmp#snmp-subagent [ax_interface] ERROR: MIBUpdater.start() caught an unexpected exception during update_data()#012Traceback (most recent call last):#012  File "/usr/local/lib/python3.7/dist-packages/ax_interface/mib.py", line 43, in start#012    self.update_data()#012  File "/usr/local/lib/python3.7/dist-packages/sonic_ax_impl/mibs/vendor/__init__.py", line 57, in update_data#012    cpu_util = psutil.cpu_percent()#012  File "/usr/local/lib/python3.7/dist-packages/psutil/__init__.py", line 1755, in cpu_percent#012    _last_cpu_times = cpu_times()#012  File "/usr/local/lib/python3.7/dist-packages/psutil/__init__.py", line 1616, in cpu_times#012    return _psplatform.cpu_times()#012  File "/usr/local/lib/python3.7/dist-packages/psutil/_pslinux.py", line 604, in cpu_times#012    with open_binary('%s/stat' % procfs_path) as f:#012  File "/usr/local/lib/python3.7/dist-packages/psutil/_common.py", line 711, in open_binary#012    return open(fname, "rb", **kwargs)#012OSError: [Errno 24] Too many open files: '/proc/stat'

 ERR snmp#snmp-subagent [ax_interface] ERROR: MIBUpdater.start() caught an unexpected exception during update_data()#012Traceback (most recent call last):#012  File "/usr/local/lib/python3.7/dist-packages/ax_interface/mib.py", line 37, in start#012    self.reinit_data()#012  File "/usr/local/lib/python3.7/dist-packages/sonic_ax_impl/mibs/ietf/rfc1213.py", line 715, in reinit_data#012    self.db_conn.connect(self.db_conn.CONFIG_DB)#012  File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 1672, in connect#012    return _swsscommon.SonicV2Connector_Native_connect(self, db_name, retry_on)#012RuntimeError: Unable to connect to redis: Cannot assign requested address

ERR snmp#snmp-subagent [ax_interface] ERROR: MIBUpdater.start() caught an unexpected exception during update_data()#012Traceback (most recent call last):#012  File "/usr/local/lib/python3.7/dist-packages/ax_interface/mib.py", line 43, in start#012    self.update_data()#012  File "/usr/local/lib/python3.7/dist-packages/sonic_ax_impl/mibs/ietf/rfc4292.py", line 55, in update_data#012    front_ns = multi_asic.get_all_namespaces()['front_ns']#012  File "/usr/local/lib/python3.7/dist-packages/sonic_py_common/multi_asic.py", line 239, in get_all_namespaces#012    num_asics = get_num_asics()#012  File "/usr/local/lib/python3.7/dist-packages/sonic_py_common/multi_asic.py", line 119, in get_num_asics#012    with open(asic_conf_file_path) as asic_conf_file:#012OSError: [Errno 24] Too many open files: '/usr/share/sonic/platform/asic.conf'

Describe the results you expected:

Should not see any errors in syslog

Output of show version:

With master branch.

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

SuvarnaMeenakshi commented 2 years ago

@abdosi fyi

abdosi commented 2 years ago

seen on other platforms also

liuh-80 commented 2 years ago

Get object list with multi-asic VS:

{"list": 550, "dict": 3186, "tuple": 2608, "coroutine": 46, "frame": 45, "Future": 22, "TimerHandle": 21, "Context": 92, "FutureIter": 22, "TaskWakeupMethWrapper": 22, "scputimes": 12, "svmem": 1, "wrapper_descriptor": 1184, "builtin_function_or_method": 1588, "method_descriptor": 947, "classmethod_descriptor": 27, "getset_descriptor": 925, "member_descriptor": 382, "weakref": 1435, "staticmethod": 180, "module": 276, "SimpleNamespace": 1, "function": 6527, "type": 831, "ModuleSpec": 273, "property": 547, "classmethod": 193, "cell": 529, "frozenset": 101, "FileFinder": 31, "set": 169, "SourceFileLoader": 212, "CodecInfo": 2, "ABCMeta": 78, "TextIOWrapper": 3, "IncrementalDecoder": 1, "IncrementalEncoder": 2, "BufferedReader": 1, "FileIO": 3, "BufferedWriter": 2, "_Environ": 2, "Quitter": 2, "_Printer": 3, "_Helper": 1, "Repr": 1, "method": 163, "itemgetter": 37, "mappingproxy": 1, "WeakKeyDictionary": 2, "count": 3, "EnumMeta": 54, "DynamicClassAttribute": 2, "OrderedDict": 60, "_NamedIntConstant": 74, "RegexFlag": 15, "_lru_cache_wrapper": 7, "JSONEncoder": 1, "JSONDecoder": 1, "Scanner": 1, "ExtensionFileLoader": 17, "_TemplateMetaclass": 1, "method-wrapper": 2, "WeakSet": 4, "_MainThread": 1, "Event": 6, "Formatter": 2, "WeakValueDictionary": 2, "_StderrHandler": 1, "Condition": 2, "PercentStyle": 2, "deque": 8, "RootLogger": 1, "Manager": 1, "AddressFamily": 32, "SocketKind": 7, "MsgFlag": 14, "AddressInfo": 7, "Logger": 5, "NullHandler": 1, "Signals": 32, "Handlers": 2, "Sigmasks": 3, "_Feature": 10, "_SSLMethod": 6, "Options": 12, "AlertDescription": 27, "SSLErrorNumber": 9, "VerifyFlags": 5, "VerifyMode": 3, "TLSVersion": 7, "Purpose": 2, "_ASN1Object": 2, "StrictVersion": 1, "UUID": 4, "SafeUUID": 3, "PlaceHolder": 1, "_SendfileMode": 3, "_ParameterKind": 5, "_RunningLoop": 1, "Random": 1, "IPv4Address": 31, "IPv4Network": 19, "IPv6Network": 28, "ValueType": 13, "PduTypes": 19, "IPv6Address": 43, "RegisteredPDU": 13, "Reason": 6, "Errors": 32, "_WindowsFlavour": 1, "_PosixFlavour": 1, "_NormalAccessor": 1, "attrgetter": 2, "_NSEnum": 1, "methodcaller": 1, "partial": 7, "YAMLObjectMetaclass": 1, "SysLogHandler": 1, "socket": 4, "Load": 1, "Store": 1, "Del": 1, "AugLoad": 1, "AugStore": 1, "Param": 1, "And": 1, "Or": 1, "Add": 1, "Sub": 1, "Mult": 1, "MatMult": 1, "Div": 1, "Mod": 1, "Pow": 1, "LShift": 1, "RShift": 1, "BitOr": 1, "BitXor": 1, "BitAnd": 1, "FloorDiv": 1, "Invert": 1, "Not": 1, "UAdd": 1, "USub": 1, "Eq": 1, "NotEq": 1, "Lt": 1, "LtE": 1, "Gt": 1, "GtE": 1, "Is": 1, "IsNot": 1, "In": 1, "NotIn": 1, "_UnixDefaultEventLoopPolicy": 1, "hamt": 1, "hamt_bitmap_node": 1, "MIBMeta": 20, "_Local": 1, "LLDPRemoteTables": 12, "LLDPLocalChassis": 6, "LLDPLocalSystemDataUpdater": 1, "MIBEntry": 12, "LLDPRemTableUpdater": 1, "SubtreeMIBEntry": 92, "_UnixSelectorEventLoop": 1, "LocPortUpdater": 1, "LLDPLocManAddrUpdater": 1, "SonicV2Connector": 93, "EpollSelector": 1, "_SelectorMapping": 1, "SelectorKey": 2, "Handle": 4, "DbTables": 24, "IfTypes": 3, "ArpUpdater": 1, "NextHopUpdater": 1, "OverlayAdpaterMIBEntry": 24, "sysNameUpdater": 1, "PhysicalClass": 12, "FanInfoDB": 5, "FanDrawerInfoDB": 4, "PhysicalRelationInfoDB": 4, "PsuInfoDB": 8, "XcvrInfoDB": 6, "ThermalInfoDB": 2, "OidMIBEntry": 24, "PhysicalTableMIBUpdater": 1, "DbTables32": 4, "DbTables64": 8, "InterfaceMIBUpdater": 1, "RedisOidTreeUpdater": 2, "ThermalCacheUpdater": 1, "FanCacheUpdater": 1, "FanDrawerCacheUpdater": 1, "PsuCacheUpdater": 1, "XcvrCacheUpdater": 1, "EntitySensorDataScale": 17, "EntitySensorStatus": 3, "EntitySensorValueRange": 2, "PhysicalSensorTableMIBUpdater": 1, "RouteUpdater": 1, "FdbUpdater": 1, "NicDuplex": 3, "BatteryTime": 2, "_WrapNumbers": 1, "IOPriority": 4, "LLDPRemManAddrUpdater": 1, "InterfacesUpdater": 1, "EntitySensorDataType": 12, "Negsignal": 32, "Connections": 1, "SystemUtilizationHandler": 1, "BgpSessionUpdater": 1, "Agent": 1, "MIBTable": 1, "SocketManager": 1, "Task": 22, "_GatheringFuture": 1, "ThreadPoolExecutor": 1, "SimpleQueue": 1, "Thread": 1, "DirectionTypes": 2, "QueueStatUpdater": 1, "CHASSISInfoDB": 1, "PSUInfoDB": 2, "PowerStatusHandler": 1, "PubSub": 35, "Codec": 2, "ConfigDBConnector": 4, "AgentX": 1, "_SelectorSocketTransport": 1, "KeyedRef": 1, "PfcUpdater": 1, "PfcPrioUpdater": 1}

ANd there are 1000+ ''anon_inode:[eventpoll]' and 600+ 'socket' opened.

For epoll is related with PubSub, for socket it related with DBConnector, however the python object count is normal.

Also following test code also show similer issue: from swsscommon.swsscommon import SonicDBConfig SonicDBConfig.load_sonic_global_db_config() SonicDBConfig.get_ns_list() from swsscommon.swsscommon import SonicV2Connector db = SonicV2Connector(use_unix_socket_path=True, namespace='') db.connect("CONFIG_DB") db.get_redis_client("CONFIG_DB") db.connect("CONFIG_DB") db.get_redis_client("CONFIG_DB") client = db.get_redis_client("CONFIG_DB") client.pubsub() <== run multiple times the epool keeps increase but PubSub object destoryed

So, consider about the C++ object not destoryed by SWIG generated code. will add log to confirm this.

liuh-80 commented 2 years ago

This issue is because some method which return new object need declare as %newobject in SWIG: https://www.swig.org/Doc4.0/SWIGDocumentation.html#Customization_ownership

Verifyed on local devbox, will submit PR and validate if the fix can work with chassis

liuh-80 commented 2 years ago

The issue can be reproduce with following code: from swsscommon.swsscommon import SonicDBConfig from swsscommon.swsscommon import SonicV2Connector import gc

SonicDBConfig.load_sonic_global_db_config() SonicDBConfig.get_ns_list() db = SonicV2Connector(use_unix_socket_path=True, namespace='') db.connect("CONFIG_DB") db.get_redis_client("CONFIG_DB") client = db.get_redis_client("CONFIG_DB") client.pubsub() client.pubsub() client.pubsub() client.newConnector(0) client.newConnector(0) client.newConnector(0)

gc.collect()

And there will be 3 epoll and 4 socket opened in python. After fix this issue, ther will only 1 socket opened.

liuh-80 commented 2 years ago

Found another memory leak issue in swsscommon:

void ConfigDBConnector_Native::db_connect(string db_name, bool wait_for_init, bool retry_on) { m_db_name = db_name; m_key_separator = m_table_name_separator = get_db_separator(db_name); SonicV2Connector_Native::connect(m_db_name, retry_on);

if (wait_for_init)
{
    auto& client = get_redis_client(m_db_name);
    auto pubsub = client.pubsub(); <== this is a pointer but not delete.
SuvarnaMeenakshi commented 2 years ago

https://github.com/Azure/sonic-swss-common/pull/655 checked with the sonic-buildimage multi-asic VS image with your latest swss-common PR, I see that the fd count is not going above this range; the issue that we were seeing in SNMP is fixed with the fix

root@vlab-08:/# ls -l /proc/24/fd | wc -l 650 root@vlab-08:/# ls -l /proc/24/fd | wc -l 650 root@vlab-08:/# ls -l /proc/24/fd | wc -l 626 root@vlab-08:/# ls -l /proc/24/fd | wc -l 626