sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[sonic-snmpagent] AgentX TCP Connection is being terminated when blocking=True arg is set #10310

Open vivekrnv opened 2 years ago

vivekrnv commented 2 years ago

Description

When blocking=True is used and the data is not available in Redis, the data-fetching coroutines block for long periods and starve the coroutine that maintains the TCP connection to the AgentX socket. The connection is then terminated, which eventually causes SNMP queries to fail.
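The starvation described above can be reproduced in isolation. The following is a minimal sketch, not the subagent's code: `heartbeat` is a hypothetical stand-in for the coroutine that services the AgentX socket, and `blocking_updater` stands in for an updater making a synchronous blocking read (simulated here with `time.sleep`).

```python
import asyncio
import time


async def heartbeat(ticks, interval=0.01):
    """Hypothetical stand-in for the coroutine that keeps the socket alive."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def blocking_updater(delay):
    """Hypothetical updater: a synchronous call that never yields to the loop."""
    time.sleep(delay)  # blocks the whole event loop for `delay` seconds


async def demo(delay):
    ticks = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0.05)      # heartbeat runs freely here
    await blocking_updater(delay)  # heartbeat cannot run during this call
    await asyncio.sleep(0.05)      # heartbeat resumes
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    # Longest interval during which the heartbeat did not get to run
    return max(b - a for a, b in zip(ticks, ticks[1:]))


def longest_gap(delay=0.2):
    return asyncio.run(demo(delay))


if __name__ == "__main__":
    print(f"longest heartbeat gap: {longest_gap():.3f}s")
```

Although the heartbeat asks to run every 10 ms, its longest gap is at least as long as the blocking call, which is the same effect a long blocking Redis read would have on the connection coroutine.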

This SNMP query failure is also reported here: https://github.com/Azure/sonic-buildimage/issues/9996

Triage:

Mar 18 13:01:01.171667 qa-eth-vt05-1-2410 INFO snmp#snmp-subagent [ax_interface] INFO: Connection loop starting...
Mar 18 13:01:01.171667 qa-eth-vt05-1-2410 INFO snmp#snmp-subagent [ax_interface] INFO: Attempting AgentX socket bind...
Mar 18 13:05:02.917957 qa-eth-vt05-1-2410 INFO snmp#snmp-subagent [ax_interface] INFO: AgentX socket connection established. Initiating opening handshake...
Mar 18 13:06:03.310344 qa-eth-vt05-1-2410 INFO snmp#snmp-subagent [ax_interface] INFO: Sending open...
Mar 18 13:07:03.917140 qa-eth-vt05-1-2410 INFO snmp#snmp-subagent [ax_interface] INFO: AgentX session starting with ID: 8
Mar 18 13:08:04.081422 qa-eth-vt05-1-2410 INFO snmp#/supervisord: snmp-subagent socket.send() raised exception.
Mar 18 13:08:04.093848 qa-eth-vt05-1-2410 INFO snmp#snmp-subagent [ax_interface] INFO: AgentX socket connection closed.
Mar 18 13:08:04.094200 qa-eth-vt05-1-2410 ERR snmp#snmp-subagent [ax_interface] ERROR: [Errno 32] Broken pipe

It took about 4 minutes for the connection_routine to get from the socket bind attempt to an established connection (13:01:01 to 13:05:02), so the same delay is expected whenever the Transport coroutine has to handle and respond to incoming data. https://github.com/Azure/sonic-snmpagent/blob/master/src/ax_interface/socket_io.py#L149

I've verified this behavior by removing the Updater instances that throw the following exception,

Mar 18 13:05:02.871674 qa-eth-vt05-1-2410 ERR snmp#snmp-subagent [ax_interface] ERROR: MIBUpdater.start() caught an unexpected exception during update_data()
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/ax_interface/mib.py", line 37, in start
    self.reinit_data()
  File "/usr/local/lib/python3.7/dist-packages/sonic_ax_impl/mibs/ietf/rfc2863.py", line 128, in reinit_data
    self.vlan_oid_name_map = Namespace.get_sync_d_from_all_namespace(mibs.init_sync_d_vlan_tables, self.db_conn)
  File "/usr/local/lib/python3.7/dist-packages/sonic_ax_impl/mibs/__init__.py", line 651, in get_sync_d_from_all_namespace
    ns_tuple = per_namespace_func(db_conn)
  File "/usr/local/lib/python3.7/dist-packages/sonic_ax_impl/mibs/__init__.py", line 341, in init_sync_d_vlan_tables
    vlan_name_map = port_util.get_vlan_interface_oid_map(db_conn)
  File "/usr/local/lib/python3.7/dist-packages/swsssdk/port_util.py", line 167, in get_vlan_interface_oid_map
    rif_name_map = db.get_all('COUNTERS_DB', 'COUNTERS_RIF_NAME_MAP', blocking=True)
  File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 1751, in get_all
    return dict(super(SonicV2Connector, self).get_all(db_name, _hash, blocking))
  File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 1708, in get_all
    return _swsscommon.SonicV2Connector_Native_get_all(self, db_name, _hash, blocking)
RuntimeError: Key '{COUNTERS_RIF_NAME_MAP}' unavailable in database '{COUNTERS_DB}'

and the SNMP queries started to work.

Solution:

PR https://github.com/Azure/sonic-snmpagent/pull/246 fixes the issue temporarily, but as a long-term solution, all uses of the blocking=True argument in the subagent repo should be avoided.
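As a rough illustration of the long-term direction, the sketch below shows a non-blocking read that returns immediately when the key is missing. `FakeConnector` is a hypothetical stand-in written for this example, not the swsscommon API: its `get_all(db_name, _hash, blocking)` signature mirrors the traceback above, but the behavior (raising immediately when blocking=True, returning an empty dict when blocking=False) is an assumption made for illustration.

```python
class FakeConnector:
    """Hypothetical in-memory connector; behavior is assumed, not swsscommon's."""

    def __init__(self, data):
        self._data = data  # maps (db_name, hash_name) -> dict of fields

    def get_all(self, db_name, _hash, blocking=False):
        table = self._data.get((db_name, _hash))
        if table is None:
            if blocking:
                # Assumed for the sketch: fail once the key is deemed unavailable
                raise RuntimeError(
                    f"Key '{_hash}' unavailable in database '{db_name}'"
                )
            return {}  # assumed: a non-blocking read yields an empty result
        return dict(table)


def get_rif_name_map(db):
    """Read the map without blocking; the caller must tolerate an empty dict."""
    return db.get_all('COUNTERS_DB', 'COUNTERS_RIF_NAME_MAP', blocking=False)


if __name__ == "__main__":
    empty_db = FakeConnector({})
    print(get_rif_name_map(empty_db))
```

With this shape, a missing key costs the caller one quick read instead of a long wait, so the event loop stays free to service the AgentX socket.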

sonic_dump_qa-eth-vt05-1-2410_20220318_131013 (1).tar.gz

vivekrnv commented 2 years ago

@qiluo-msft, @SuvarnaMeenakshi Please check

zhangyanzhao commented 2 years ago

Mitigated for now. The long-term fix may require a new feature: make all calls default to blocking=False.

liat-grozovik commented 2 years ago

@qiluo-msft, @SuvarnaMeenakshi a kind reminder to review

qiluo-msft commented 2 years ago

The proposed solution seems to be heading in a good direction. It will not be extremely easy, because the existing code makes some assumptions about Redis data availability. Would you like to raise a PR for this solution?
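One way to relax the data-availability assumption mentioned above is for each updater to tolerate an empty result and retry on its next cycle. The sketch below is illustrative only: `VlanMapUpdater`, `fetch`, `ready`, and `lookup` are hypothetical names, not classes or methods from the subagent.

```python
class VlanMapUpdater:
    """Hypothetical updater that tolerates data missing from Redis."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable returning a dict, possibly empty
        self.oid_name_map = {}
        self.ready = False

    def reinit_data(self):
        data = self._fetch()
        if not data:
            # Data not available yet: leave state unchanged, retry next cycle
            return
        self.oid_name_map = data
        self.ready = True

    def lookup(self, oid):
        # Return None instead of raising while the map is unpopulated
        return self.oid_name_map.get(oid)


if __name__ == "__main__":
    responses = [{}, {"oid:0x6": "Vlan100"}]
    updater = VlanMapUpdater(lambda: responses.pop(0))
    updater.reinit_data()   # first cycle: nothing in Redis yet
    print(updater.ready, updater.lookup("oid:0x6"))
    updater.reinit_data()   # second cycle: data has arrived
    print(updater.ready, updater.lookup("oid:0x6"))
```

The design choice here is that a missing table degrades the answer (no entry yet) rather than stalling the process, at the cost of every consumer having to handle the "not ready" case.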

qiluo-msft commented 2 years ago

We fixed one of the blocking calls, but not all of them. https://github.com/Azure/sonic-snmpagent/pull/255