sonic-net / sonic-swss-common

Common components for SONiC switch state service
Other
46 stars 265 forks source link

Too many FD events on modifying PORT table in CONFIG_DB #653

Closed prgeor closed 2 years ago

prgeor commented 2 years ago

Xcvrd listens to configuration changes here. Steps to reproduce:

  1. Set any field in PORT table of CONFIG_DB as:- redis-cli -n 4 hset "PORT|Ethernet0" "tx_power" "-13"

  2. Add a print here to output the 'fvp' values and you find more than 6 events are received by Xcvrd

Jul 22 20:05:09.335091 sonic WARNING pmon#xcvrd[204620]: $$$handle_port_update_even(): {'admin_status': 'down', 'alias': 'Ethernet1/1', 'index': '1', 'lanes': '1,2,3,4,5,6,7,8', 'laser_freq': '193100', 'speed': '400000', 'tx_power': '-13', 'description': '', 'oper_status': 'up', 'mtu': '9100'}
Jul 22 20:05:09.342073 sonic WARNING pmon#xcvrd[204620]: message repeated 6 times: [ $$$handle_port_update_even(): {'admin_status': 'down', 'alias': 'Ethernet1/1', 'index': '1', 'lanes': '1,2,3,4,5,6,7,8', 'laser_freq': '193100', 'speed': '400000', 'tx_power': '-13', 'description': '', 'oper_status': 'up', 'mtu': '9100'}]

Clearly there are duplicates events pushed to Xcvrd which can impact performance of datapath initliazation especially during 400G link bring up sequence.

prgeor commented 2 years ago

@qiluo-msft, @hui-ma please check this issue

prsunny commented 2 years ago

Is this a regression?

liuh-80 commented 2 years ago

ACT, will start investigate this issue.

liuh-80 commented 2 years ago

Can't reproduce this issue with latest master branch image because pmon docker failed with following error:

Jul 25 05:01:20.160934 vlab-01 ERR pmon#chassis_db_init: Failed to load chassis due to ModuleNotFoundError("No module named 'sonic_platform'")

liuh-80 commented 2 years ago

Xcvrd can't start on latest master branch vs image:

root@vlab-01:/# xcvrd Failed to load sfputil: Failed to load platform module 'sfputil': [Errno 2] No such file or directory: '/usr/share/sonic/platform/plugins/sfputil.py'

liuh-80 commented 2 years ago

The issue can be reproduce with following code on multi-asic KVM:

from sonic_py_common import daemon_base from sonic_py_common import multi_asic from sonic_py_common.interface import backplane_prefix, inband_prefix, recirc_prefix from swsscommon import swsscommon

SELECT_TIMEOUT_MSECS = 1000

swsscommon.SonicDBConfig.initializeGlobalConfig() db = daemon_base.db_connect('APPL_DB', namespace='asic0') port_tbl = swsscommon.SubscriberStateTable(db, swsscommon.APP_PORT_TABLE_NAME) sel = swsscommon.Select() sel.addSelectable(port_tbl)

while True: (state, _) = sel.select(SELECT_TIMEOUT_MSECS) if state == swsscommon.Select.TIMEOUT: continue if state != swsscommon.Select.OBJECT: continue while True: (key, op, fvp) = port_tbl.pop() if not key: break print("dev event: key-{} op={} fvp={}".format(key, op, fvp))

    if not key:
        break
    if not validate_port(key):
        continue
    fvp = dict(fvp) if fvp is not None else {}
    if 'index' not in fvp:
        fvp['index'] = '-1'
    port_index = int(fvp['index'])
    port_change_event = None
    if op == swsscommon.SET_COMMAND:
        print("set event: {}".format(fvp))
    elif op == swsscommon.DEL_COMMAND:
        print("dev event: {}".format(fvp))

And change config_db with: sonic-db-cli -n asic0 CONFIG_DB HSET "PORT|Ethernet0" "key" "field"

There will be multiple events:

set event: {'admin_status': 'up', 'alias': 'Ethernet1/1', 'asic_port_name': 'Eth0-ASIC0', 'description': 'ARISTA01T2:Ethernet1', 'index': '0', 'lanes': '1,2,3,4', 'mtu': '9100', 'pfc_asym': 'off', 'role': 'Ext', 'speed': '40000', 'tpid': '0x8100', 'oper_status': 'up', 'key': 'field'} set event: {'admin_status': 'up', 'alias': 'Ethernet1/1', 'asic_port_name': 'Eth0-ASIC0', 'description': 'ARISTA01T2:Ethernet1', 'index': '0', 'lanes': '1,2,3,4', 'mtu': '9100', 'pfc_asym': 'off', 'role': 'Ext', 'speed': '40000', 'tpid': '0x8100', 'oper_status': 'up', 'key': 'field'} set event: {'admin_status': 'up', 'alias': 'Ethernet1/1', 'asic_port_name': 'Eth0-ASIC0', 'description': 'ARISTA01T2:Ethernet1', 'index': '0', 'lanes': '1,2,3,4', 'mtu': '9100', 'pfc_asym': 'off', 'role': 'Ext', 'speed': '40000', 'tpid': '0x8100', 'oper_status': 'up', 'key': 'field'} set event: {'admin_status': 'up', 'alias': 'Ethernet1/1', 'asic_port_name': 'Eth0-ASIC0', 'description': 'ARISTA01T2:Ethernet1', 'index': '0', 'lanes': '1,2,3,4', 'mtu': '9100', 'pfc_asym': 'off', 'role': 'Ext', 'speed': '40000', 'tpid': '0x8100', 'oper_status': 'up', 'key': 'field'} set event: {'admin_status': 'up', 'alias': 'Ethernet1/1', 'asic_port_name': 'Eth0-ASIC0', 'description': 'ARISTA01T2:Ethernet1', 'index': '0', 'lanes': '1,2,3,4', 'mtu': '9100', 'pfc_asym': 'off', 'role': 'Ext', 'speed': '40000', 'tpid': '0x8100', 'oper_status': 'up', 'key': 'field'} set event: {'admin_status': 'up', 'alias': 'Ethernet1/1', 'asic_port_name': 'Eth0-ASIC0', 'description': 'ARISTA01T2:Ethernet1', 'index': '0', 'lanes': '1,2,3,4', 'mtu': '9100', 'pfc_asym': 'off', 'role': 'Ext', 'speed': '40000', 'tpid': '0x8100', 'oper_status': 'up', 'key': 'field'} set event: {'admin_status': 'up', 'alias': 'Ethernet1/1', 'asic_port_name': 'Eth0-ASIC0', 'description': 'ARISTA01T2:Ethernet1', 'index': '0', 'lanes': '1,2,3,4', 'mtu': '9100', 'pfc_asym': 'off', 'role': 'Ext', 'speed': '40000', 'tpid': '0x8100', 'oper_status': 'up', 'key': 'field'} set event: {'admin_status': 'up', 'alias': 'Ethernet1/1', 'asic_port_name': 'Eth0-ASIC0', 'description': 'ARISTA01T2:Ethernet1', 'index': '0', 'lanes': '1,2,3,4', 'mtu': '9100', 'pfc_asym': 'off', 'role': 'Ext', 'speed': '40000', 'tpid': '0x8100', 'oper_status': 'up', 'key': 'field'} set event: {'admin_status': 'up', 'alias': 'Ethernet1/1', 'asic_port_name': 'Eth0-ASIC0', 'description': 'ARISTA01T2:Ethernet1', 'index': '0', 'lanes': '1,2,3,4', 'mtu': '9100', 'pfc_asym': 'off', 'role': 'Ext', 'speed': '40000', 'tpid': '0x8100', 'oper_status': 'up', 'key': 'field'} set event: {'admin_status': 'up', 'alias': 'Ethernet1/1', 'asic_port_name': 'Eth0-ASIC0', 'description': 'ARISTA01T2:Ethernet1', 'index': '0', 'lanes': '1,2,3,4', 'mtu': '9100', 'pfc_asym': 'off', 'role': 'Ext', 'speed': '40000', 'tpid': '0x8100', 'oper_status': 'up', 'key': 'field'} set event: {'admin_status': 'up', 'alias': 'Ethernet1/1', 'asic_port_name': 'Eth0-ASIC0', 'description': 'ARISTA01T2:Ethernet1', 'index': '0', 'lanes': '1,2,3,4', 'mtu': '9100', 'pfc_asym': 'off', 'role': 'Ext', 'speed': '40000', 'tpid': '0x8100', 'oper_status': 'up', 'key': 'field'} set event: {'admin_status': 'up', 'alias': 'Ethernet1/1', 'asic_port_name': 'Eth0-ASIC0', 'description': 'ARISTA01T2:Ethernet1', 'index': '0', 'lanes': '1,2,3,4', 'mtu': '9100', 'pfc_asym': 'off', 'role': 'Ext', 'speed': '40000', 'tpid': '0x8100', 'oper_status': 'up', 'key': 'field'}

However, when change APPL_DB with following code: admin@vlab-08:~$ sonic-db-cli -n asic0 APPL_DB HSET "PORT_TABLE:Ethernet0" "key" "field2"

There will only 1 event:

dev event: Ethernet0 SET (('admin_status', 'up'), ('alias', 'Ethernet1/1'), ('asic_port_name', 'Eth0-ASIC0'), ('description', 'ARISTA01T2:Ethernet1'), ('index', '0'), ('lanes', '1,2,3,4'), ('mtu', '9100'), ('pfc_asym', 'off'), ('role', 'Ext'), ('speed', '40000'), ('tpid', '0x8100'), ('oper_status', 'up'), ('key', 'field')) dev event: Ethernet0 SET (('admin_status', 'up'), ('alias', 'Ethernet1/1'), ('asic_port_name', 'Eth0-ASIC0'), ('description', 'ARISTA01T2:Ethernet1'), ('index', '0'), ('lanes', '1,2,3,4'), ('mtu', '9100'), ('pfc_asym', 'off'), ('role', 'Ext'), ('speed', '40000'), ('tpid', '0x8100'), ('oper_status', 'up'), ('key', 'field2')) <-- only receive 1 event.

So because the demo code only listen to APPL_DB, but the command are modify CONFIG_DB, so seems other application modify APPL_DB multiple times and cause this issue.

swss-common works OK because only receive 1 event for APPL_DB change.

liuh-80 commented 2 years ago

Will continue check where is the code modify CONFIG_DB multiple times.

liuh-80 commented 2 years ago

Seems it's done by https://github.com/Azure/sonic-swss/blob/7126857c1e387784bd04b9f7502018b2a4973d85/cfgmgr/portmgr.cpp

This issue seems a design issue:

  1. port manager will monitor config_db port table here: https://github.com/Azure/sonic-swss/blob/92589789aa79bf1e70937a35cb06eff8a358ab6b/cfgmgr/portmgrd.cpp#L55

  2. when config_db port table changed, portmgrd will write same entry multiple times: https://github.com/Azure/sonic-swss/blob/7126857c1e387784bd04b9f7502018b2a4973d85/cfgmgr/portmgr.cpp#L127

        for (auto &entry : field_values)
        {
            **writeConfigToAppDb(alias, fvField(entry), fvValue(entry));**
            SWSS_LOG_NOTICE("Configure %s %s to %s", alias.c_str(), fvField(entry).c_str(), fvValue(entry).c_str());
        }
    
        if (!portOk)
        {
            SWSS_LOG_INFO("Port %s is not ready, pending...", alias.c_str());
    
            **writeConfigToAppDb(alias, "mtu", mtu);
            writeConfigToAppDb(alias, "admin_status", admin_status);**
            field_values.clear();
            field_values.emplace_back("mtu", mtu);
            field_values.emplace_back("admin_status", admin_status);
            it->second = KeyOpFieldsValuesTuple{alias, SET_COMMAND, field_values};
            it++;
            continue;
        }

In my case, the entry contains 11 fields, so the code write same entry 11 times, then xvcrd receive 11 events: set event: {'admin_status': 'up', 'alias': 'Ethernet1/1', 'asic_port_name': 'Eth0-ASIC0', 'description': 'ARISTA01T2:Ethernet1', 'index': '0', 'lanes': '1,2,3,4', 'mtu': '9100', 'pfc_asym': 'off', 'role': 'Ext', 'speed': '40000', 'tpid': '0x8100', 'oper_status': 'up', 'key': 'field'}

Also here are syslog show portmgrd update same entry multiples:

        for (auto &entry : field_values)
        {
            writeConfigToAppDb(alias, fvField(entry), fvValue(entry));
            SWSS_LOG_NOTICE("Configure %s %s to %s", alias.c_str(), fvField(entry).c_str(), fvValue(entry).c_str()); <== here is the log code.
        }

Jul 26 08:40:22.122569 vlab-08 NOTICE swss0#portmgrd: :- doTask: Configure Ethernet12 alias to Ethernet1/4 Jul 26 08:40:22.128114 vlab-08 NOTICE swss0#portmgrd: :- doTask: Configure Ethernet12 asic_port_name to Eth3-ASIC0 Jul 26 08:40:22.135013 vlab-08 NOTICE swss0#portmgrd: :- doTask: Configure Ethernet12 description to Ethernet1/4 Jul 26 08:40:22.135013 vlab-08 NOTICE swss0#portmgrd: :- doTask: Configure Ethernet12 index to 3 Jul 26 08:40:22.135013 vlab-08 NOTICE swss0#portmgrd: :- doTask: Configure Ethernet12 lanes to 13,14,15,16 Jul 26 08:40:22.135724 vlab-08 NOTICE swss0#portmgrd: :- doTask: Configure Ethernet12 pfc_asym to off Jul 26 08:40:22.135724 vlab-08 NOTICE swss0#portmgrd: :- doTask: Configure Ethernet12 role to Ext Jul 26 08:40:22.135724 vlab-08 NOTICE swss0#portmgrd: :- doTask: Configure Ethernet12 speed to 40000 Jul 26 08:40:22.135860 vlab-08 NOTICE swss0#portmgrd: :- doTask: Configure Ethernet12 tpid to 0x8100

prgeor commented 2 years ago

@liuh-80 let me verify and get back with the sonic version

prsunny commented 2 years ago

Seems it's done by https://github.com/Azure/sonic-swss/blob/7126857c1e387784bd04b9f7502018b2a4973d85/cfgmgr/portmgr.cpp

This issue seems a design issue:

  1. port manager will monitor config_db port table here: https://github.com/Azure/sonic-swss/blob/92589789aa79bf1e70937a35cb06eff8a358ab6b/cfgmgr/portmgrd.cpp#L55
  2. when config_db port table changed, portmgrd will write same entry multiple times: https://github.com/Azure/sonic-swss/blob/7126857c1e387784bd04b9f7502018b2a4973d85/cfgmgr/portmgr.cpp#L127

        for (auto &entry : field_values)
        {
            **writeConfigToAppDb(alias, fvField(entry), fvValue(entry));**
            SWSS_LOG_NOTICE("Configure %s %s to %s", alias.c_str(), fvField(entry).c_str(), fvValue(entry).c_str());
        }
    
        if (!portOk)
        {
            SWSS_LOG_INFO("Port %s is not ready, pending...", alias.c_str());
    
            **writeConfigToAppDb(alias, "mtu", mtu);
            writeConfigToAppDb(alias, "admin_status", admin_status);**
            field_values.clear();
            field_values.emplace_back("mtu", mtu);
            field_values.emplace_back("admin_status", admin_status);
            it->second = KeyOpFieldsValuesTuple{alias, SET_COMMAND, field_values};
            it++;
            continue;
        }

In my case, the entry contains 11 fields, so the code write same entry 11 times, then xvcrd receive 11 events: set event: {'admin_status': 'up', 'alias': 'Ethernet1/1', 'asic_port_name': 'Eth0-ASIC0', 'description': 'ARISTA01T2:Ethernet1', 'index': '0', 'lanes': '1,2,3,4', 'mtu': '9100', 'pfc_asym': 'off', 'role': 'Ext', 'speed': '40000', 'tpid': '0x8100', 'oper_status': 'up', 'key': 'field'}

Also here are syslog show portmgrd update same entry multiples:

        for (auto &entry : field_values)
        {
            writeConfigToAppDb(alias, fvField(entry), fvValue(entry));
            SWSS_LOG_NOTICE("Configure %s %s to %s", alias.c_str(), fvField(entry).c_str(), fvValue(entry).c_str()); <== here is the log code.
        }

Jul 26 08:40:22.122569 vlab-08 NOTICE swss0#portmgrd: :- doTask: Configure Ethernet12 alias to Ethernet1/4 Jul 26 08:40:22.128114 vlab-08 NOTICE swss0#portmgrd: :- doTask: Configure Ethernet12 asic_port_name to Eth3-ASIC0 Jul 26 08:40:22.135013 vlab-08 NOTICE swss0#portmgrd: :- doTask: Configure Ethernet12 description to Ethernet1/4 Jul 26 08:40:22.135013 vlab-08 NOTICE swss0#portmgrd: :- doTask: Configure Ethernet12 index to 3 Jul 26 08:40:22.135013 vlab-08 NOTICE swss0#portmgrd: :- doTask: Configure Ethernet12 lanes to 13,14,15,16 Jul 26 08:40:22.135724 vlab-08 NOTICE swss0#portmgrd: :- doTask: Configure Ethernet12 pfc_asym to off Jul 26 08:40:22.135724 vlab-08 NOTICE swss0#portmgrd: :- doTask: Configure Ethernet12 role to Ext Jul 26 08:40:22.135724 vlab-08 NOTICE swss0#portmgrd: :- doTask: Configure Ethernet12 speed to 40000 Jul 26 08:40:22.135860 vlab-08 NOTICE swss0#portmgrd: :- doTask: Configure Ethernet12 tpid to 0x8100

We did look at it earlier. The question is why multiple updates are coming for a single change in config_db. portmgrd is writing to app_db. I don't think xcvrd is complaining about app_db update. Related to portsyncd, it is only during the first time init, portsyncd reads from config_db and write to app_db.

liuh-80 commented 2 years ago

@prsunny , regarding to your question: The question is why multiple updates are coming for a single change in config_db

In the description of this issue, actually the print been added to this method: def handle_port_update_event(sel, asic_context, stop_event, logger, port_change_event_handler): Add a print here to output the 'fvp' values

So, the multiple update event is from appl_db, not from config_db, following code subscript the event:

          def subscribe_port_update_event(namespaces):
              port_tbl_map = [
                  {'APPL_DB': swsscommon.APP_PORT_TABLE_NAME},
                  {'STATE_DB': 'TRANSCEIVER_INFO'},
                  {'STATE_DB': 'PORT_TABLE'},
              ]
              ......

Also code here show the evenr is from appl_db: https://github.com/sonic-net/sonic-platform-daemons/blob/e3b03d4dda09c20cf96692a30b569986429ed2d4/sonic-xcvrd/xcvrd/xcvrd.py#L1235

APPL_DB for CONFIG updates, and STATE_DB for insertion/removal

    sel, asic_context = port_mapping.subscribe_port_update_event(self.namespaces)
    while not self.task_stopping_event.is_set():
        # Handle port change event from main thread
        port_mapping.handle_port_update_event(sel,
                                              asic_context,
                                              self.task_stopping_event,
                                              helper_logger,
                                              self.on_port_update_event)

Here is why single config_db change will trigger multiple appl_db event:

  1. Following command will change config_db port table: redis-cli -n 4 hset "PORT|Ethernet0" "tx_power" "-13" This command is the single change.

  2. Then the portmgrd are listening config_db port table set operation, and sync the change to appl_db: https://github.com/Azure/sonic-swss/blob/7126857c1e387784bd04b9f7502018b2a4973d85/cfgmgr/portmgr.cpp#L82

    However, because the code issue in this method, a single change 1 step1 will trigger multiple change to appl_db:
    
        for (auto &entry : field_values)
        {
            writeConfigToAppDb(alias, fvField(entry), fvValue(entry)); <== multiple change to application DB
            SWSS_LOG_NOTICE("Configure %s %s to %s", alias.c_str(), fvField(entry).c_str(), fvValue(entry).c_str());
        }
  3. Then xvcrd are subscribe appl_db change here: https://github.com/sonic-net/sonic-platform-daemons/blob/e3b03d4dda09c20cf96692a30b569986429ed2d4/sonic-xcvrd/xcvrd/xcvrd_utilities/port_mapping.py#L108

      def subscribe_port_update_event(namespaces):
          port_tbl_map = [
              {'APPL_DB': swsscommon.APP_PORT_TABLE_NAME},
              {'STATE_DB': 'TRANSCEIVER_INFO'},
              {'STATE_DB': 'PORT_TABLE'},
          ]

    So xvcrd will get multiple change from appl_db.

prsunny commented 2 years ago

thanks @liuh-80 for the explanation. Original issue was reported as multiple notifications for single update on config_db. Regarding app_db, it seems to be expected and @prgeor , i think this issue can be closed and not applicable to swss-common.

liuh-80 commented 2 years ago

Close this issue because not a swss-common issue.