sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[202405] [Chassis]: Ports take too long to come up due to delayed port up notification processing by orchagent #19569

Closed mannytaheri closed 1 month ago

mannytaheri commented 4 months ago

Issue Description

The ports take more than 20 minutes to come up because orchagent is slow to process port up notifications after a reload/reboot in the T2 topology.

Results you see

The port up notifications are queued behind a large number of BGP route updates (34000 routes) and take a long time to be processed. This occurs after a config reload or a reboot.

Results you expected to see

The bgp routes update should be handled correctly and ports should come up in a reasonable time.

Is it platform specific

generic

Relevant log output

No response

Output of show version

admin@ixre-egl-board15:~$ show ver

SONiC Software Version: SONiC.HEAD.742851-nokia-master-ef3457c7
SONiC OS Version: 12
Distribution: Debian 12.5
Kernel: 6.1.0-11-2-amd64
Build commit: ef3457c7
Build date: Fri Jun 14 19:31:02 UTC 2024
Built by: gitlab-runner@sonic-build-server04

Attach files (if any)

No response

arlakshm commented 3 months ago

Hi @mannytaheri, can you please apply these 2 PRs and see if the ports come up faster? https://github.com/sonic-net/sonic-host-services/pull/135 https://github.com/sonic-net/sonic-buildimage/pull/19482

wenyiz2021 commented 3 months ago

@mannytaheri please let me know the test result with the patches. If they don't work, I'll look into it.

deepak-singhal0408 commented 3 months ago

PMON faster bring up does not seem to help this issue. @wenyiz2021 could you help follow up with Arvind/BRCM and try out the BRCM fix

judyjoseph commented 3 months ago

Related issue raised earlier?: https://github.com/sonic-net/sonic-buildimage/issues/17180

judyjoseph commented 3 months ago

To further debug

  1. Why is this issue newly seen in master/202405 and not in 202205?
  2. Are the ports coming up late or early, or at around the same time as the BGP route updates?
  3. Have any new dockers/systemd services been added that cause a dependency?
  4. Check the eventd logs; can we disable the eventd docker?

Check with sonic-common-infra subgroup, for a root cause which could be known.

judyjoseph commented 3 months ago

The FIB suppress pending feature got merged recently. Can we check again with the latest master/202405 build? https://github.com/sonic-net/sonic-buildimage/pull/19736 @mannytaheri

wenyiz2021 commented 2 months ago

Cannot reproduce the issue on an Arista chassis with the latest master image with SAI 11.2, taken from https://github.com/sonic-net/sonic-buildimage/pull/19854

wenyiz2021 commented 2 months ago

@mannytaheri this does not seem to be a general issue for all platforms. Can you try the above master image with SAI 11?

abdosi commented 2 months ago

[image attachment, not preserved]

abdosi commented 2 months ago

above is the understanding.

liuh-80 commented 2 months ago

Ack, I will simulate this case and check whether the issue is related to the sonic-swss-common selectable priority.

liuh-80 commented 2 months ago

Here is an update. Today I created a test case to simulate the scenario; here is my summary:

  1. Orchagent supports 2 kinds of consumer:

       - SubscriberStateTable: CONFIG_DB, STATE_DB, CHASSIS_APP_DB
       - ConsumerStateTable: all other databases

     Because I don't know which table class the BGP routes and the port state use on a chassis, I created a swss-common test case to simulate the issue and tested both table classes.

  2. Here is how my test case works:

       - Step 1: Create a port consumer with priority 45.
       - Step 2: Create a route consumer with priority 5.
       - Step 3: Set DEFAULT_POP_BATCH_SIZE to 128.
       - Step 4: Create 1 port event.
       - Step 5: Create 10000 route events.
       - Step 6: Start popping port and route events. In the middle of handling the route events, create a new port event and check whether it is popped immediately.

  3. ConsumerStateTable: issue not found. Every pop returns 128 routes, and if there is a new port event, it is popped first.

  4. SubscriberStateTable: issue found. The batch size parameter of SubscriberStateTable does not work:

        SubscriberStateTable route_consumer(&consumer_db, routeTableName, DEFAULT_POP_BATCH_SIZE, 5);

     The SubscriberStateTable always pops all route data, which means orchagent will not handle a new incoming port event until it has finished processing all route events.

I'm not sure whether the performance issue is caused by this. I'm checking the database name and table name of the port event and the route event.
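To make the difference between the two observations above concrete, here is a minimal, self-contained sketch. It is a toy model, not swss-common code: `ModelConsumer`, `selectReady` and `routesHandledBeforePort` are hypothetical names, the priorities 5 and 45 and the batch size 128 are taken from the test description, and a batch limit of 0 is used as an assumed stand-in for "the batch size parameter is ignored and everything is popped".

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical stand-in for a table consumer (not a swss-common class).
struct ModelConsumer {
    int priority;                     // higher value is selected first
    std::size_t batchLimit;           // 0 models "batch size ignored, pop everything"
    std::deque<std::string> pending;  // queued keys

    // Remove and count up to batchLimit entries (all of them when batchLimit == 0).
    std::size_t pops() {
        std::size_t n = pending.size();
        if (batchLimit != 0 && batchLimit < n) {
            n = batchLimit;
        }
        pending.erase(pending.begin(),
                      pending.begin() + static_cast<std::ptrdiff_t>(n));
        return n;
    }
};

// Return the non-empty consumer with the highest priority, or nullptr when idle.
ModelConsumer* selectReady(const std::vector<ModelConsumer*>& consumers) {
    ModelConsumer* best = nullptr;
    for (ModelConsumer* c : consumers) {
        if (!c->pending.empty() &&
            (best == nullptr || c->priority > best->priority)) {
            best = c;
        }
    }
    return best;
}

// A port event arrives once `arriveAt` routes have been handled. Returns how many
// further routes are handled before the port event itself gets selected.
std::size_t routesHandledBeforePort(std::size_t routeCount,
                                    std::size_t routeBatchLimit,
                                    std::size_t arriveAt) {
    assert(arriveAt <= routeCount);
    ModelConsumer routes{5, routeBatchLimit, {}};
    ModelConsumer ports{45, 128, {}};
    for (std::size_t i = 0; i < routeCount; ++i) {
        routes.pending.push_back("bgp_route_" + std::to_string(i));
    }
    std::vector<ModelConsumer*> consumers{&routes, &ports};

    std::size_t handled = 0;
    bool arrived = false;
    while (ModelConsumer* c = selectReady(consumers)) {
        if (c == &ports) {
            return handled - arriveAt;  // port event finally selected
        }
        handled += c->pops();
        if (!arrived && handled >= arriveAt) {
            ports.pending.push_back("port_up");  // becomes visible at the next select
            arrived = true;
        }
    }
    return 0;  // only reached when there is nothing to process
}
```

With 10000 routes and a port event arriving after 500 of them, the batch-limited model lets the port event through after the current batch finishes, while the pop-everything model makes it wait for all remaining routes.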

saksarav-nokia commented 2 months ago

@liuh-80 , In chassis, the following is the code path and it seems to be using ConsumerStateTable

    const int routeorch_pri = 5;
    vector<table_name_with_pri_t> route_tables = {
        { APP_ROUTE_TABLE_NAME,       routeorch_pri },
        { APP_LABEL_ROUTE_TABLE_NAME, routeorch_pri }
    };
    gRouteOrch = new RouteOrch(m_applDb, route_tables, gSwitchOrch, gNeighOrch, gIntfsOrch, vrf_orch, gFgNhgOrch, gSrv6Orch);

    RouteOrch::RouteOrch(DBConnector *db, vector<table_name_with_pri_t> &tableNames, SwitchOrch *switchOrch, NeighOrch *neighOrch, IntfsOrch *intfsOrch, VRFOrch *vrfOrch, FgNhgOrch *fgNhgOrch, Srv6Orch *srv6Orch) :
        gRouteBulker(sai_route_api, gMaxBulkSize),
        gLabelRouteBulker(sai_mpls_api, gMaxBulkSize),
        gNextHopGroupMemberBulker(sai_next_hop_group_api, gSwitchId, gMaxBulkSize),
        Orch(db, tableNames),
    {
    }

    Orch::Orch(DBConnector *db, const vector<table_name_with_pri_t> &tableNames_with_pri)
    {
        for (const auto& it : tableNames_with_pri)
        {
            addConsumer(db, it.first, it.second);
        }
    }

    void Orch::addConsumer(DBConnector *db, string tableName, int pri)
    {
        if (db->getDbId() == CONFIG_DB || db->getDbId() == STATE_DB || db->getDbId() == CHASSIS_APP_DB)
        {
            addExecutor(new Consumer(new SubscriberStateTable(db, tableName, TableConsumable::DEFAULT_POP_BATCH_SIZE, pri), this, tableName));
        }
        else
        {
            addExecutor(new Consumer(new ConsumerStateTable(db, tableName, gBatchSize, pri), this, tableName));
        }
    }
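The branch in `Orch::addConsumer` quoted above can be reduced to a small sketch that only reports which table class would be chosen. The enum `ModelDb` and the function names are hypothetical and exist only for illustration; the real database identifiers come from the SONiC headers and are not reproduced here.

```cpp
#include <cassert>
#include <string>

// Hypothetical database labels for illustration only.
enum class ModelDb { APPL_DB, CONFIG_DB, STATE_DB, CHASSIS_APP_DB };

// Same branch structure as the quoted Orch::addConsumer, but returning the
// name of the table class instead of constructing a consumer.
std::string tableClassFor(ModelDb db) {
    if (db == ModelDb::CONFIG_DB || db == ModelDb::STATE_DB ||
        db == ModelDb::CHASSIS_APP_DB) {
        return "SubscriberStateTable";
    }
    return "ConsumerStateTable";
}

// The route tables are created on m_applDb in the quoted code, so under this
// model they fall into the ConsumerStateTable branch.
void checkRouteTablesUseConsumerStateTable() {
    assert(tableClassFor(ModelDb::APPL_DB) == "ConsumerStateTable");
}
```

This matches the reading given in the comment: route events on the application database are consumed through ConsumerStateTable, so the batch size limitation reported for SubscriberStateTable would not apply to them directly.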

liuh-80 commented 2 months ago

@saksarav-nokia, thanks. The issue needs more investigation; I will try to reproduce it first.

saksarav-nokia commented 2 months ago

@liuh-80 , We can easily reproduce this in our setup. Let me know if you want us to collect any info or logs?

liuh-80 commented 2 months ago

@saksarav-nokia, can you share the reproduction steps, OS version and hardware SKU?

saksarav-nokia commented 2 months ago

admin@ixre-egl-board211:~$ show version

SONiC Software Version: SONiC.HEAD.798897-202405-3192720893
SONiC OS Version: 12
Distribution: Debian 12.6
Kernel: 6.1.0-11-2-amd64
Build commit: 3192720893
Build date: Thu Aug 15 09:35:12 UTC 2024
Built by: gitlab-runner@wfrv-sonicbld05

Platform: x86_64-nokia_ixr7250e_36x400g-r0
HwSKU: Nokia-IXR7250E-36x400G
ASIC: broadcom
ASIC Count: 2

liuh-80 commented 2 months ago

What commands do I need to run to create the BGP routes and the port up event? Also, what indicates that the BGP up event is blocked by BGP routes; do I need to check syslog?

saksarav-nokia commented 2 months ago

@liuh-80, we have 36 eBGP neighbors and 6 iBGP neighbors, with about 34k routes from each eBGP neighbor. We just reboot this line card to see the issue.

admin@ixre-egl-board211:~$ show ip bgp summary -d all

IPv4 Unicast Summary:
asic0: BGP router identifier 8.0.0.24, local AS number 65100 vrf-id 0
BGP table version 501903
asic1: BGP router identifier 8.0.0.26, local AS number 65100 vrf-id 0
BGP table version 1594975
RIB entries 205222, using 39402624 bytes of memory
Peers 32, using 23743232 KiB of memory
Peer groups 8, using 512 bytes of memory

Neighbhor  V  AS     MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/PfxRcd  NeighborName
3.3.3.24   4  65100  16567    16535    0       0    0     04:32:04  442664        ASIC0
3.3.3.26   4  65100  8943     8947     0       0    0     04:32:04  442659        ASIC1
3.3.3.36   4  65100  5880     8946     0       0    0     04:32:03  3098          ixre-egl-board212-ASIC0
3.3.3.36   4  65100  6663     16515    0       0    0     04:35:35  3098          ixre-egl-board212-ASIC0
3.3.3.38   4  65100  6072     8946     0       0    0     04:32:03  3098          ixre-egl-board212-ASIC1
3.3.3.38   4  65100  6855     16515    0       0    0     04:35:35  3098          ixre-egl-board212-ASIC1
10.0.0.1   4  65200  5711     5908     0       0    0     04:31:58  34050         ARISTA01T3
10.0.0.5   4  65200  5709     5903     0       0    0     04:31:51  34050         ARISTA03T3
10.0.0.9   4  65200  5709     5903     0       0    0     04:31:51  34050         ARISTA05T3
10.0.0.13  4  65200  5707     5898     0       0    0     04:31:45  34050         ARISTA07T3
10.0.0.17  4  65200  5707     5898     0       0    0     04:31:45  34050         ARISTA09T3
10.0.0.21  4  65200  5712     5909     0       0    0     04:32:02  34050         ARISTA11T3
10.0.0.23  4  65200  5711     5908     0       0    0     04:32:00  34050         ARISTA12T3
10.0.0.25  4  65200  5713     5914     0       0    0     04:32:05  34050         ARISTA13T3
10.0.0.27  4  65200  5713     5914     0       0    0     04:32:05  34049         ARISTA14T3
10.0.0.29  4  65200  5712     5909     0       0    0     04:32:01  34050         ARISTA15T3
10.0.0.31  4  65200  5711     5908     0       0    0     04:32:00  34050         ARISTA16T3
10.0.0.33  4  65200  5713     5914     0       0    0     04:32:05  34050         ARISTA17T3
10.0.0.35  4  65200  5713     5914     0       0    0     04:32:05  34050         ARISTA18T3
10.0.0.37  4  65200  6200     6612     0       0    0     04:56:25  34050         ARISTA19T3
10.0.0.41  4  65200  6206     6618     0       0    0     04:56:44  34050         ARISTA21T3
10.0.0.45  4  65200  6200     6612     0       0    0     04:56:25  34050         ARISTA23T3
10.0.0.49  4  65200  6202     6613     0       0    0     04:56:30  34050         ARISTA25T3
10.0.0.53  4  65200  6200     6612     0       0    0     04:56:25  34050         ARISTA27T3
10.0.0.57  4  65200  6206     6618     0       0    0     04:56:43  34049         ARISTA29T3
10.0.0.59  4  65200  6267     6680     0       0    0     04:59:48  34050         ARISTA30T3
10.0.0.61  4  65200  6206     6618     0       0    0     04:56:43  34049         ARISTA31T3
10.0.0.63  4  65200  6204     6616     0       0    0     04:56:38  34049         ARISTA32T3
10.0.0.65  4  65200  6204     6616     0       0    0     04:56:39  34049         ARISTA33T3
10.0.0.67  4  65200  6204     6616     0       0    0     04:56:37  34049         ARISTA34T3
10.0.0.69  4  65200  6247     6659     0       0    0     04:58:46  34050         ARISTA35T3
10.0.0.71  4  65200  6283     6695     0       0    0     05:00:35  34049         ARISTA36T3

Total number of neighbors 32

admin@ixre-egl-board211:~$ show interface status -d all

      Interface                            Lanes    Speed    MTU    FEC         Alias             Vlan    Oper    Admin                                             Type    Asym PFC
      Ethernet0          72,73,74,75,76,77,78,79     400G   9100    N/A   Ethernet1/1   PortChannel102      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
      Ethernet8          80,81,82,83,84,85,86,87     400G   9100    N/A   Ethernet2/1   PortChannel102      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
     Ethernet16          88,89,90,91,92,93,94,95     400G   9100    N/A   Ethernet3/1   PortChannel104      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
     Ethernet24      96,97,98,99,100,101,102,103     400G   9100    N/A   Ethernet4/1   PortChannel104      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
     Ethernet32  104,105,106,107,108,109,110,111     400G   9100    N/A   Ethernet5/1   PortChannel106      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
     Ethernet40  112,113,114,115,116,117,118,119     400G   9100    N/A   Ethernet6/1   PortChannel106      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
     Ethernet48  120,121,122,123,124,125,126,127     400G   9100    N/A   Ethernet7/1   PortChannel108      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
     Ethernet56  128,129,130,131,132,133,134,135     400G   9100    N/A   Ethernet8/1   PortChannel108      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
     Ethernet64  136,137,138,139,140,141,142,143     400G   9100    N/A   Ethernet9/1  PortChannel1010      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
     Ethernet72          64,65,66,67,68,69,70,71     400G   9100    N/A  Ethernet10/1  PortChannel1010      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
     Ethernet80          56,57,58,59,60,61,62,63     400G   9100    N/A  Ethernet11/1           routed      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
     Ethernet88          48,49,50,51,52,53,54,55     400G   9100    N/A  Ethernet12/1           routed      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
     Ethernet96          40,41,42,43,44,45,46,47     400G   9100    N/A  Ethernet13/1           routed      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet104          32,33,34,35,36,37,38,39     400G   9100    N/A  Ethernet14/1           routed      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet112          24,25,26,27,28,29,30,31     400G   9100    N/A  Ethernet15/1           routed      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet120          16,17,18,19,20,21,22,23     400G   9100    N/A  Ethernet16/1           routed      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet128            8,9,10,11,12,13,14,15     400G   9100    N/A  Ethernet17/1           routed      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet136                  0,1,2,3,4,5,6,7     400G   9100    N/A  Ethernet18/1           routed      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet144          72,73,74,75,76,77,78,79     400G   9100    N/A  Ethernet19/1  PortChannel1028      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet152          80,81,82,83,84,85,86,87     400G   9100    N/A  Ethernet20/1  PortChannel1028      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet160          88,89,90,91,92,93,94,95     400G   9100    N/A  Ethernet21/1  PortChannel1030      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet168      96,97,98,99,100,101,102,103     400G   9100    N/A  Ethernet22/1  PortChannel1030      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet176  104,105,106,107,108,109,110,111     400G   9100    N/A  Ethernet23/1  PortChannel1032      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet184  112,113,114,115,116,117,118,119     400G   9100    N/A  Ethernet24/1  PortChannel1032      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet192  120,121,122,123,124,125,126,127     400G   9100    N/A  Ethernet25/1  PortChannel1034      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet200  128,129,130,131,132,133,134,135     400G   9100    N/A  Ethernet26/1  PortChannel1034      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet208  136,137,138,139,140,141,142,143     400G   9100    N/A  Ethernet27/1  PortChannel1036      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet216          64,65,66,67,68,69,70,71     400G   9100    N/A  Ethernet28/1  PortChannel1036      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet224          56,57,58,59,60,61,62,63     400G   9100    N/A  Ethernet29/1           routed      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet232          48,49,50,51,52,53,54,55     400G   9100    N/A  Ethernet30/1           routed      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet240          40,41,42,43,44,45,46,47     400G   9100    N/A  Ethernet31/1           routed      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet248          32,33,34,35,36,37,38,39     400G   9100    N/A  Ethernet32/1           routed      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet256          24,25,26,27,28,29,30,31     400G   9100    N/A  Ethernet33/1           routed      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet264          16,17,18,19,20,21,22,23     400G   9100    N/A  Ethernet34/1           routed      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet272            8,9,10,11,12,13,14,15     400G   9100    N/A  Ethernet35/1           routed      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
    Ethernet280                  0,1,2,3,4,5,6,7     400G   9100    N/A  Ethernet36/1           routed      up       up  QSFP-DD Double Density 8X Pluggable Transceiver         off
   Ethernet-IB0                              219      10G   9100    N/A     Recirc0/0           routed      up       up                                              N/A         off
   Ethernet-IB1                              219      10G   9100    N/A     Recirc1/0           routed      up       up                                              N/A         off
  Ethernet-Rec0                              220      10G   9100    N/A     Recirc0/1           routed      up       up                                              N/A         off
  Ethernet-Rec1                              220      10G   9100    N/A     Recirc1/1           routed      up       up                                              N/A         off
 PortChannel102                              N/A     800G   9100    N/A           N/A           routed      up       up                                              N/A         N/A
 PortChannel104                              N/A     800G   9100    N/A           N/A           routed      up       up                                              N/A         N/A
 PortChannel106                              N/A     800G   9100    N/A           N/A           routed      up       up                                              N/A         N/A
 PortChannel108                              N/A     800G   9100    N/A           N/A           routed      up       up                                              N/A         N/A
PortChannel1010                              N/A     800G   9100    N/A           N/A           routed      up       up                                              N/A         N/A
PortChannel1028                              N/A     800G   9100    N/A           N/A           routed      up       up                                              N/A         N/A
PortChannel1030                              N/A     800G   9100    N/A           N/A           routed      up       up                                              N/A         N/A
PortChannel1032                              N/A     800G   9100    N/A           N/A           routed      up       up                                              N/A         N/A
PortChannel1034                              N/A     800G   9100    N/A           N/A           routed      up       up                                              N/A         N/A
PortChannel1036                              N/A     800G   9100    N/A           N/A           routed      up       up                                              N/A         N/A

saksarav-nokia commented 2 months ago

@liuh-80, we have 36 front panel ports on the ASIC, connected to Arista VMs. We have enabled BGP on both our chassis and the Arista VMs, which establishes the BGP neighbors, and the routes are injected from the Arista VMs.

saksarav-nokia commented 2 months ago

@mlok-nokia @fountzou for viz

liuh-80 commented 2 months ago

Update: I found something; however, I need to reproduce it to confirm.

The issue seems to be caused by the following code in Consumer::execute():

    void Consumer::execute()
    {
        // ConsumerBase::execute_impl();
        SWSS_LOG_ENTER();

        size_t update_size = 0;
        auto table = static_cast<swss::ConsumerTableBase *>(getSelectable());
        do
        {
            std::deque<KeyOpFieldsValuesTuple> entries;
            table->pops(entries);
            update_size = addToSync(entries);
        } while (update_size != 0);

        drain();
    }

Here is my theory:

  1. When there are 10000+ incoming routes, the route consumer is selected in the following code:

        void OrchDaemon::start()
        {
            ......

            while (true)
            {
                Selectable *s;
                int ret;

                ret = m_select->select(&s, SELECT_TIMEOUT);   // <== route consumer is selected here

                ......

                auto *c = (Executor *)s;
                c->execute();

                /* After each iteration, periodically check all m_toSync map to
                 * execute all the remaining tasks that need to be retried. */

                /* TODO: Abstract Orch class to have a specific todo list */
                for (Orch *o : m_orchList)
                    o->doTask();

  2. The route consumer then starts its execute() method.

  3. Inside execute() there is a loop: pops() pops 128 entries, then addToSync() returns a non-zero value, which causes the table to pop again:

        auto table = static_cast<swss::ConsumerTableBase *>(getSelectable());
        do
        {
            std::deque<KeyOpFieldsValuesTuple> entries;
            table->pops(entries);               // <== pops 128 routes by default
            update_size = addToSync(entries);   // <== returns a non-zero value
        } while (update_size != 0);             // <== a non-zero update_size causes the table to pop again

  4. Because there are 10000+ routes in the table, the code effectively blocks here, and the port notification is not selected until all routes have been processed.
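The effect of the loop described in this theory can be shown with a small, self-contained model. Everything here is hypothetical and only mirrors the shape of the quoted code: `ModelTable` stands in for the route table, `executeDrainLoop` follows the do/while structure of `Consumer::execute()`, and `executeSingleBatch` is an assumed alternative for contrast, not necessarily what the eventual fix does.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical route table: `pending` queued entries, at most `batch` per pop.
struct ModelTable {
    std::size_t pending;
    std::size_t batch;

    std::size_t pops() {
        assert(batch > 0);
        const std::size_t n = pending < batch ? pending : batch;
        pending -= n;
        return n;
    }
};

// Same shape as the quoted loop: keep popping while the last pop returned data.
// Returns the number of entries handled before control goes back to the caller.
std::size_t executeDrainLoop(ModelTable& table) {
    std::size_t total = 0;
    std::size_t updateSize = 0;
    do {
        updateSize = table.pops();
        total += updateSize;
    } while (updateSize != 0);
    return total;
}

// Assumed alternative for comparison: handle one batch, then return.
std::size_t executeSingleBatch(ModelTable& table) {
    return table.pops();
}

// Count how many times control returns to the select loop before the table is
// empty. Each return is a chance for a higher-priority consumer to be selected.
std::size_t selectRoundsToDrain(std::size_t routeCount, std::size_t batch,
                                bool drainLoop) {
    ModelTable table{routeCount, batch};
    std::size_t rounds = 0;
    while (table.pending > 0) {
        if (drainLoop) {
            executeDrainLoop(table);
        } else {
            executeSingleBatch(table);
        }
        ++rounds;
    }
    return rounds;
}
```

In this model, 10000 routes with a batch size of 128 give the selector a single opportunity to pick another consumer when the drain loop is used, compared with one opportunity per batch otherwise. The batch size still bounds each individual pop, but the loop removes its effect on scheduling.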

I modified the test case to simulate this, and the while loop does seem to cause the issue:

    #define DEFAULT_POP_BATCH_SIZE (128)

    void ProducerStateTableSet(ProducerStateTable &table, string key)
    {
        vector<FieldValueTuple> fields;
        FieldValueTuple t("test_field", "test_value");
        fields.push_back(t);
        table.set(key, fields);
    }

    TEST(Priority, massive_route_block_portstatus)
    {
        std::string routeTableName = "route_table";
        std::string portTableName = "port_table";

        DBConnector producer_db("TEST_DB", 0, true);
        DBConnector consumer_db("TEST_DB", 0, true);

        ProducerStateTable route_producer(&producer_db, routeTableName);
        ProducerStateTable port_producer(&producer_db, portTableName);

        ConsumerStateTable route_consumer(&consumer_db, routeTableName, DEFAULT_POP_BATCH_SIZE, 5);
        ConsumerStateTable port_consumer(&consumer_db, portTableName, DEFAULT_POP_BATCH_SIZE, 40);

        Select selector;
        Selectable *selected;

        selector.addSelectable(&route_consumer);
        selector.addSelectable(&port_consumer);

        // create 1 port table event
        ProducerStateTableSet(port_producer, "port_up_01");

        // create the route table events
        int ROUTE_COUNT = 1000;
        for (int route_idx = 0; route_idx < ROUTE_COUNT; route_idx++)
        {
            ProducerStateTableSet(route_producer, "bgp_route_" + to_string(route_idx));
        }

        // the port consumer has the higher priority, so it is selected first
        selector.select(&selected);
        EXPECT_EQ(selected, &port_consumer);

        {
            std::deque<KeyOpFieldsValuesTuple> ports;
            port_consumer.pops(ports);
            EXPECT_EQ(ports.size(), 1);
            while (!ports.empty())
            {
                KeyOpFieldsValuesTuple port = ports.front();
                auto key = kfvKey(port);
                cout << key << endl;
                ports.pop_front();
            }
        }

        int poped_entry = 0;
        bool send_port_02 = false;
        while (poped_entry < ROUTE_COUNT + 1)
        {
            selector.select(&selected);
            cout << "seletcted" << endl;
            if (selected == &route_consumer)
            {
                // same do/while shape as Consumer::execute(): pop until nothing is left
                int routes_count = 0;
                do
                {
                    std::deque<KeyOpFieldsValuesTuple> routes;
                    route_consumer.pops(routes);
                    poped_entry += (int)(routes.size());
                    routes_count = (int)(routes.size());
                    cout << "poped " << routes.size() << " routes" << endl;
                    while (!routes.empty())
                    {
                        KeyOpFieldsValuesTuple route = routes.front();
                        auto key = kfvKey(route);
                        //cout << key << endl;
                        routes.pop_front();
                    }

                    // inject a new port event once, in the middle of route processing
                    if (!send_port_02 && poped_entry >= 500)
                    {
                        cout << "create new port status" << endl;
                        ProducerStateTableSet(port_producer, "port_up_02");
                        send_port_02 = true;
                    }
                }
                while (routes_count > 0);
            }
            else if (selected == &port_consumer)
            {
                std::deque<KeyOpFieldsValuesTuple> ports;
                port_consumer.pops(ports);
                cout << "poped " << ports.size() << " ports" << endl;
                poped_entry += (int)(ports.size());
                while (!ports.empty())
                {
                    KeyOpFieldsValuesTuple port = ports.front();
                    auto key = kfvKey(port);
                    cout << key << endl;
                    ports.pop_front();
                }
            }
        }
    }

Test result:

    [ RUN      ] Priority.massive_route_block_portstatus
    port_up_01
    seletcted
    poped 128 routes
    poped 128 routes
    poped 128 routes
    poped 128 routes
    create new port status
    poped 128 routes
    poped 128 routes
    poped 128 routes
    poped 104 routes
    poped 0 routes
    seletcted
    poped 1 ports
    port_up_02
    [       OK ] Priority.massive_route_block_portstatus (139 ms)

wenyiz2021 commented 2 months ago

@liuh-80 thanks so much for the investigation. Just curious: in this theory, does the number of BGP neighbors matter? I ask because I was not able to reproduce this with 4 BGP neighbors, each advertising 32k routes. The links are 400G.

liuh-80 commented 2 months ago

I don't understand how BGP neighbors are handled by orchagent, so I'm not sure whether the number of BGP neighbors is related to this issue. We need to test on hardware to confirm that this is the root cause. I will prepare an image to verify the fix.

liuh-80 commented 2 months ago

I found that the issue may be caused by this change: https://github.com/sonic-net/sonic-swss/commit/92589789aa79bf1e70937a35cb06eff8a358ab6b#diff-96451cb89f907afccbd39ddadb6d30aa21fe6fbd01b1cbaf6362078b926f1f08

I created a draft fix to verify that the change is the root cause: https://github.com/sonic-net/sonic-swss/pull/3269

abdosi commented 2 months ago

Being looked into actively within MSFT.

mlok-nokia commented 2 months ago

@liuh-80 I built an image (latest master) with this change and tested it. On the first boot after installation on a single linecard, all ports come up within 8 minutes and all 34k routes are installed. On a subsequent reboot of a single linecard, it takes about 7 minutes for all links to come up and the 34k routes to be installed. It seems this change addresses the issue. We need to do more testing to verify that, including the OC testing.

liuh-80 commented 2 months ago

Since the change has been verified to fix the issue, I published it for review and comments:

https://github.com/sonic-net/sonic-swss/pull/3269

saksarav-nokia commented 1 month ago

I am looking at the orchagent crashes seen with this fix.

liuh-80 commented 1 month ago

Closing because the fix PR has been merged.