sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[boot performance] SAI discovery process running after switch creation in fast/warm boot causes delays #13768

Open stepanblyschak opened 1 year ago

stepanblyschak commented 1 year ago

Description

Steps to reproduce the issue:

  1. Perform fast/warm-reboot

Observe that after create_switch() the SAI discovery process runs and, in this case, takes 1.02 sec:

Feb 10 11:38:20.926013 r-panther-13 NOTICE syncd#SDK: :- discover: discover took 0.203495 sec
Feb 10 11:38:20.926309 r-panther-13 NOTICE syncd#SDK: :- discover: discovered objects count: 1386
Feb 10 11:38:20.926489 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_PORT: 33
Feb 10 11:38:20.926597 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_VIRTUAL_ROUTER: 1
Feb 10 11:38:20.926722 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_STP: 1
Feb 10 11:38:20.926823 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_HOSTIF_TRAP_GROUP: 1
Feb 10 11:38:20.926943 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_QUEUE: 512
Feb 10 11:38:20.927045 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_SCHEDULER_GROUP: 512
Feb 10 11:38:20.927165 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_INGRESS_PRIORITY_GROUP: 256
Feb 10 11:38:20.927267 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_HASH: 2
Feb 10 11:38:20.927387 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_SWITCH: 1
Feb 10 11:38:20.927520 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_VLAN: 1
Feb 10 11:38:20.927711 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_VLAN_MEMBER: 32
Feb 10 11:38:20.927813 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_BRIDGE: 1
Feb 10 11:38:20.928017 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_BRIDGE_PORT: 33
Feb 10 11:38:20.928882 r-panther-13 NOTICE syncd#SDK: :- helperSaveDiscoveredObjectsToRedis: objects in ASIC state table present: 0
Feb 10 11:38:20.929008 r-panther-13 NOTICE syncd#SDK: :- helperSaveDiscoveredObjectsToRedis: putting ALL discovered objects to redis
Feb 10 11:38:21.601662 r-panther-13 NOTICE syncd#SDK: :- helperSaveDiscoveredObjectsToRedis: save discovered objects to redis took 0.673484 sec
Feb 10 11:38:21.602082 r-panther-13 NOTICE syncd#SDK: :- redisSaveInternalOids: put switch internal discovered rid oid:0x1 to Asic View and COLDVIDS
Feb 10 11:38:21.602592 r-panther-13 NOTICE syncd#SDK: :- redisSaveInternalOids: put switch internal discovered rid oid:0x100000026 to Asic View and COLDVIDS
Feb 10 11:38:21.603029 r-panther-13 NOTICE syncd#SDK: :- redisSaveInternalOids: put switch internal discovered rid oid:0x10 to Asic View and COLDVIDS
Feb 10 11:38:21.603480 r-panther-13 NOTICE syncd#SDK: :- redisSaveInternalOids: put switch internal discovered rid oid:0x3 to Asic View and COLDVIDS
Feb 10 11:38:21.603693 r-panther-13 WARNING syncd#SDK: [SAI_UTILS.WARNING] mlnx_sai_utils.c[1691]- check_attribs_metadata: Not implemented attribute SAI_SWITCH_ATTR_DEFAULT_OVERRIDE_VIRTUAL_ROUTER_ID (vendor data not found)
Feb 10 11:38:21.603769 r-panther-13 WARNING syncd#SDK: [SAI_UTILS.WARNING] mlnx_sai_utils.c[2060]- sai_get_attributes: Failed attribs check, key:Switch ID 1
Feb 10 11:38:21.603861 r-panther-13 WARNING syncd#SDK: :- helperGetSwitchAttrOid: failed to get SAI_SWITCH_ATTR_DEFAULT_OVERRIDE_VIRTUAL_ROUTER_ID: SAI_STATUS_ATTR_NOT_IMPLEMENTED_0
Feb 10 11:38:21.604488 r-panther-13 NOTICE syncd#SDK: :- redisSaveInternalOids: put switch internal discovered rid oid:0x10010039 to Asic View and COLDVIDS
Feb 10 11:38:21.605191 r-panther-13 NOTICE syncd#SDK: :- redisSaveInternalOids: put switch internal discovered rid oid:0x11 to Asic View and COLDVIDS
Feb 10 11:38:21.608251 r-panther-13 NOTICE syncd#SDK: :- redisSaveInternalOids: put switch internal discovered rid oid:0x1c to Asic View and COLDVIDS
Feb 10 11:38:21.608999 r-panther-13 NOTICE syncd#SDK: :- redisSaveInternalOids: put switch internal discovered rid oid:0x10000001c to Asic View and COLDVIDS
Feb 10 11:38:21.738942 r-panther-13 NOTICE syncd#SDK: :- helperLoadColdVids: read 1386 COLD VIDS
Feb 10 11:38:21.739078 r-panther-13 NOTICE syncd#SDK: :- SaiSwitch: constructor took 1.018046 sec

Describe the results you received:

The SAI discovery process took 1.02 sec here, but we have seen different results on other platforms/configurations (up to 4 sec).
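To see where the 1.02 sec goes, the "took ... sec" lines from the log above can be pulled out and compared. This is only a rough sketch: the regular expression and the `durations` helper are illustrative and not part of syncd, and the three log lines are copied verbatim from the output above.

```python
import re

# The three syncd log lines above that report a duration.
LOG = """\
Feb 10 11:38:20.926013 r-panther-13 NOTICE syncd#SDK: :- discover: discover took 0.203495 sec
Feb 10 11:38:21.601662 r-panther-13 NOTICE syncd#SDK: :- helperSaveDiscoveredObjectsToRedis: save discovered objects to redis took 0.673484 sec
Feb 10 11:38:21.739078 r-panther-13 NOTICE syncd#SDK: :- SaiSwitch: constructor took 1.018046 sec
"""

# Matches ":- <function>: <free text> took <seconds> sec" within one line.
TOOK = re.compile(r":- (\w+): .*?took (\d+\.\d+) sec")


def durations(log):
    """Map each reporting function name to the duration it logged (seconds)."""
    return {m.group(1): float(m.group(2)) for m in TOOK.finditer(log)}


d = durations(LOG)
total = d["SaiSwitch"]  # whole SaiSwitch constructor, which includes discovery
for name in ("discover", "helperSaveDiscoveredObjectsToRedis"):
    share = 100 * d[name] / total
    print(f"{name}: {d[name]:.3f} s ({share:.0f}% of constructor)")
```

In this run the discover call itself accounts for about 0.2 sec, while saving the discovered objects to redis accounts for about 0.67 sec, roughly two thirds of the constructor time.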

Describe the results you expected:

From a fast/warm-reboot design standpoint, performing a lot of GET operations in the middle of switch boot delays the replay of configuration. Syncd could blindly replay the configuration as fast as possible and discover the default objects afterwards.

Output of show version:

SONiC Software Version: SONiC.master.0-3c1c7e23b
Distribution: Debian 11.6
Kernel: 5.10.0-18-2-amd64
Build commit: 3c1c7e23b
Build date: Fri Feb 10 10:44:37 UTC 2023
Built by: stepanb@r-build-sonic03

Platform: x86_64-mlnx_msn2700-r0
HwSKU: Mellanox-SN2700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2020T04244
Model Number: MSN2700-CS2FO
Hardware Revision: A2
Uptime: 11:52:06 up 14 min,  1 user,  load average: 1.35, 1.14, 0.90
Date: Fri 10 Feb 2023 11:52:06

Docker images:
REPOSITORY                    TAG                  IMAGE ID       SIZE
docker-syncd-mlnx             latest               c74a06aaefef   775MB
docker-syncd-mlnx             master.0-3c1c7e23b   c74a06aaefef   775MB
docker-orchagent              latest               e5c15be2b372   385MB
docker-orchagent              master.0-3c1c7e23b   e5c15be2b372   385MB
docker-fpm-frr                latest               ca590f8456b1   403MB
docker-fpm-frr                master.0-3c1c7e23b   ca590f8456b1   403MB
docker-teamd                  latest               63ae7f6b11fe   374MB
docker-teamd                  master.0-3c1c7e23b   63ae7f6b11fe   374MB
docker-macsec                 latest               b8b8f1ec35f8   376MB
docker-platform-monitor       latest               8a99e1c0d338   778MB
docker-platform-monitor       master.0-3c1c7e23b   8a99e1c0d338   778MB
docker-eventd                 latest               aa19834be0ed   357MB
docker-eventd                 master.0-3c1c7e23b   aa19834be0ed   357MB
docker-dhcp-relay             latest               ffb03963b964   366MB
docker-sonic-p4rt             latest               e6a0d2d3030c   927MB
docker-sonic-p4rt             master.0-3c1c7e23b   e6a0d2d3030c   927MB
docker-snmp                   latest               224060c4595c   397MB
docker-snmp                   master.0-3c1c7e23b   224060c4595c   397MB
docker-sonic-telemetry        latest               0ff254ca62cf   655MB
docker-sonic-telemetry        master.0-3c1c7e23b   0ff254ca62cf   655MB
docker-lldp                   latest               513b87c9af84   399MB
docker-lldp                   master.0-3c1c7e23b   513b87c9af84   399MB
docker-database               latest               707b19896280   357MB
docker-database               master.0-3c1c7e23b   707b19896280   357MB
docker-mux                    latest               b22673d61bd1   405MB
docker-mux                    master.0-3c1c7e23b   b22673d61bd1   405MB
docker-router-advertiser      latest               b9dfac24aae3   357MB
docker-router-advertiser      master.0-3c1c7e23b   b9dfac24aae3   357MB
docker-nat                    latest               f2a2c73a6f56   351MB
docker-nat                    master.0-3c1c7e23b   f2a2c73a6f56   351MB
docker-sflow                  latest               9038485e9854   349MB
docker-sflow                  master.0-3c1c7e23b   9038485e9854   349MB
docker-sonic-mgmt-framework   latest               bec28867667e   477MB
docker-sonic-mgmt-framework   master.0-3c1c7e23b   bec28867667e   477MB

Output of show techsupport:


Additional information you deem important (e.g. issue happens only occasionally):

sonic_dump_r-panther-13_20230210_114940.tar.gz

kcudnik commented 1 year ago

Discovery must be done right after switch creation to see which objects exist; it cannot be delayed until later.

arfeigin commented 3 months ago

Hi @kcudnik,

We are currently working on optimizing the fast-reboot flow for switches with a high number of ports. With 256 ports, the SAI discover that runs for each port consumes more than 8 seconds overall, during which orchagent is idle, waiting for syncd to finish creating the ports.

Is the SAI discover after port creation required in the fast-reboot init flow? In the fast-reboot flow there is no comparison logic, since the current view is empty (https://github.com/sonic-net/SONiC/blob/4ab89a9fdba3ced17f4e4d7f97892f93045905d1/doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md#42-syncd-point-of-view---initapply-view-framework).

We tried skipping the SAI discover that follows port creation in the fast-reboot flow on Nvidia platforms, running the community fast-reboot test multiple times. At least in that case, this saved ~6.5 seconds of dataplane downtime, which is more than 20% of the allowed disruption length. The system was also stable and no issues were observed.
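For a rough sense of scale, the figures quoted in this comment can be turned into a per-port cost and a share of the disruption budget. The sketch below makes two assumptions that the comment does not state: it reads the 8 seconds as the total across all 256 ports, and it uses 30 seconds as the allowed dataplane disruption.

```python
# Figures quoted in the comment above.
PORTS = 256
DISCOVER_TOTAL_S = 8.0  # "more than 8 seconds", so treat this as a lower bound
SAVED_S = 6.5           # dataplane downtime saved by skipping the discover

# Assumption, not stated in the comment: a 30 s dataplane disruption budget.
ALLOWED_DISRUPTION_S = 30.0

# Average discover cost per created port, in milliseconds (lower bound).
per_port_ms = 1000 * DISCOVER_TOTAL_S / PORTS

# Fraction of the assumed disruption budget recovered by skipping the discover.
saved_share = SAVED_S / ALLOWED_DISRUPTION_S

print(f"discover per port: at least {per_port_ms:.2f} ms")
print(f"saved share of budget: {saved_share:.1%}")
```

Under these assumptions the discover costs at least about 31 ms per port, and the saved 6.5 seconds is a little under 22% of the budget, consistent with the "more than 20%" stated above.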