sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
722 stars 1.38k forks source link

[boot performance] Orchagent daemon starts >20 sec after SWSS container started #13765

Open stepanblyschak opened 1 year ago

stepanblyschak commented 1 year ago

Description

Orchagent daemon starts >20 sec after SWSS container started:

Feb  8 13:14:13.295576 r-spider-05 INFO systemd[1]: Starting switch state service...
Feb  8 13:14:16.147920 r-spider-05 INFO systemd[1]: Started switch state service.
Feb  8 13:14:37.219074 r-spider-05 NOTICE swss#orchagent: :- main: --- Starting Orchestration Agent ---

Steps to reproduce the issue:

  1. Cold/fast/warm reboot

Describe the results you received:

Systemd SWSS start script takes 2 sec:

Supervisor sleeps for 1 sec before spawning any processes:

Supervisor dependent-startup plugin starts processes:

Feb  8 13:14:13.295576 r-spider-05 INFO systemd[1]: Starting switch state service...
Feb  8 13:14:16.147920 r-spider-05 INFO systemd[1]: Started switch state service.
Feb  8 13:14:21.839345 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:20,032 INFO spawned: 'dependent-startup' with pid 30
Feb  8 13:14:21.839429 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:20,041 INFO spawned: 'supervisor-proc-exit-listener' with pid 31
Feb  8 13:14:21.839682 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:21,699 INFO spawned: 'rsyslogd' with pid 34
Feb  8 13:14:24.752618 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:24,750 INFO spawned: 'gearsyncd' with pid 38
Feb  8 13:14:24.764069 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:24,762 INFO spawned: 'portsyncd' with pid 39
Feb  8 13:14:24.781397 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:24,774 INFO spawned: 'containercfgd' with pid 40
Feb  8 13:14:27.798592 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:27,789 INFO spawned: 'orchagent' with pid 47
Feb  8 13:14:29.898832 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:29,887 INFO spawned: 'swssconfig' with pid 49
Feb  8 13:14:30.066083 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:30,065 INFO spawned: 'coppmgrd' with pid 50
Feb  8 13:14:32.460626 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:32,458 INFO spawned: 'restore_neighbors' with pid 56
Feb  8 13:14:32.483029 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:32,482 INFO spawned: 'arp_update' with pid 57
Feb  8 13:14:32.547472 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:32,546 INFO spawned: 'neighsyncd' with pid 58
Feb  8 13:14:32.603015 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:32,594 INFO spawned: 'wait_for_link' with pid 61
Feb  8 13:14:32.644283 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:32,632 INFO spawned: 'vlanmgrd' with pid 62
Feb  8 13:14:32.679490 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:32,670 INFO spawned: 'intfmgrd' with pid 65
Feb  8 13:14:32.735057 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:32,711 INFO spawned: 'portmgrd' with pid 67
Feb  8 13:14:32.791118 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:32,780 INFO spawned: 'buffermgrd' with pid 69
Feb  8 13:14:32.851638 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:32,825 INFO spawned: 'enable_counters' with pid 73
Feb  8 13:14:32.931876 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:32,923 INFO spawned: 'vrfmgrd' with pid 80
Feb  8 13:14:33.092733 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:33,091 INFO spawned: 'nbrmgrd' with pid 86
Feb  8 13:14:33.205972 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:33,162 INFO spawned: 'vxlanmgrd' with pid 89
Feb  8 13:14:33.355900 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:33,299 INFO spawned: 'fdbsyncd' with pid 96
Feb  8 13:14:33.588196 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:33,581 INFO spawned: 'tunnelmgrd' with pid 103
Feb  8 13:14:36.761280 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:36,755 INFO spawned: 'wait_for_link' with pid 237
Feb  8 13:14:37.219074 r-spider-05 NOTICE swss#orchagent: :- main: --- Starting Orchestration Agent ---
Feb  8 13:14:40.151694 r-spider-05 INFO swss#supervisord 2023-02-08 13:14:40,150 INFO spawned: 'ndppd' with pid 353

orchagent is configured to depend on portsyncd which depends on rsyslogd. Since supervisord has startsecs=1 by default for every process, there will be at least 2 seconds delay. Actually, rsyslogd is not guaranteed to be available in 1 sec, so this synchronization is weak. However, we can observe 3+3 sec delay before starting orchagent which imply possible supervisord_dependent_startup overhead. Rest of the time - around 10 sec we spend in /usr/bin/orchagent.sh before orchagent is started. Also, the delay is not consistent ranging from 15-22 sec in testing. This creates a lot of noise when trying to analyze other possibilities for optimizations.

Describe the results you expected:

SWSS orchagent is critical daemon. It is necessary to start it as soon as possible at boot to avoid any delays. Expected orchagent start imidiatelly after container is started.

Output of show version:

SONiC Software Version: SONiC.202211_RC4.1-a073b6554_Internal
Distribution: Debian 11.6
Kernel: 5.10.0-18-2-amd64
Build commit: a073b6554
Build date: Mon Feb  6 12:22:18 UTC 2023
Built by: sw-r2d2-bot@r-build-sonic-ci03-243

Platform: x86_64-mlnx_msn2700-r0
HwSKU: Mellanox-SN2700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2020T04244
Model Number: MSN2700-CS2FO
Hardware Revision: A2
Uptime: 11:02:11 up 8 min,  1 user,  load average: 0.30, 1.05, 0.76
Date: Fri 10 Feb 2023 11:02:11

Docker images:
REPOSITORY                                         TAG                               IMAGE ID       SIZE
docker-syncd-mlnx                                  202211_RC4.1-a073b6554_Internal   afb7892b15b6   916MB
docker-syncd-mlnx                                  latest                            afb7892b15b6   916MB
docker-orchagent                                   202211_RC4.1-a073b6554_Internal   a579f39c356c   529MB
docker-orchagent                                   latest                            a579f39c356c   529MB
docker-fpm-frr                                     202211_RC4.1-a073b6554_Internal   6e107ab059bd   540MB
docker-fpm-frr                                     latest                            6e107ab059bd   540MB
docker-teamd                                       202211_RC4.1-a073b6554_Internal   cd85e7fb0d5a   510MB
docker-teamd                                       latest                            cd85e7fb0d5a   510MB
docker-macsec                                      latest                            820e8d5fa373   512MB
docker-sonic-telemetry                             202211_RC4.1-a073b6554_Internal   5deb764e2549   791MB
docker-sonic-telemetry                             latest                            5deb764e2549   791MB
docker-platform-monitor                            202211_RC4.1-a073b6554_Internal   81727410e630   920MB
docker-platform-monitor                            latest                            81727410e630   920MB
docker-snmp                                        202211_RC4.1-a073b6554_Internal   72418bff1490   539MB
docker-snmp                                        latest                            72418bff1490   539MB
docker-dhcp-relay                                  latest                            9362ef77c572   502MB
docker-eventd                                      202211_RC4.1-a073b6554_Internal   326438dfb6fe   493MB
docker-eventd                                      latest                            326438dfb6fe   493MB
docker-lldp                                        202211_RC4.1-a073b6554_Internal   30e086e77134   535MB
docker-lldp                                        latest                            30e086e77134   535MB
docker-mux                                         202211_RC4.1-a073b6554_Internal   8f34ff38e6a1   541MB
docker-mux                                         latest                            8f34ff38e6a1   541MB
docker-database                                    202211_RC4.1-a073b6554_Internal   df28aac82383   493MB
docker-database                                    latest                            df28aac82383   493MB
docker-router-advertiser                           202211_RC4.1-a073b6554_Internal   7a03d5ac5d1c   493MB
docker-router-advertiser                           latest                            7a03d5ac5d1c   493MB
docker-sonic-p4rt                                  202211_RC4.1-a073b6554_Internal   c0808efe3ad0   575MB
docker-sonic-p4rt                                  latest                            c0808efe3ad0   575MB
docker-sonic-mgmt-framework                        202211_RC4.1-a073b6554_Internal   6fcac32351a8   611MB
docker-sonic-mgmt-framework                        latest                            6fcac32351a8   611MB
docker-nat                                         202211_RC4.1-a073b6554_Internal   6f5a251b1c76   480MB
docker-nat                                         latest                            6f5a251b1c76   480MB
docker-sflow                                       202211_RC4.1-a073b6554_Internal   ff6e122bd91e   478MB
docker-sflow                                       latest                            ff6e122bd91e   478MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh   1.0.0-master-internal-108         6f0490e775d3   327MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/doroce      1.0.0-master-internal-11          9f27047097ac   200MB

Output of show techsupport:

Additional information you deem important (e.g. issue happens only occasionally):

sysdump_test_push_gate_reboot_policer.tar.gz

prgeor commented 1 year ago

@prsunny need improvement in fast-reboot

stepanblyschak commented 1 year ago

To clarify, this is a degradation comparing to older images, for example (time is different as taken from another platform):

202012:

Mar  9 15:57:40.792921 sonic INFO systemd[1]: Started switch state service.
Mar  9 15:57:48.824060 sonic NOTICE swss#orchagent: :- main: --- Starting Orchestration Agent ---

8 sec

202211:

Mar  9 16:07:14.097283 sonic INFO systemd[1]: Started switch state service.
Mar  9 16:07:26.429175 r-bulldog-02 NOTICE swss#orchagent: :- main: --- Starting Orchestration Agent ---

12 sec
stepanblyschak commented 1 year ago

@prsunny @vaibhavhd @yxieca I did some POC to measure improvements and I observed that while I am able to get orchagent start much quicker it does not have an affect on control plane restoration time. We must have both orchagent running and syncd running in order to start re-configuration, however, currently syncd is a limiting factor on a platform I am doing this test. Syncd starts even later than orchagent.

To conclude, I am holding on with orchagent improvements until I find an optimization path for syncd. Once we have syncd start sooner than orchagent we can proceed with orchagent optimizations.