sonic-net / sonic-buildimage


[warm-reboot][scale] warm-reboot crash with 1K sub interfaces and 1K IPv4 and 1K IPv6 neighbors #19263

Open stepanblyschak opened 3 months ago

stepanblyschak commented 3 months ago

Description

Configuration:

Steps to reproduce the issue:

  1. sudo warm-reboot

Describe the results you received:

Timeout in neighsyncd

Apr 12 12:51:48.762520 qa-eth-vt16-5-4600va1 ERR swss#neighsyncd: :- main: neighbor table restore is not finished after timed-out, exit!!!

followed by an swss restart.

restore_neighbor logs (first and last message):

Apr 12 12:45:00.435457 sonic INFO swss#restore_neighbor: Add neighbor entries: family: IPv4, intf_idx: 2, ip: 10.7.184.1, mac: 00:00:5e:00:01:01
...
Apr 12 12:51:51.100192 qa-eth-vt16-5-4600va1 INFO swss#restore_neighbor: Add neighbor entries: family: IPv6, intf_idx: 898, ip: 2000:0:0:33c::2, mac: 00:02:00:00:00:03

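Restoring all neighbors on 1K interfaces takes more than 6 min (12:45:00 to 12:51:51 in the restore_neighbor logs above), so neighsyncd hits its timeout before the restore finishes.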
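For reference, the elapsed time can be computed directly from the two restore_neighbor timestamps quoted above. This is only a small sketch: the helper name `elapsed_seconds` is made up for illustration, and the per-neighbor figure assumes roughly 2000 neighbors in total (1K IPv4 plus 1K IPv6), which is an approximation taken from the issue title rather than a measured count.

```python
from datetime import datetime

def elapsed_seconds(first: str, last: str, fmt: str = "%b %d %H:%M:%S.%f") -> float:
    """Return the seconds between two syslog-style timestamps from the same day/year."""
    # Syslog timestamps carry no year; strptime puts both into the same default year,
    # so the difference is still correct as long as the logs do not cross a year boundary.
    start = datetime.strptime(first, fmt)
    end = datetime.strptime(last, fmt)
    return (end - start).total_seconds()

# First and last restore_neighbor messages quoted in this issue.
first_msg = "Apr 12 12:45:00.435457"
last_msg = "Apr 12 12:51:51.100192"

total = elapsed_seconds(first_msg, last_msg)
minutes, seconds = divmod(total, 60)
print(f"restore took {int(minutes)} min {seconds:.3f} s")

# Assumption: about 2000 neighbors (1K IPv4 + 1K IPv6), per the issue title.
approx_neighbors = 2000
print(f"roughly {total / approx_neighbors:.3f} s per neighbor")
```

Note that the neighsyncd timeout message is logged at 12:51:48, a few seconds before the last restore_neighbor message at 12:51:51.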

The performance issue appears to be in the creation of a socket and binding it to the correct interface in restore_neighbor.py. A scenario with 1K IPv4 and 1K IPv6 neighbors but without the 1K sub interfaces works fine, and restoration completes in under 2 min.
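One way to check this hypothesis in isolation is to time socket setup per interface outside of the warm-reboot flow. The sketch below is not code from restore_neighbor.py: `open_bound_socket` and `time_socket_setup` are hypothetical helpers, and the use of a raw `AF_PACKET` socket is an assumption about what "creating a socket and binding it to an interface" looks like on Linux.

```python
import socket
import time
from typing import Callable, Dict, Iterable

def open_bound_socket(ifname: str) -> socket.socket:
    """Open a raw packet socket bound to one interface (Linux only, needs CAP_NET_RAW)."""
    sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
    sock.bind((ifname, 0))
    return sock

def time_socket_setup(
    ifnames: Iterable[str],
    opener: Callable[[str], object] = open_bound_socket,
) -> Dict[str, float]:
    """Measure how long `opener` takes for each interface name, in seconds."""
    timings: Dict[str, float] = {}
    for ifname in ifnames:
        start = time.perf_counter()
        sock = opener(ifname)
        timings[ifname] = time.perf_counter() - start
        # Close whatever the opener returned, if it can be closed,
        # so the measurement does not leak file descriptors.
        close = getattr(sock, "close", None)
        if callable(close):
            close()
    return timings

def summarize(timings: Dict[str, float]) -> str:
    """Format total and worst-case setup time for a quick comparison between runs."""
    if not timings:
        return "no interfaces measured"
    worst = max(timings, key=timings.get)
    return (f"{len(timings)} interfaces, total {sum(timings.values()):.3f} s, "
            f"slowest {worst} at {timings[worst]:.3f} s")
```

Run as root on the device, something like `print(summarize(time_socket_setup(names)))` with `names` set to the sub interface names would show whether setup time per interface grows with the number of interfaces. The `opener` parameter exists so the same harness can time a different setup path, or a stub in a test.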

Describe the results you expected:

Output of show version:

SONiC Software Version: SONiC.202311_RC.19-df396f168_Internal
SONiC OS Version: 11
Distribution: Debian 11.9
Kernel: 5.10.0-23-2-amd64
Build commit: 3e849c519
Build date: Thu Mar 21 14:50:42 UTC 2024
Built by: sw-r2d2-bot@r-build-sonic-ci02-242

Platform: x86_64-mlnx_msn4600-r0
HwSKU: ACS-MSN4600
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2302XZ00GQ
Model Number: MSN4600-VS2FO
Hardware Revision: A1
Uptime: 12:50:58 up 6 min,  1 user,  load average: 4.06, 3.19, 1.56
Date: Fri 12 Apr 2024 12:50:58

Docker images:
REPOSITORY                                         TAG                               IMAGE ID       SIZE
docker-syncd-mlnx                                  202311_RC.19-df396f168_Internal   ad7dabef3e79   769MB
docker-syncd-mlnx                                  latest                            ad7dabef3e79   769MB
docker-platform-monitor                            202311_RC.19-df396f168_Internal   ff4de2c8a134   758MB
docker-platform-monitor                            latest                            ff4de2c8a134   758MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh   1.7.0-202311-069                  237b4f99ad97   442MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/doai        1.2.0-202311-027                  8871ffa1de81   281MB
docker-orchagent                                   202311_RC.19-df396f168_Internal   3d8d27631784   339MB
docker-orchagent                                   latest                            3d8d27631784   339MB
docker-fpm-frr                                     202311_RC.19-df396f168_Internal   0c7e726219b6   359MB
docker-fpm-frr                                     latest                            0c7e726219b6   359MB
docker-nat                                         202311_RC.19-df396f168_Internal   4ae055be1086   330MB
docker-nat                                         latest                            4ae055be1086   330MB
docker-sflow                                       202311_RC.19-df396f168_Internal   a3a364ab6349   329MB
docker-sflow                                       latest                            a3a364ab6349   329MB
docker-teamd                                       202311_RC.19-df396f168_Internal   404b7f483fa2   327MB
docker-teamd                                       latest                            404b7f483fa2   327MB
docker-macsec                                      202311_RC.17-df396f168_Internal   e7a8339aff0d   330MB
docker-dhcp-relay                                  latest                            c44a0441d56a   310MB
docker-snmp                                        202311_RC.19-df396f168_Internal   d19d04496658   340MB
docker-snmp                                        latest                            d19d04496658   340MB
docker-eventd                                      202311_RC.19-df396f168_Internal   4660f7eeeddf   301MB
docker-eventd                                      latest                            4660f7eeeddf   301MB
docker-lldp                                        202311_RC.19-df396f168_Internal   907e55f6feaa   343MB
docker-lldp                                        latest                            907e55f6feaa   343MB
docker-mux                                         202311_RC.19-df396f168_Internal   20dce1baa74b   350MB
docker-mux                                         latest                            20dce1baa74b   350MB
docker-database                                    202311_RC.19-df396f168_Internal   c19ec3cdb7d8   301MB
docker-database                                    latest                            c19ec3cdb7d8   301MB
docker-sonic-gnmi                                  202311_RC.19-df396f168_Internal   d076ae802ce3   389MB
docker-sonic-gnmi                                  latest                            d076ae802ce3   389MB
docker-router-advertiser                           202311_RC.19-df396f168_Internal   ac0df650588e   301MB
docker-router-advertiser                           latest                            ac0df650588e   301MB
docker-sonic-mgmt-framework                        202311_RC.19-df396f168_Internal   cbf2c0368160   417MB
docker-sonic-mgmt-framework                        latest                            cbf2c0368160   417MB

Output of show techsupport:


Additional information you deem important (e.g. issue happens only occasionally):

prabhataravind commented 3 months ago

@vaibhavhd could you please take a look?

prabhataravind commented 3 months ago

@stepanblyschak Can you please confirm that the issue is not seen if you don't have 1k subinterfaces with 1k ipv4/ipv6 neighbors?

stepanblyschak commented 3 months ago

@prabhataravind The issue is not seen when I have 1k ipv4/ipv6 neighbors on 1 or 2 RIFs. So it is the large number of RIFs that triggers the problem.