sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

chassis: chassisd process on LC crashes when database-chassis restarts on Sup #15486

Open anamehra opened 1 year ago

anamehra commented 1 year ago

Description

The chassisd process on the LC crashes when database-chassis goes down on the supervisor as part of sonic-mgmt tests such as restart docker.service, Sup watchdog reboot, etc. This generates a core:

Jun 14 21:07:19.267491 sfd-lt2-lc0 INFO pmon#supervisord: chassisd Traceback (most recent call last):
Jun 14 21:07:19.267514 sfd-lt2-lc0 INFO pmon#supervisord: chassisd File "/usr/local/bin/chassisd", line 471, in
Jun 14 21:07:19.267514 sfd-lt2-lc0 INFO pmon#supervisord: chassisd main()
Jun 14 21:07:19.267514 sfd-lt2-lc0 INFO pmon#supervisord: chassisd File "/usr/local/bin/chassisd", line 466, in main
Jun 14 21:07:19.267525 sfd-lt2-lc0 INFO pmon#supervisord: chassisd chassisd.run()
Jun 14 21:07:19.267532 sfd-lt2-lc0 INFO pmon#supervisord: chassisd File "/usr/local/bin/chassisd", line 445, in run
Jun 14 21:07:19.267532 sfd-lt2-lc0 INFO pmon#supervisord: chassisd self.module_updater.module_db_update()
Jun 14 21:07:19.267549 sfd-lt2-lc0 INFO pmon#supervisord: chassisd File "/usr/local/bin/chassisd", line 264, in module_db_update
Jun 14 21:07:19.267549 sfd-lt2-lc0 INFO pmon#supervisord: chassisd self.asic_table.set(asic_key, asic_fvs)
Jun 14 21:07:19.267568 sfd-lt2-lc0 INFO pmon#supervisord: chassisd File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 2237, in set
Jun 14 21:07:19.267833 sfd-lt2-lc0 INFO pmon#supervisord: chassisd return _swsscommon.Table_set(self, *args)
Jun 14 21:07:19.267833 sfd-lt2-lc0 INFO pmon#supervisord: chassisd RuntimeError: RedisError: Failed to redisGetReply in RedisPipeline::pop, err=1: errstr=Connection reset by peer
Jun 14 21:07:19.275527 sfd-lt2-lc0 INFO pmon#supervisord: chassisd terminate called after throwing an instance of 'swss::RedisError'
Jun 14 21:07:19.275527 sfd-lt2-lc0 INFO pmon#supervisord: chassisd what(): RedisError: Failed to redisGetReply in RedisPipeline::pop, err=1: errstr=Connection reset by peer
Jun 14 21:07:19.275675 sfd-lt2-lc0 INFO pmon#supervisord: chassisd
Jun 14 21:07:19.632037 sfd-lt2-lc0 INFO pmon#supervisord 2023-06-14 21:07:19,631 INFO exited: chassisd (terminated by SIGABRT (core dumped); not expected)
Jun 14 21:07:20.333821 sfd-lt2-lc0 INFO bgp2#supervisord 2023-06-14 21:07:20,333 INFO waiting for supervisor-proc-exit-listener, rsyslogd, staticd, zebra, bgpd, bgpcfgd to die
Jun 14 21:07:20.633609 sfd-lt2-lc0 INFO pmon#supervisord 2023-06-14 21:07:20,633 INFO spawned: 'chassisd' with pid 447

anamehra commented 1 year ago

@abdosi, FYI.

prgeor commented 1 year ago

@arlakshm, can you take a look?

arlakshm commented 1 year ago

Hi @anamehra, this is expected behavior: if database-chassis is not running, any process that tries to write to chassis-db will exit.

anamehra commented 1 year ago

> Hi @anamehra, this is expected behavior if the database-chassis is not running then any process trying to write to chassis-db will exit.

Hi @arlakshm, this caused a chassisd core and failed a sonic-mgmt test case. How should we handle this in sonic-mgmt? In one scenario, chassisd on the LC kept restarting while the chassis redis server was down, and it entered the FATAL state because it kept exiting too quickly.

amulyan7 commented 1 year ago

@arlakshm To handle this use case, can we explore handling the redis error and retrying the connection with a defined retry count/timeout?
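
To make the retry idea concrete, here is a minimal sketch of a bounded retry with backoff. It is an illustration only, not chassisd code: `call_with_retry` and `FlakyTable` are hypothetical names, and `FlakyTable` merely stands in for a chassis-db table so the example runs without redis. It assumes the failure reaches Python as a `RuntimeError`, as in the traceback above. A real fix would probably also need to re-create the database connection before each retry, and it may not be enough on its own, since the log also shows `terminate called after throwing an instance of 'swss::RedisError'` on the C++ side.

```python
import time


def call_with_retry(operation, retries=5, delay_s=1.0, backoff=2.0,
                    retry_on=(RuntimeError,), sleep=time.sleep):
    """Call operation(); on a retryable error, wait and try again.

    Re-raises the last error once `retries` additional attempts are used up.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except retry_on:
            if attempt >= retries:
                raise
            sleep(delay_s)        # wait before the next attempt
            delay_s *= backoff    # grow the wait each time
            attempt += 1


class FlakyTable:
    """Hypothetical stand-in for a chassis-db table: fails N times, then works."""

    def __init__(self, failures):
        self.failures = failures
        self.rows = {}

    def set(self, key, fvs):
        if self.failures > 0:
            self.failures -= 1
            raise RuntimeError("RedisError: Connection reset by peer")
        self.rows[key] = fvs


# Usage: two simulated failures, then the write goes through.
table = FlakyTable(failures=2)
waits = []  # record the requested delays instead of actually sleeping
call_with_retry(lambda: table.set("asic0", [("name", "asic0")]),
                retries=3, delay_s=0.5, sleep=waits.append)
```

Capping the retry count keeps the daemon from spinning forever if the supervisor never comes back, while the growing delay avoids hammering a redis server that is still starting up.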

abdosi commented 4 months ago

@anamehra, is this issue still applicable? What is the latest on this?