sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
723 stars 1.38k forks source link

[202305][warm-reboot]: BGP Graceful-Restart timeout is reached due to hostcfgd failed with graceful shutdown #19682

Open nazariig opened 1 month ago

nazariig commented 1 month ago

Signed-off-by: Nazarii Hnydyn nazariig@nvidia.com

Description

During warm-reboot shutdown phase, hostcfgd sometimes doesn't stop with SIGTERM signal which causes a significant delay (systemd default timeout before sending SIGKILL is 90 sec) that leads to BGP Graceful-Restart timeout (default value is 240 sec).

The issue is caused by missing application graceful shutdown: usage of sys.exit in signal handler leads to the situation when main process keeps running even when shutdown is requested

https://github.com/sonic-net/sonic-host-services/blob/202305/scripts/hostcfgd#L86

def signal_handler(sig, frame):
    if sig == signal.SIGHUP:
        syslog.syslog(syslog.LOG_INFO, "HostCfgd: signal 'SIGHUP' is caught and ignoring..")
    elif sig == signal.SIGINT:
        syslog.syslog(syslog.LOG_INFO, "HostCfgd: signal 'SIGINT' is caught and exiting...")
        sys.exit(128 + sig)
    elif sig == signal.SIGTERM:
        syslog.syslog(syslog.LOG_INFO, "HostCfgd: signal 'SIGTERM' is caught and exiting...")
        sys.exit(128 + sig)
    else:
        syslog.syslog(syslog.LOG_INFO, "HostCfgd: invalid signal - ignoring..")

Note: the bug happens during warm-upgrade from 202305 to 202311

Steps to reproduce the issue:

  1. Run the script
    root@sonic:/home/admin# cat test.sh
    #!/bin/bash
    while true; do
        systemctl reset-failed hostcfgd.service
        systemctl stop hostcfgd
        systemctl start hostcfgd
        sleep 0.5
    done

Describe the results you received:

Jul 24 16:47:15.622756 sonic INFO hostcfgd: ConfigDB connect success
Jul 24 16:47:15.631146 sonic INFO hostcfgd: Starting port listener
Jul 24 16:47:15.646962 sonic INFO hostcfgd: => Processing port events !!!
Jul 24 16:47:15.650531 sonic INFO hostcfgd: message repeated 36 times: [ => Processing port events !!! ]
Jul 24 16:47:15.650610 sonic INFO hostcfgd: Updating delayed features after port initialization
Jul 24 16:47:15.650700 sonic INFO hostcfgd: => Processing port events !!!
Jul 24 16:47:15.652455 sonic INFO hostcfgd: message repeated 21 times: [ => Processing port events !!! ]
Jul 24 16:47:15.662422 sonic INFO hostcfgd: Waiting for systemctl to finish initialization
Jul 24 16:47:15.687298 sonic INFO hostcfgd: systemctl has finished initialization -- proceeding ...
Jul 24 16:47:15.726972 sonic INFO hostcfgd: Reloading systemd configuration files ...
Jul 24 16:47:15.745326 sonic INFO hostcfgd: HostCfgd: signal 'SIGTERM' is caught and exiting...
Jul 24 16:47:15.803302 sonic NOTICE systemd[1]: hostcfgd.service: Main process exited, code=exited, status=143/n/a
Jul 24 16:47:15.803833 sonic WARNING systemd[1]: hostcfgd.service: Failed with result 'exit-code'.
Jul 24 16:47:16.325849 sonic INFO hostcfgd: ConfigDB connect success
Jul 24 16:47:16.328093 sonic INFO hostcfgd: Starting port listener
Jul 24 16:47:16.348831 sonic INFO hostcfgd: HostCfgd: signal 'SIGTERM' is caught and exiting...
Jul 24 16:47:16.378889 sonic INFO hostcfgd: Waiting for systemctl to finish initialization
Jul 24 16:47:16.398630 sonic INFO hostcfgd: => Processing port events !!!
Jul 24 16:47:16.402122 sonic INFO hostcfgd: message repeated 36 times: [ => Processing port events !!! ]
Jul 24 16:47:16.405227 sonic INFO hostcfgd: Updating delayed features after port initialization
Jul 24 16:47:16.405383 sonic INFO hostcfgd: => Processing port events !!!
Jul 24 16:47:16.407076 sonic INFO hostcfgd: message repeated 21 times: [ => Processing port events !!! ]
Jul 24 16:47:16.438893 sonic INFO hostcfgd: systemctl has finished initialization -- proceeding ...
Jul 24 16:48:46.375468 sonic WARNING systemd[1]: hostcfgd.service: State 'stop-sigterm' timed out. Killing.
Jul 24 16:48:46.375623 sonic NOTICE systemd[1]: hostcfgd.service: Killing process 83574 (hostcfgd) with signal SIGKILL.
Jul 24 16:48:46.375719 sonic NOTICE systemd[1]: hostcfgd.service: Killing process 83592 (hostcfgd) with signal SIGKILL.
Jul 24 16:48:46.385393 sonic WARNING systemd[1]: hostcfgd.service: Main process exited, code=killed, status=9/KILL
Jul 24 16:48:46.385545 sonic WARNING systemd[1]: hostcfgd.service: Failed with result 'timeout'.
Jul 24 16:48:46.965250 sonic INFO systemd[1]: hostcfgd.service: Succeeded.
Jul 24 16:48:47.541712 sonic INFO systemd[1]: hostcfgd.service: Succeeded.
Jul 24 16:48:48.049211 sonic INFO hostcfgd: ConfigDB connect success
Jul 24 16:48:48.056175 sonic INFO hostcfgd: Starting port listener
Jul 24 16:48:48.071460 sonic INFO hostcfgd: => Processing port events !!!
Jul 24 16:48:48.075282 sonic INFO hostcfgd: message repeated 36 times: [ => Processing port events !!! ]
Jul 24 16:48:48.075355 sonic INFO hostcfgd: Updating delayed features after port initialization
Jul 24 16:48:48.075439 sonic INFO hostcfgd: => Processing port events !!!
Jul 24 16:48:48.079506 sonic INFO hostcfgd: message repeated 21 times: [ => Processing port events !!! ]
Jul 24 16:48:48.086933 sonic INFO hostcfgd: Waiting for systemctl to finish initialization
Jul 24 16:48:48.098833 sonic INFO hostcfgd: HostCfgd: signal 'SIGTERM' is caught and exiting...
Jul 24 16:48:48.186823 sonic NOTICE systemd[1]: hostcfgd.service: Main process exited, code=exited, status=143/n/a
Jul 24 16:48:48.186980 sonic WARNING systemd[1]: hostcfgd.service: Failed with result 'exit-code'.

Sometimes even core dumps are seen:

root@sonic:/home/admin# TERM=xterm-color cgdb python /var/core/hostcfgd.1721828951.87007.core
(gdb) set style enabled off
(gdb) bt
#0  0x00007f962abd5ce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f962abbf537 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f9628f977ec in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f9628fa2966 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f9628fa29d1 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f9628fa23cc in __gxx_personality_v0 () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f9628ef38a4 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#7  0x00007f9628ef3f4e in _Unwind_ForcedUnwind () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#7  0x00007f9628ef3f4e in _Unwind_ForcedUnwind () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#8  0x00007f962af1fc30 in __pthread_unwind () from /lib/x86_64-linux-gnu/libpthread.so.0
#9  0x00007f962af1718c in pthread_exit () from /lib/x86_64-linux-gnu/libpthread.so.0
#10 0x0000000000645df5 in PyThread_exit_thread ()
#11 0x00000000004262ae in ?? ()
#12 0x00000000005327b2 in PyEval_RestoreThread ()
#13 0x00007f962983ede6 in ?? () from /usr/lib/python3/dist-packages/swsscommon/_swsscommon.so
#14 0x000000000053f350 in ?? ()
#15 0x000000000053c3c9 in PyObject_Call ()
#16 0x0000000000517b9b in _PyEval_EvalFrameDefault ()
#17 0x00000000005106ed in ?? ()
#18 0x0000000000528d21 in _PyFunction_Vectorcall ()
#19 0x0000000000537b80 in ?? ()
#20 0x000000000051d6f5 in _PyObject_MakeTpCall ()
#21 0x00000000005175ba in _PyEval_EvalFrameDefault ()
#22 0x0000000000528b63 in _PyFunction_Vectorcall ()
#23 0x000000000053bdb1 in ?? ()
#24 0x0000000000513e8b in _PyEval_EvalFrameDefault ()
#25 0x0000000000528b63 in _PyFunction_Vectorcall ()
#26 0x0000000000512192 in _PyEval_EvalFrameDefault ()
#27 0x0000000000528b63 in _PyFunction_Vectorcall ()
#28 0x0000000000512192 in _PyEval_EvalFrameDefault ()
#29 0x0000000000528b63 in _PyFunction_Vectorcall ()
#30 0x000000000053bdb1 in ?? ()
#31 0x00000000006430b6 in ?? ()
#32 0x000000000063b164 in ?? ()
#33 0x00007f962af15ea7 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#34 0x00007f962ac98a6f in clone () from /lib/x86_64-linux-gnu/libc.so.6

Describe the results you expected:

Application suppose to be terminated few seconds after SIGTERM is received

Output of show version:

Output of show techsupport:

Additional information you deem important (e.g. issue happens only occasionally):

saiarcot895 commented 1 month ago

For my reference, a backtrace where hostcfgd hits SIGABRT:

(gdb) bt 25
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fc932e51537 in __GI_abort () at abort.c:79
#2  0x00007fc9311517ec in __gnu_cxx::__verbose_terminate_handler () at ../../../../src/libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007fc93115c966 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:48
#4  0x00007fc93115c9d1 in std::terminate () at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:58
#5  0x00007fc93115c3cc in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=10, exception_class=0, ue_header=0x7fc930682d70, context=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_personality.cc:673
#6  0x00007fc9310ad8a4 in _Unwind_ForcedUnwind_Phase2 (exc=0x7fc930682d70, context=0x7fc9306811c0, frames_p=0x7fc9306810c8) at ../../../src/libgcc/unwind.inc:182
#7  0x00007fc9310adf4e in _Unwind_ForcedUnwind (exc=0x7fc930682d70, stop=stop@entry=0x7fc9331b1ab0 <unwind_stop>, stop_argument=0x7fc930681f10) at ../../../src/libgcc/unwind.inc:217
#8  0x00007fc9331b1c30 in __GI___pthread_unwind (buf=<optimized out>) at unwind.c:121
#9  0x00007fc9331a918c in __do_cancel () at pthreadP.h:310
#10 __pthread_exit (value=value@entry=0x0) at pthread_exit.c:28
#11 0x0000000000645df5 in PyThread_exit_thread () at ../Python/thread_pthread.h:373
#12 0x00000000004262ae in take_gil (tstate=0x2bd6370) at ../Python/ceval_gil.h:224
#13 0x00000000005327b2 in PyEval_RestoreThread (tstate=tstate@entry=0x2bd6370) at ../Python/ceval.c:467
#14 0x00007fc9319f48de in PyThreadStateGuard::~PyThreadStateGuard (this=<synthetic pointer>, __in_chrg=<optimized out>) at pyext/py3/swsscommon_wrap.cpp:32665
#15 _wrap_Select_select (args=<optimized out>, kwargs=<optimized out>) at pyext/py3/swsscommon_wrap.cpp:32667
#16 0x000000000053f350 in cfunction_call (func=<built-in method Select_select of module object at remote 0x7fc931b28540>, args=<optimized out>, kwargs=<optimized out>) at ../Objects/methodobject.c:539
#17 0x000000000051d89b in _PyObject_MakeTpCall (tstate=0x2bd6370, callable=<built-in method Select_select of module object at remote 0x7fc931b28540>, args=<optimized out>, nargs=<optimized out>, keywords=<optimized out>) at ../Objects/call.c:191
#18 0x00000000005175ba in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7fc9306c21d0, callable=<built-in method Select_select of module object at remote 0x7fc931b28540>, tstate=0x2bd6370) at ../Include/cpython/abstract.h:116
#19 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7fc9306c21d0, callable=<built-in method Select_select of module object at remote 0x7fc931b28540>, tstate=0x2bd6370) at ../Include/cpython/abstract.h:103
#20 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7fc9306c21d0, callable=<built-in method Select_select of module object at remote 0x7fc931b28540>) at ../Include/cpython/abstract.h:127
#21 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x2bd6370) at ../Python/ceval.c:5072
#22 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:3487
#23 0x00000000005106ed in _PyEval_EvalFrame (throwflag=0,
    f=Frame 0x7fc9306c2040, for file /usr/lib/python3/dist-packages/swsscommon/swsscommon.py, line 2112, in select (self=<Select(this=<SwigPyObject at remote 0x7fc9306c1330>) at remote 0x7fc9306c13a0>, timeout=180000, interrupt_on_signal=False), tstate=0x2bd6370)
    at ../Include/internal/pycore_ceval.h:40
#24 _PyEval_EvalCode (tstate=0x2bd6370, _co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=0x0, kwargs=0x7fc9309a2bb8, kwcount=<optimized out>, kwstep=1, defs=0x7fc931c09cd8, defcount=<optimized out>,
    kwdefs=0x0, closure=0x0, name='select', qualname='Select.select') at ../Python/ceval.c:4327
saiarcot895 commented 1 month ago

Just to check, can you verify that it is not seen in 202311?

vaibhavhd commented 1 month ago

@nazariig, can you please respond to @saiarcot895 's question and comment?