Open nazariig opened 1 month ago
For my reference, a backtrace where hostcfgd hits SIGABRT:
(gdb) bt 25
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fc932e51537 in __GI_abort () at abort.c:79
#2 0x00007fc9311517ec in __gnu_cxx::__verbose_terminate_handler () at ../../../../src/libstdc++-v3/libsupc++/vterminate.cc:95
#3 0x00007fc93115c966 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:48
#4 0x00007fc93115c9d1 in std::terminate () at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:58
#5 0x00007fc93115c3cc in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=10, exception_class=0, ue_header=0x7fc930682d70, context=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_personality.cc:673
#6 0x00007fc9310ad8a4 in _Unwind_ForcedUnwind_Phase2 (exc=0x7fc930682d70, context=0x7fc9306811c0, frames_p=0x7fc9306810c8) at ../../../src/libgcc/unwind.inc:182
#7 0x00007fc9310adf4e in _Unwind_ForcedUnwind (exc=0x7fc930682d70, stop=stop@entry=0x7fc9331b1ab0 <unwind_stop>, stop_argument=0x7fc930681f10) at ../../../src/libgcc/unwind.inc:217
#8 0x00007fc9331b1c30 in __GI___pthread_unwind (buf=<optimized out>) at unwind.c:121
#9 0x00007fc9331a918c in __do_cancel () at pthreadP.h:310
#10 __pthread_exit (value=value@entry=0x0) at pthread_exit.c:28
#11 0x0000000000645df5 in PyThread_exit_thread () at ../Python/thread_pthread.h:373
#12 0x00000000004262ae in take_gil (tstate=0x2bd6370) at ../Python/ceval_gil.h:224
#13 0x00000000005327b2 in PyEval_RestoreThread (tstate=tstate@entry=0x2bd6370) at ../Python/ceval.c:467
#14 0x00007fc9319f48de in PyThreadStateGuard::~PyThreadStateGuard (this=<synthetic pointer>, __in_chrg=<optimized out>) at pyext/py3/swsscommon_wrap.cpp:32665
#15 _wrap_Select_select (args=<optimized out>, kwargs=<optimized out>) at pyext/py3/swsscommon_wrap.cpp:32667
#16 0x000000000053f350 in cfunction_call (func=<built-in method Select_select of module object at remote 0x7fc931b28540>, args=<optimized out>, kwargs=<optimized out>) at ../Objects/methodobject.c:539
#17 0x000000000051d89b in _PyObject_MakeTpCall (tstate=0x2bd6370, callable=<built-in method Select_select of module object at remote 0x7fc931b28540>, args=<optimized out>, nargs=<optimized out>, keywords=<optimized out>) at ../Objects/call.c:191
#18 0x00000000005175ba in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7fc9306c21d0, callable=<built-in method Select_select of module object at remote 0x7fc931b28540>, tstate=0x2bd6370) at ../Include/cpython/abstract.h:116
#19 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7fc9306c21d0, callable=<built-in method Select_select of module object at remote 0x7fc931b28540>, tstate=0x2bd6370) at ../Include/cpython/abstract.h:103
#20 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7fc9306c21d0, callable=<built-in method Select_select of module object at remote 0x7fc931b28540>) at ../Include/cpython/abstract.h:127
#21 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x2bd6370) at ../Python/ceval.c:5072
#22 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:3487
#23 0x00000000005106ed in _PyEval_EvalFrame (throwflag=0,
f=Frame 0x7fc9306c2040, for file /usr/lib/python3/dist-packages/swsscommon/swsscommon.py, line 2112, in select (self=<Select(this=<SwigPyObject at remote 0x7fc9306c1330>) at remote 0x7fc9306c13a0>, timeout=180000, interrupt_on_signal=False), tstate=0x2bd6370)
at ../Include/internal/pycore_ceval.h:40
#24 _PyEval_EvalCode (tstate=0x2bd6370, _co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=0x0, kwargs=0x7fc9309a2bb8, kwcount=<optimized out>, kwstep=1, defs=0x7fc931c09cd8, defcount=<optimized out>,
kwdefs=0x0, closure=0x0, name='select', qualname='Select.select') at ../Python/ceval.c:4327
Just to check, can you verify that it is not seen in 202311?
@nazariig, can you please respond to @saiarcot895's question and comment?
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
Description
During the `warm-reboot` shutdown phase, `hostcfgd` sometimes doesn't stop on the `SIGTERM` signal. This causes a significant delay (the `systemd` default timeout before sending `SIGKILL` is 90 sec), which leads to a BGP Graceful-Restart timeout (default value is 240 sec).

The issue is caused by a missing graceful application shutdown: calling `sys.exit` in the signal handler leads to a situation where the main process keeps running even after shutdown is requested:
https://github.com/sonic-net/sonic-host-services/blob/202305/scripts/hostcfgd#L86
Note: the bug happens during `warm-upgrade` from `202305` to `202311`.
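
For illustration, the graceful-shutdown pattern described above can be sketched as follows. This is a minimal sketch, not the actual `hostcfgd` change: `GracefulShutdown`, `run_loop` and `poll_once` are hypothetical names, and it assumes the main loop polls with a bounded timeout so that the stop flag is re-checked regularly. The signal handler only records the request; the main loop then exits on its own and joins any worker threads before the interpreter shuts down.

```python
import signal
import threading


class GracefulShutdown:
    """Hypothetical helper: records a shutdown request instead of exiting."""

    def __init__(self):
        self._stop = threading.Event()

    def install(self, signums=(signal.SIGTERM, signal.SIGINT)):
        # signal.signal() must be called from the main thread.
        for signum in signums:
            signal.signal(signum, self.handle)

    def handle(self, signum, frame):
        # Do not call sys.exit() here; only set the flag.
        self._stop.set()

    @property
    def requested(self):
        return self._stop.is_set()


def run_loop(shutdown, poll_once, workers=()):
    """Call poll_once() until shutdown is requested, then join workers.

    poll_once is assumed to return within a short, bounded time
    (for example, a select call with a timeout of about a second).
    """
    iterations = 0
    while not shutdown.requested:
        poll_once()
        iterations += 1
    # Wait for worker threads so none is still running at interpreter exit.
    for worker in workers:
        worker.join()
    return iterations
```

In a daemon, `install()` would be called once at startup and `run_loop()` would wrap the existing event loop. How quickly the process reacts to `SIGTERM` is then bounded by how long a single `poll_once` call can block.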
Steps to reproduce the issue:
Describe the results you received:
Sometimes even core dumps are seen:
Describe the results you expected:
The application is supposed to terminate within a few seconds after `SIGTERM` is received.
Output of `show version`:
Output of `show techsupport`:
Additional information you deem important (e.g. issue happens only occasionally):