intel-staging / libptpmgmt_iaclocklib

The Clock Manager is a library for monitoring network time synchronization on the local platform. Disclaimer: This project is under development. All source code and features on the main branch is for the purpose of testing or evaluation and not production ready. We will upstream the code and archive this GitHub repo thereafter.
Other
4 stars 5 forks source link

Library hangs in jcl_status_wait after proxy failure #91

Open rupran opened 2 weeks ago

rupran commented 2 weeks ago

System information:

Modifications to existing code

To test the operability of the library under higher load, I did the following changes to the source code:

Error

The client application hangs, sometimes after a few messages, sometimes after around 300-400 received notification messages. At the time of the hang, the proxy prints the following message:

*** Error: Failed to send buffer Resource temporarily unavailable at line 96 in msgq_tport.cpp:sendBuffer

From my understanding, this indicates that the message queue used to transport the notification from the proxy to the client has run full, which in turn leads to the client never receiving an answer to the liveness check. In any case, this should not lead to an indefinite hang in the client application/library, so maybe the check_proxy_liveness function might require a rework to accommodate this scenario.

Backtrace of the jcl_test application

(gdb) thread apply all bt

Thread 2 (Thread 0x7ffff7a366c0 (LWP 1546241) "jclk_test"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007ffff7b1c603 in sys_futex (val3=27617742, uaddr2=0x7ffff7fc2c00 <JClkLibClient::ClientConnectMessage::cv_mtx>, utime=0x0, val=1, op=<optimized out>, uaddr=0x7ffff7fc2c80 <JClkLibClient::ClientConnectMessage::cv+64>) at /home/andreas/librtpi/src/pi_futex.h:29
#2  futex_cmp_requeue_pi (mutex=0x7ffff7fc2c00 <JClkLibClient::ClientConnectMessage::cv_mtx>, nr_requeue=0, val=27617742, cond=0x7ffff7fc2c40 <JClkLibClient::ClientConnectMessage::cv>) at /home/andreas/librtpi/src/pi_futex.h:89
#3  pi_cond_signal (cond=0x7ffff7fc2c40 <JClkLibClient::ClientConnectMessage::cv>, mutex=0x7ffff7fc2c00 <JClkLibClient::ClientConnectMessage::cv_mtx>) at pi_cond.c:141
#4  0x00007ffff7f97770 in rtpi::condition_variable::notify_one (this=0x7ffff7fc2c40 <JClkLibClient::ClientConnectMessage::cv>, lock=...) at /usr/local/include/rtpi/condition_variable.hpp:56
#5  0x00007ffff7f97540 in JClkLibClient::ClientConnectMessage::processMessage (this=0x7ffe20eca940, LxContext=..., TxContext=@0x7ffff7a357e0: 0x11) at connect_msg.cpp:104
#6  0x00007ffff7f9033b in JClkLibClient::ClientMessageQueueListenerContext::processMessage (this=0x555555570610, bmsg=0x7ffe20ecab50, txcontext=@0x7ffff7a357e0: 0x11) at msgq_tport.cpp:48
#7  0x00007ffff7f9d9a4 in JClkLibCommon::Transport::processMessage (context=...) at transport.cpp:89
#8  0x00007ffff7fa5864 in JClkLibCommon::MessageQueue::MqListenerWork (mqListenerContext=0x555555570620) at msgq_tport.cpp:88
#9  0x00007ffff7f937a2 in std::__invoke_impl<bool, bool (*&)(JClkLibCommon::TransportContext*), JClkLibCommon::TransportListenerContext*> (__f=@0x7ffff7a35d70: 0x7ffff7fa5300 <JClkLibCommon::MessageQueue::MqListenerWork(JClkLibCommon::TransportContext*)>) at /usr/include/c++/12/bits/invoke.h:61
#10 0x00007ffff7f935bd in std::__invoke_r<bool, bool (*&)(JClkLibCommon::TransportContext*), JClkLibCommon::TransportListenerContext*> (__fn=@0x7ffff7a35d70: 0x7ffff7fa5300 <JClkLibCommon::MessageQueue::MqListenerWork(JClkLibCommon::TransportContext*)>) at /usr/include/c++/12/bits/invoke.h:114
#11 0x00007ffff7f9338c in std::_Function_handler<bool (JClkLibCommon::TransportListenerContext*), bool (*)(JClkLibCommon::TransportContext*)>::_M_invoke(std::_Any_data const&, JClkLibCommon::TransportListenerContext*&&) (__functor=..., __args#0=@0x7ffff7a35bb0: 0x555555570620) at /usr/include/c++/12/bits/std_function.h:290
#12 0x00007ffff7f9f5e5 in std::function<bool (JClkLibCommon::TransportListenerContext*)>::operator()(JClkLibCommon::TransportListenerContext*) const (this=0x7ffff7a35d70, __args#0=0x555555570620) at /usr/include/c++/12/bits/std_function.h:591
#13 0x00007ffff7f9cfde in JClkLibCommon::Transport::dispatchLoop(std::promise<bool>, std::shared_ptr<std::atomic<bool> >, std::pair<std::function<bool (JClkLibCommon::TransportListenerContext*)>, std::unique_ptr<JClkLibCommon::TransportListenerContext, std::default_delete<JClkLibCommon::TransportListenerContext> > >) (promise=..., exitVal=std::shared_ptr<std::atomic<bool>> (use count 2, weak count 0) = {...}, work={...}) at transport.cpp:39
#14 0x00007ffff7fa506d in std::__invoke_impl<void, void (*)(std::promise<bool>, std::shared_ptr<std::atomic<bool> >, std::pair<std::function<bool (JClkLibCommon::TransportListenerContext*)>, std::unique_ptr<JClkLibCommon::TransportListenerContext, std::default_delete<JClkLibCommon::TransportListenerContext> > >), std::promise<bool>, std::shared_ptr<std::atomic<bool> >, std::pair<std::function<bool (JClkLibCommon::TransportListenerContext*)>, std::unique_ptr<JClkLibCommon::TransportListenerContext, std::default_delete<JClkLibCommon::TransportListenerContext> > > >(std::__invoke_other, void (*&&)(std::promise<bool>, std::shared_ptr<std::atomic<bool> >, std::pair<std::function<bool (JClkLibCommon::TransportListenerContext*)>, std::unique_ptr<JClkLibCommon::TransportListenerContext, std::default_delete<JClkLibCommon::TransportListenerContext> > >), std::promise<bool>&&, std::shared_ptr<std::atomic<bool> >&&, std::pair<std::function<bool (JClkLibCommon::TransportListenerContext*)>, std::unique_ptr<JClkLibCommon::TransportListenerContext, std::default_delete<JClkLibCommon::TransportListenerContext> > >&&) (__f=@0x555555571708: 0x7ffff7f9ce86 <JClkLibCommon::Transport::dispatchLoop(std::promise<bool>, std::shared_ptr<std::atomic<bool> >, std::pair<std::function<bool (JClkLibCommon::TransportListenerContext*)>, std::unique_ptr<JClkLibCommon::TransportListenerContext, std::default_delete<JClkLibCommon::TransportListenerContext> > >)>) at /usr/include/c++/12/bits/invoke.h:61
#15 0x00007ffff7fa4f38 in std::__invoke<void (*)(std::promise<bool>, std::shared_ptr<std::atomic<bool> >, std::pair<std::function<bool (JClkLibCommon::TransportListenerContext*)>, std::unique_ptr<JClkLibCommon::TransportListenerContext, std::default_delete<JClkLibCommon::TransportListenerContext> > >), std::promise<bool>, std::shared_ptr<std::atomic<bool> >, std::pair<std::function<bool (JClkLibCommon::TransportListenerContext*)>, std::unique_ptr<JClkLibCommon::TransportListenerContext, std::default_delete<JClkLibCommon::TransportListenerContext> > > >(void (*&&)(std::promise<bool>, std::shared_ptr<std::atomic<bool> >, std::pair<std::function<bool (JClkLibCommon::TransportListenerContext*)>, std::unique_ptr<JClkLibCommon::TransportListenerContext, std::default_delete<JClkLibCommon::TransportListenerContext> > >), std::promise<bool>&&, std::shared_ptr<std::atomic<bool> >&&, std::pair<std::function<bool (JClkLibCommon::TransportListenerContext*)>, std::unique_ptr<JClkLibCommon::TransportListenerContext, std::default_delete<JClkLibCommon::TransportListenerContext> > >&&) (__fn=@0x555555571708: 0x7ffff7f9ce86 <JClkLibCommon::Transport::dispatchLoop(std::promise<bool>, std::shared_ptr<std::atomic<bool> >, std::pair<std::function<bool (JClkLibCommon::TransportListenerContext*)>, std::unique_ptr<JClkLibCommon::TransportListenerContext, std::default_delete<JClkLibCommon::TransportListenerContext> > >)>) at /usr/include/c++/12/bits/invoke.h:96
#16 0x00007ffff7fa4e2f in std::thread::_Invoker<std::tuple<void (*)(std::promise<bool>, std::shared_ptr<std::atomic<bool> >, std::pair<std::function<bool (JClkLibCommon::TransportListenerContext*)>, std::unique_ptr<JClkLibCommon::TransportListenerContext, std::default_delete<JClkLibCommon::TransportListenerContext> > >), std::promise<bool>, std::shared_ptr<std::atomic<bool> >, std::pair<std::function<bool (JClkLibCommon::TransportListenerContext*)>, std::unique_ptr<JClkLibCommon::TransportListenerContext, std::default_delete<JClkLibCommon::TransportListenerContext> > > > >::_M_invoke<0ul, 1ul, 2ul, 3ul>(std::_Index_tuple<0ul, 1ul, 2ul, 3ul>) (this=0x5555555716b8) at /usr/include/c++/12/bits/std_thread.h:252
#17 0x00007ffff7fa4d76 in std::thread::_Invoker<std::tuple<void (*)(std::promise<bool>, std::shared_ptr<std::atomic<bool> >, std::pair<std::function<bool (JClkLibCommon::TransportListenerContext*)>, std::unique_ptr<JClkLibCommon::TransportListenerContext, std::default_delete<JClkLibCommon::TransportListenerContext> > >), std::promise<bool>, std::shared_ptr<std::atomic<bool> >, std::pair<std::function<bool (JClkLibCommon::TransportListenerContext*)>, std::unique_ptr<JClkLibCommon::TransportListenerContext, std::default_delete<JClkLibCommon::TransportListenerContext> > > > >::operator()() (this=0x5555555716b8) at /usr/include/c++/12/bits/std_thread.h:259
#18 0x00007ffff7fa4c02 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(std::promise<bool>, std::shared_ptr<std::atomic<bool> >, std::pair<std::function<bool (JClkLibCommon::TransportListenerContext*)>, std::unique_ptr<JClkLibCommon::TransportListenerContext, std::default_delete<JClkLibCommon::TransportListenerContext> > >), std::promise<bool>, std::shared_ptr<std::atomic<bool> >, std::pair<std::function<bool (JClkLibCommon::TransportListenerContext*)>, std::unique_ptr<JClkLibCommon::TransportListenerContext, std::default_delete<JClkLibCommon::TransportListenerContext> > > > > >::_M_run() (this=0x5555555716b0) at /usr/include/c++/12/bits/std_thread.h:210
#19 0x00007ffff7df54a3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#20 0x00007ffff7ba9134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#21 0x00007ffff7c297dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7ffff7a37740 (LWP 1546234) "jclk_test"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007ffff7b1c2c3 in sys_futex (val3=0, uaddr2=0x0, utime=0x0, val=0, op=<optimized out>, uaddr=0x7ffff7fc2c00 <JClkLibClient::ClientConnectMessage::cv_mtx>) at /home/andreas/librtpi/src/pi_futex.h:29
#2  futex_lock_pi (mutex=0x7ffff7fc2c00 <JClkLibClient::ClientConnectMessage::cv_mtx>) at /home/andreas/librtpi/src/pi_futex.h:38
#3  pi_mutex_lock (mutex=mutex@entry=0x7ffff7fc2c00 <JClkLibClient::ClientConnectMessage::cv_mtx>) at pi_mutex.c:63
#4  0x00007ffff7b1c52e in pi_cond_timedwait (cond=0x7ffff7fc2c40 <JClkLibClient::ClientConnectMessage::cv>, mutex=0x7ffff7fc2c00 <JClkLibClient::ClientConnectMessage::cv_mtx>, abstime=0x7fffffffdb00) at pi_cond.c:89
#5  0x00007ffff7f8f5f0 in rtpi::condition_variable::wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7ffff7fc2c40 <JClkLibClient::ClientConnectMessage::cv>, lock=..., timeout_time=...) at /usr/local/include/rtpi/condition_variable.hpp:222
#6  0x00007ffff7f8ed33 in rtpi::condition_variable::wait_until<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7ffff7fc2c40 <JClkLibClient::ClientConnectMessage::cv>, lock=..., timeout_time=...) at /usr/local/include/rtpi/condition_variable.hpp:165
#7  0x00007ffff7f8d439 in check_proxy_liveness (appClientState=...) at jclk_init.cpp:193
#8  0x00007ffff7f8d8cc in JClkLibClient::JClkLibClientApi::jcl_status_wait (this=0x55555556feb0, timeout=0, jcl_state_ref=..., eventCountRef=...) at jclk_init.cpp:231
#9  0x00005555555581b1 in main (argc=5, argv=0x7fffffffe4a8) at jclk_test.cpp:232
yoongsiang2 commented 2 weeks ago

@rupran thanks for the detail. We are trying to reproduce the issue. With timeout = 0s and without sleep, we do see the failure in cmAPI->jcl_status_wait() where "[jclklib][%.3f] Terminating: lost connection to jclklib Proxy" is printed out on client side. However, we only observe the client is terminated, not hang. Can you explain more on your "hang" situation, whether it means client termination or client is still running but not responding?

btw, we agree that current check_proxy_liveness mechanism is adding burden to message queue. We will come out with another better and simpler mechanism to check whether proxy is alive or not.

rupran commented 2 weeks ago

After some more testing, the observed hang on our system might be related to a tuning issue of the system itself and not a problem with the library. With fixed tuning settings, I can also only replicate the issue as you described, leading to an abort with the following message:

[jclklib][344108.056] Terminating: lost connection to jclklib Proxy
terminate called without an active exception
./run_jclk_test.sh: line 14: 55235 Aborted                 LD_LIBRARY_PATH=$TEST_PATH chrt -f 99 $SCRIPT_PATH/jclk_test "$@"

Simplifying the check sounds like a good plan, thanks!

yoongsiang2 commented 14 hours ago

@rupran fyi, the improved version of liveness check design is merged in latest main branch. We tried on our side and does not hit into termination for hours of run. Can you help to confirm whether your side got similar observation? thanks in advance.