Open rupran opened 2 weeks ago
@rupran thanks for the detail. We are trying to reproduce the issue. With timeout = 0s and without sleep, we do see the failure in cmAPI->jcl_status_wait(), where "[jclklib][%.3f] Terminating: lost connection to jclklib Proxy" is printed on the client side. However, we only observe that the client is terminated; it does not hang. Can you explain your "hang" situation in more detail: does it mean the client terminates, or that the client is still running but not responding?
btw, we agree that the current check_proxy_liveness mechanism adds load to the message queue. We will come up with a better and simpler mechanism to check whether the proxy is alive.
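For illustration, one way a liveness check can avoid blocking indefinitely when the reply never arrives (for example because a full queue dropped it) is to bound the wait with a deadline. The sketch below is not jclklib code: BoundedQueue, kLivenessReply, check_liveness, and the injectable now clock are all hypothetical names, and the in-memory queue only stands in for the real transport.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>
#include <functional>

// Hypothetical bounded queue standing in for the proxy-to-client transport.
// try_push never blocks: it reports failure when the queue is full.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    bool try_push(int msg) {
        if (items_.size() >= capacity_) return false;  // full: message is not queued
        items_.push_back(msg);
        return true;
    }

    bool try_pop(int& out) {
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop_front();
        return true;
    }

private:
    std::size_t capacity_;
    std::deque<int> items_;
};

enum class Liveness { Alive, Timeout };

// Hypothetical message id marking a liveness reply from the proxy.
constexpr int kLivenessReply = -1;

// Drains the reply queue until a liveness reply shows up or the time budget
// is spent. The clock is passed in so the logic can be tested without waiting.
// This loop spins for simplicity; it is only meant to show the bounded wait.
Liveness check_liveness(BoundedQueue& replies,
                        std::chrono::milliseconds budget,
                        const std::function<std::chrono::milliseconds()>& now) {
    const std::chrono::milliseconds deadline = now() + budget;
    while (now() < deadline) {
        int msg = 0;
        while (replies.try_pop(msg)) {
            if (msg == kLivenessReply) return Liveness::Alive;
            // Any other notification would be dispatched to its handler here.
        }
    }
    return Liveness::Timeout;  // the caller decides: retry, reconnect, or abort
}
```

With a capacity of 2 and two pending notifications, try_push of the reply fails, and check_liveness then returns Liveness::Timeout once the budget is spent instead of waiting forever. A real implementation would block in a timed receive on the transport rather than spin; if the transport is a POSIX message queue, mq_timedreceive provides that.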
After some more testing, the observed hang on our system might be related to a tuning issue of the system itself and not a problem with the library. With fixed tuning settings, I can also only replicate the issue as you described, leading to an abort with the following message:
[jclklib][344108.056] Terminating: lost connection to jclklib Proxy
terminate called without an active exception
./run_jclk_test.sh: line 14: 55235 Aborted LD_LIBRARY_PATH=$TEST_PATH chrt -f 99 $SCRIPT_PATH/jclk_test "$@"
Simplifying the check sounds like a good plan, thanks!
@rupran fyi, the improved liveness check design is merged in the latest main branch. We tried it on our side and did not hit a termination over several hours of running. Can you help confirm whether you see the same on your side? Thanks in advance.
System information:
- jcl_test binary started with -l -5 -u 5 limits to test gmOffset events at synchronization precision boundary

Modifications to existing code
To test the operability of the library under higher load, I made the following changes to the source code:
- Removed the sleep(idle_time) calls in the main loop in jclklib/sample/jclk_test.cpp
- Changed the timeout of cmAPI->jcl_status_wait() from timeout to 0 (to do only one check and return immediately) in the main loop

Error
The client application hangs, sometimes after a few messages, sometimes after around 300-400 received notification messages. At the time of the hang, the proxy prints the following message:
From my understanding, this indicates that the message queue used to transport the notification from the proxy to the client has run full, which in turn leads to the client never receiving an answer to the liveness check. In any case, this should not lead to an indefinite hang in the client application/library, so the check_proxy_liveness function might require a rework to accommodate this scenario.

Backtrace of the jcl_test application