skupperproject / skupper-router

An application-layer router for Skupper networks
https://skupper.io
Apache License 2.0

DNS name resolution lookup misses can block the router timer by several seconds. #1451

Open kgiusti opened 5 months ago

kgiusti commented 5 months ago

Attempting to establish a TCP connection can hang the router timer for several seconds if the DNS resolution fails.
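To get a rough feel for how long a failed lookup can take on a given host, independent of the router, one can time a single blocking resolution. This is only a sketch: the helper name `time_lookup` and the example hostname are illustrative assumptions, and it does not exercise the router or Proton code paths.

```python
import socket
import time


def time_lookup(host, port, resolver=socket.getaddrinfo):
    """Time one blocking name lookup; return (elapsed_seconds, resolved).

    `resolver` defaults to socket.getaddrinfo and is injectable so the
    helper can be exercised without touching the network.
    """
    start = time.monotonic()
    try:
        resolver(host, port, type=socket.SOCK_STREAM)
        resolved = True
    except socket.gaierror:
        # Lookup miss: the name could not be resolved.
        resolved = False
    return time.monotonic() - start, resolved
```

For example, `time_lookup("no-such-host.invalid", 5672)` reports how long a miss takes with the local resolver configuration (the hostname and port here are hypothetical). Any thread that makes such a call synchronously is stalled for the whole elapsed time.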

Reproducer: run the system_tests_handle_failover test, then grep router A's log file (A.log) for the "process_tick" log message. This message should be logged once per second. Check the timestamps on the matching lines: the log events do not occur at one-second intervals as expected.

Example:

$ ctest -V -R system_tests_handle_failover
...
$ grep "process_tick"  ./tests/system_test.dir/system_tests_handle_failover/FailoverTest/setUpClass/A.log
...
2024-03-22 10:39:50.171629 -0400 ROUTER_CORE (debug) Core action 'process_tick' (/home/kgiusti/work/skupper/skupper-router/src/router_core/router_core_thread.c:253)
2024-03-22 10:39:51.172545 -0400 ROUTER_CORE (debug) Core action 'process_tick' (/home/kgiusti/work/skupper/skupper-router/src/router_core/router_core_thread.c:253)
2024-03-22 10:39:52.172506 -0400 ROUTER_CORE (debug) Core action 'process_tick' (/home/kgiusti/work/skupper/skupper-router/src/router_core/router_core_thread.c:253)
2024-03-22 **10:40:04.417526** -0400 ROUTER_CORE (debug) Core action 'process_tick' (/home/kgiusti/work/skupper/skupper-router/src/router_core/router_core_thread.c:253)
2024-03-22 **10:40:12.169492** -0400 ROUTER_CORE (debug) Core action 'process_tick' (/home/kgiusti/work/skupper/skupper-router/src/router_core/router_core_thread.c:253)
2024-03-22 **10:40:16.169809** -0400 ROUTER_CORE (debug) Core action 'process_tick' (/home/kgiusti/work/skupper/skupper-router/src/router_core/router_core_thread.c:253)
2024-03-22 10:40:17.170732 -0400 ROUTER_CORE (debug) Core action 'process_tick' (/home/kgiusti/work/skupper/skupper-router/src/router_core/router_core_thread.c:253)

Stalls like these could delay other timer events long enough to make the router unstable.
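Spotting the irregular intervals by eye is tedious on a long log, so the check can be scripted. The following is a minimal sketch, assuming the timestamp layout shown in the excerpt above; the function name `find_tick_gaps` and the 1.5 second threshold are illustrative choices, not part of the router or its test suite. The UTC offset field is ignored, which is only safe when all lines come from the same log file.

```python
import re
from datetime import datetime

# Leading "YYYY-MM-DD HH:MM:SS.ffffff" timestamp. Optional "**" markers
# around the time are tolerated because the excerpt above uses them.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+\**(\d{2}:\d{2}:\d{2}\.\d{6})\**")


def find_tick_gaps(lines, action="process_tick", max_gap=1.5):
    """Return (previous, current, gap_seconds) for each interval between
    consecutive `action` log lines that exceeds `max_gap` seconds."""
    gaps = []
    prev = None
    for line in lines:
        if action not in line:
            continue
        m = TS_RE.match(line)
        if m is None:
            continue  # skip lines without a recognizable timestamp
        ts = datetime.strptime(f"{m.group(1)} {m.group(2)}",
                               "%Y-%m-%d %H:%M:%S.%f")
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > max_gap:
                gaps.append((prev, ts, gap))
        prev = ts
    return gaps
```

Passing an open file handle for A.log as `lines` returns one tuple per stalled interval. On the excerpt above it flags the three intervals ending at 10:40:04, 10:40:12 and 10:40:16.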

kgiusti commented 4 months ago

The following proton patch appears to fix this issue: https://issues.apache.org/jira/browse/PROTON-2812

I'm going to mark this issue as blocked on that JIRA.