skupperproject / skupper-router

An application-layer router for Skupper networks
https://skupper.io
Apache License 2.0
14 stars 18 forks source link

Test system_tests_distribution does not run under rr #1042

Open jiridanek opened 1 year ago

jiridanek commented 1 year ago
  1. Use current git tip of skupper-router and qpid-proton.

  2. Configure skrouterd with this option (adjust the path for your checkout, maybe consider adding --chaos to the rr record command line)

    "-DQDROUTERD_RUNNER=/home/jdanek/repos/skupper-router/scripts/sigforwarder.py /usr/bin/rr record --print-trace-dir=1 --continue-through-signal=15"

as described in https://github.com/skupperproject/skupper-router/blob/main/docs/notes/debugging.adoc#rr-workflow

  1. Run the system_tests_distribution test and see it fail on router C startup.
cmake \
-G Ninja -S /home/jdanek/repos/skupper-router -B /home/jdanek/repos/skupper-router/cmake-build-relwithdebinfo-rr \
"-DQDROUTERD_RUNNER=/home/jdanek/repos/skupper-router/scripts/sigforwarder.py /usr/bin/rr record --print-trace-dir=1 --continue-through-signal=15" \
-DCMAKE_INSTALL_PREFIX=install \
-DProton_DIR=/home/jdanek/repos/qpid/qpid-proton/build/install/lib64/cmake/Proton \
-DPython_EXECUTABLE=/home/jdanek/repos/skupper-router/.venv/bin/python

cd cmake-build-relwithdebinfo-rr
ninja
ctest -VV -R system_tests_distribution
34: ======================================================================
34: ERROR: setUpClass (system_tests_distribution.DistributionTests)
34: ----------------------------------------------------------------------
34: Traceback (most recent call last):
34:   File "/home/jdanek/repos/skupper-router/tests/system_tests_distribution.py", line 266, in setUpClass
34:     router('C',
34:   File "/home/jdanek/repos/skupper-router/tests/system_tests_distribution.py", line 150, in router
34:     cls.routers.append(cls.tester.qdrouterd(name, config, wait=True))
34:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
34:   File "/home/jdanek/repos/skupper-router/tests/system_test.py", line 1213, in qdrouterd
34:     return self.cleanup(Qdrouterd(*args, **kwargs))
34:                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
34:   File "/home/jdanek/repos/skupper-router/tests/system_test.py", line 690, in __init__
34:     self.wait_ready()
34:   File "/home/jdanek/repos/skupper-router/tests/system_test.py", line 998, in wait_ready
34:     self.wait_connectors(**retry_kwargs)
34:   File "/home/jdanek/repos/skupper-router/tests/system_test.py", line 962, in wait_connectors
34:     assert retry(lambda c=c: self.is_connected(port=c['port'], host=c.get('host') if c.get('host') else self.get_host(c.get('socketAddressFamily'))),
34: AssertionError: Port not connected 28520

When looking into the logs, I see that router C attempted to create the connections to router B, but router B did not answer

% grep '\[C[123]\]' < /home/jdanek/repos/skupper-router/cmake-build-relwithdebinfo-rrasan/tests/system_test.dir/system_tests_distribution/DistributionTests/setUpClass/C.log
2023-04-02 19:43:20.830464 +0200 SERVER (trace) [C1] Connecting to 127.0.0.1:28520 (/home/jdanek/repos/skupper-router/src/server.c:1195)
2023-04-02 19:43:20.831353 +0200 SERVER (trace) [C2] Connecting to 127.0.0.1:28520 (/home/jdanek/repos/skupper-router/src/server.c:1195)
2023-04-02 19:43:20.831802 +0200 SERVER (trace) [C3] Connecting to 127.0.0.1:28520 (/home/jdanek/repos/skupper-router/src/server.c:1195)
2023-04-02 19:43:20.832692 +0200 PROTOCOL (trace) [C1]:FRAME:   -> SASL (/home/jdanek/repos/skupper-router/src/server.c:111)
2023-04-02 19:43:20.832971 +0200 PROTOCOL (trace) [C2]:FRAME:   -> SASL (/home/jdanek/repos/skupper-router/src/server.c:111)
2023-04-02 19:43:20.833175 +0200 PROTOCOL (trace) [C3]:FRAME:   -> SASL (/home/jdanek/repos/skupper-router/src/server.c:111)

I tried asking in rr issuetracker if they expect any such problems with epoll-based programs and I was told that they are not aware of any, but that the program needs to be able to handle EINTR correctly

Back when I filled that rr issue, I also tried with libuv proactor in Proton, and then I did not have this problem.

@ganeshmurthy, @kgiusti, @astitcher, @cliffjansen do you think this might be worth trying to reproduce, if you see the same issue I do? On my laptop, the system_tests_distribution test gets stuck like this pretty much every time. I tried running the whole testsuite under rr, but (some, about 8 of them) the other tests were failing only intermittently.

jiridanek commented 1 year ago

I tried with the libuv proactor again, with -DPROACTOR=libuv in Proton. Currently it requires removing the check in Proton CMakeLists which does not allow building the tls lib with anything than epoll (Cannot currently build the raw connection TLS library without the epoll proactor and OpenSSL). Commenting this out is ok, if the goal is to pass the system_tests_distribution setUp.

This still seems to fix the router network startup, same as I reported in the rr issue:

34: Test command: /home/jdanek/repos/skupper-router/.venv/bin/python "/home/jdanek/repos/skupper-router/cmake-build-relwithdebinfo-rrasan/tests/run.py" "-m" "unittest" "-v" "system_tests_distribution"
34: Working Directory: /home/jdanek/repos/skupper-router/cmake-build-relwithdebinfo-rrasan/tests
34: Test timeout computed to be: 600
34: test_01_targeted_sender_AC (system_tests_distribution.DistributionTests.test_01_targeted_sender_AC) ... ok
34: test_02_targeted_sender_DC (system_tests_distribution.DistributionTests.test_02_targeted_sender_DC) ... ok
34: test_03_anonymous_sender_AC (system_tests_distribution.DistributionTests.test_03_anonymous_sender_AC) ...
[...]
jiridanek commented 1 year ago

You may notice these errors in the router logs. They are (AFAIK) simply caused by the system_test checking if port is open, and are harmless

2023-04-02 20:32:24.570475 +0200 SERVER (error) [C4] Connection from 127.0.0.1:52974 (to 0.0.0.0:28540) failed: amqp:connection:framing-error No protocol header found (connection aborted) (/home/jdanek/repos/skupper-router/src/server.c:1068)