Open jiridanek opened 2 years ago
Issue happens even when running test single-threaded, https://github.com/jiridanek/skupper-router/runs/5428006739?check_suite_focus=true#step:27:5701
I've tried investigating the test. There seem to be multiple possible issues that can cause the test to fail, and getting over one exposes another.
The logs quoted in this message are from a PR with extra logging added. E.g. to log full stracktrace of the exception, and such.
https://github.com/jiridanek/skupper-router/runs/5428035645?check_suite_focus=true#step:9:7800
This gets raised from line 739 doing
I do not see how it is possible for SyncRequestResponse#call
to return None (it should always either return received Message, or raise a TimeoutException, as far as I can tell), but nevertheless, that is what seems to be happening
Getting a None might be possible only if the management client gets disconnected after it or the other peer (the router) closed connection, while management client waits for its reply, given this code in Proton
if self.disconnected and not self._is_closed():
raise ConnectionException(
"Connection %s disconnected: %s" % (self.url, self.disconnected))
That cannot be possible, however, because the previous test has failed (https://github.com/jiridanek/skupper-router/runs/5428035645?check_suite_focus=true#step:9:7749) less than a second prior; there was not enough time for a timeout to occur.
There is a timeout-expired message in INT.A.log, lines 8718-8724, that looks like idle timeout from management client! https://gist.github.com/jiridanek/109918a935e0325ec01bd193ed488984#file-int-a-log
2022-03-04 20:57:27.086512 +0000 ROUTER_CORE (trace) Core action 'link_flow' (../src/router_core/router_core_thread.c:235)
2022-03-04 20:57:27.186954 +0000 PROTOCOL (trace) [C6]:FRAME: 0 -> @close(24) [error=@error(29) [condition=:"amqp:resource-limit-exceeded", description="local-idle-timeout expired"]] (../src/server.c:110)
2022-03-04 20:57:27.187026 +0000 SERVER (info) [C6] Connection from 127.0.0.1:32860 (to 0.0.0.0:20514) failed: amqp:resource-limit-exceeded local-idle-timeout expired (../src/server.c:1046)
2022-03-04 20:57:27.187091 +0000 PROTOCOL (trace) [C6]:FRAME: -> EOS (../src/server.c:110)
2022-03-04 20:57:27.187128 +0000 CONTAINER (debug) Aborting link '76c34267-770e-4fe2-b922-01acbd802824-$management' due to parent connection end (../src/container.c:325)
2022-03-04 20:57:27.187189 +0000 CONTAINER (debug) Aborting link '76c34267-770e-4fe2-b922-01acbd802824-df0fbdae-648a-4b11-a357-f310eb8df2e2' due to parent connection end (../src/container.c:325)
2022-03-04 20:57:27.187290 +0000 POLICY (debug) [C6] Connection '127.0.0.1:32860' closed with resources n_sessions=1, n_senders=1, n_receivers=1, sessions_denied=0, senders_denied=0, receivers_denied=0, max_message_size_denied:0, nConnections= 3. (../src/policy.c:319)
https://github.com/jiridanek/skupper-router/runs/5428035645?check_suite_focus=true#step:9:5382
The test code assumes one reject, or up-to-one reject, depending on the specific test
I tried adjusting this to tolerate two rejects, because that is what I am often seeing when running tests with -j100. Then, I got
23: DEBUG root:system_test.py:1439 2022-03-05 11:46:29.951349 OversizeMulticastTransferTest - multicast/e80: FAIL: N closed events in log file did not increment by 1. oBefore: 1, oAfter: 1, iBefore:2, iAfter:4
(from https://github.com/jiridanek/skupper-router/runs/5432676676?check_suite_focus=true#step:9:4004)
[before:]
2022-03-05 11:46:14.293825 +0000 PROTOCOL (trace) [C1]:FRAME: 0 <- @close(24) [error=@error(29) [condition=:"amqp:connection:forced", description="Message size exceeded"]] (../src/server.c:110)
2022-03-05 11:46:18.765349 +0000 PROTOCOL (trace) [C6]:FRAME: 0 <- @close(24) [error=@error(29) [condition=:"amqp:connection:forced", description="Message size exceeded"]] (../src/server.c:110)
2022-03-05 11:46:22.783473 +0000 PROTOCOL (trace) [C9]:FRAME: 0 -> @close(24) [error=@error(29) [condition=:"amqp:connection:forced", description="Message size exceeded"]] (../src/server.c:110)
[after:]
2022-03-05 11:46:26.544117 +0000 PROTOCOL (trace) [C8]:FRAME: 0 <- @close(24) [error=@error(29) [condition=:"amqp:connection:forced", description="Message size exceeded"]] (../src/server.c:110)
2022-03-05 11:46:28.884387 +0000 PROTOCOL (trace) [C12]:FRAME: 0 <- @close(24) [error=@error(29) [condition=:"amqp:connection:forced", description="Message size exceeded"]] (../src/server.c:110)
The test expected only one such line in the "After:", that is, log lines added between start of test and the final evaluation.
https://github.com/jiridanek/skupper-router/runs/5433073293?check_suite_focus=true#step:9:7705
Sometimes, the messages just don't go through.
https://github.com/jiridanek/skupper-router/runs/5426622599?check_suite_focus=true#step:9:5364