skupperproject / skupper-router

An application-layer router for Skupper networks
https://skupper.io
Apache License 2.0
14 stars 18 forks source link

system_tests_two_routers.TwoRouterExtensionStateTest testMethod=test_03_multicast failing in CI #1573

Open ganeshmurthy opened 1 month ago

ganeshmurthy commented 1 month ago
29: self = <tests.system_tests_two_routers.TwoRouterExtensionStateTest testMethod=test_03_multicast>
29: 
29:     def test_03_multicast(self):
29:         """
29:         Verify that disposition state set by the publisher is available to all
29:         consumers
29:         """
29:         rxs = [MyExtendedReceiver(self.RouterA.addresses[0],
29:                                   "multicast/thingy")
29:                for x in range(3)]
29:         self.RouterA.wait_address("multicast/thingy", subscribers=3)
29:         sleep(0.5)  # let subscribers grant credit
29:         tx = MyExtendedSender(self.RouterB.addresses[0],
29:                               "multicast/thingy")
29:         tx.wait()
29:     
29:         # DISPATCH-1705: only one of the receivers gets the data, but all
29:         # should get the state
29:     
29:         ext_data = None
29:         for rx in rxs:
29:             rx.stop()
29:             try:
29:                 while True:
29:                     dispo = rx.remote_states.pop()
29: >                   self.assertEqual(999, dispo[0])
29: E                   AssertionError: 999 != 0
29: 
29: /home/runner/work/skupper-router/skupper-router/skupper-router/tests/system_tests_two_routers.py:1756: AssertionError

https://github.com/skupperproject/skupper-router/actions/runs/9977439739/job/27572204457#step:10:2608

kgiusti commented 1 month ago

We should verify with the Proton Illuminati that this test is not doing something unsupported by proton.

What the test is trying to verify is that when an AMQP sender sets the delivery's local Disposition state and associated data that state/data should be duplicated across the multicast receivers. The test routine test_03_multicast is here: https://github.com/skupperproject/skupper-router/blob/266080f2e25a879298d98d8b695bcd0ead685b52/tests/system_tests_two_routers.py#L1733

The test creates an AMQP sender derived indirectly from Proton's MessagingHandler class. The sender creates the outgoing delivery and sets the state here: https://github.com/skupperproject/skupper-router/blob/266080f2e25a879298d98d8b695bcd0ead685b52/tests/system_tests_two_routers.py#L1771

The multicast AMQP receivers - also indirectly derived from Proton's MessagingHandler class - check the received delivery for the expected data here: https://github.com/skupperproject/skupper-router/blob/266080f2e25a879298d98d8b695bcd0ead685b52/tests/system_tests_two_routers.py#L1790

When the failure occurs it appears that the first AMQP receiver does in fact see the correct dispo stuff, but the other receivers do not:

1: RECEIVED DISPO=999 DATA=[1, 2, 3]
2: RECEIVED DISPO=0 DATA=None  (FAILURE)

IIRC this test was added to verify a fix for a bug reported by someone using an AMQP broker who was trying to propagate dispo state.

kgiusti commented 1 month ago

@astitcher can you take a look at the above and let me know if this behavior is expected? The test started failing when this proton commit landed: https://github.com/apache/qpid-proton/commit/e4e62f49e05a9f26b5c86f1eb829f2180604bac1

astitcher commented 1 month ago

@kgiusti can you point me to the code in the router that duplicates messages across the multicast send path - My guess is the new code screws up this duplication if there is disposition data set.

kgiusti commented 1 month ago

@astitcher Sure but it's gonna take a bit to page that logic back into my 640K-should-be-enough-for-anybody grey matter.

FWIW (which is probably not much) here's the protocol logs that show one ingress message arriving with state "999[1,2,3]" then the 3 egress messages. Note: in the trace the 999 value is in hex 0x3e7

The interesting bit is that all three egress messages do contain some sort of state. The first has the expected state. What's interesting is that the last two also seem to contain at least the 999 dispo value. The test is failing on those last two because when it looks at the dispo state it is zero.

INGRESS
[C3]:FRAME: 5 <- @transfer(20) [handle=0x1, delivery-id=0x2, delivery-tag=b"\x10\x00\x00\x00\x00\x00\x00\x00", message-format=0x0, state=@0x3e7[1, 2, 3]]

MCAST OUT 1:
[C7]:FRAME: 0 -> @transfer(20) [handle=0x0, delivery-id=0x0, delivery-tag=b"\x18\x00\x00\x00\x00\x00\x00\x00", message-format=0x0, state=@0x3e7[1, 2, 3]]

MCAST OUT 2:
[C9]:FRAME: 0 -> @transfer(20) [handle=0x0, delivery-id=0x0, delivery-tag=b"\x1a\x00\x00\x00\x00\x00\x00\x00", message-format=0x0, state=@0x3e7!!]<8!=7>

MCAST OUT 3
[C8]:FRAME: 0 -> @transfer(20) [handle=0x0, delivery-id=0x0, delivery-tag=b"\x19\x00\x00\x00\x00\x00\x00\x00", message-format=0x0, state=@0x3e7!!]<8!=7>

Is there any significance to the:

<8!=7>

at the end of the transfer decode? Never seen that before....

astitcher commented 1 month ago

That 8!=7 indicates a protocol error! So that might be relevant!! It indicates that the protocol said there were 8 bytes in the value but the overall frame only had enough bytes for 7 bytes

astitcher commented 1 month ago

It might (hopefully) be a stupid off by one error in the delivery disposition data logic

kgiusti commented 1 month ago

Ok that's more promising than what I'm looking at. FWIW here is the code where the dispo (999) is moved to the outgoing delivery:

https://github.com/skupperproject/skupper-router/blob/266080f2e25a879298d98d8b695bcd0ead685b52/src/router_core/delivery.c#L1401

And here's where that dispo is set in the outgoing proton delivery: https://github.com/skupperproject/skupper-router/blob/266080f2e25a879298d98d8b695bcd0ead685b52/src/adaptors/amqp/amqp_adaptor.c#L2103

I've traced it via the debugger. Those codepaths are pretty generic. I doubt if the propagation of the dispo code is causing the issue.

Each of those three outgoing messages have their own dedicated proton delivery instance, it's funny how their both being corrupted in the same way.

astitcher commented 1 month ago

Ooh, what's the difference between qdr_delivery_t and pn_delivery_t? Is there duplicate processing there?

kgiusti commented 1 month ago

qdr_delivery_t is the router's abstract delivery state. Its data is read/written from its corresponding pn_delivery_t (one to one relationship). The qdr_delivery_t is visible to the core process and the pn_delivery_t is not (because we don't want the core touching the pn_delivery_t - that is only allowed on the proper proactor thread).

pn_delivery_t is the proton C representation of the AMQP delivery.

kgiusti commented 1 month ago

Possible clue: if I change the test code to use a proper delivery outcome (range 0x23-0x27) the the dispo propagates fine. If I use any other value (<0x23 or >0x27) then the router fails to propagate that value and we see the 8!=7 in the protocol log.

The router is using the pn_delivery_update() API to set the dispo value.

astitcher commented 1 month ago

Yeah I think gives me a great place to start digging in the new proton disposition code. Thx

astitcher commented 1 month ago

I wonder if there is any way I can get the raw protocol bytes to go along with the frame decodes - that might give me a big clue about what went wrong.

astitcher commented 1 month ago

@kgiusti @ganeshmurthy I've now merged a fix for this to qpid-proton main branch. Let me know if this issue is now resolved.

ganeshmurthy commented 1 month ago

@astitcher @kgiusti I raised a PR with the test uncommented - https://github.com/skupperproject/skupper-router/pull/1587/files

I saw some new failures on the same test, not sure if this is related to the proton commit. Here is the failure - https://github.com/skupperproject/skupper-router/actions/runs/10116788047/job/27980377966?pr=1587#step:34:6818

ganeshmurthy commented 1 month ago

@astitcher @kgiusti I see the same test failure even on the main branch where the multicast test is commented out. I see the test failure in <tests.system_tests_two_routers.TwoRouterExtensionStateTest testMethod=test_02_closest>

Here is the same failure on main - https://github.com/skupperproject/skupper-router/actions/runs/10149317776/job/28064340945#step:10:5786