ganeshmurthy opened 1 month ago
We should verify with the Proton Illuminati that this test is not doing something unsupported by proton.
What the test is trying to verify is that when an AMQP sender sets the delivery's local Disposition state and associated data that state/data should be duplicated across the multicast receivers. The test routine test_03_multicast is here: https://github.com/skupperproject/skupper-router/blob/266080f2e25a879298d98d8b695bcd0ead685b52/tests/system_tests_two_routers.py#L1733
The test creates an AMQP sender derived indirectly from Proton's MessagingHandler class. The sender creates the outgoing delivery and sets the state here: https://github.com/skupperproject/skupper-router/blob/266080f2e25a879298d98d8b695bcd0ead685b52/tests/system_tests_two_routers.py#L1771
The multicast AMQP receivers - also indirectly derived from Proton's MessagingHandler class - check the received delivery for the expected data here: https://github.com/skupperproject/skupper-router/blob/266080f2e25a879298d98d8b695bcd0ead685b52/tests/system_tests_two_routers.py#L1790
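The expected behavior can be sketched in toy form (plain Python, no proton dependency; these class and field names are illustrative stand-ins, not the router's or proton's actual types): the ingress delivery's local state code and data should be copied verbatim onto every egress delivery fanned out to the multicast receivers.

```python
# Toy model of the multicast disposition fan-out the test expects.
# Delivery and multicast() are hypothetical stand-ins, not real router types.

class Delivery:
    """Minimal stand-in for an AMQP delivery's disposition state."""
    def __init__(self, state=0, data=None):
        self.state = state   # numeric disposition/state descriptor code
        self.data = data     # optional data carried with the state

def multicast(ingress, n_receivers):
    """Fan one ingress delivery out to n egress deliveries,
    duplicating the disposition state/data onto each copy."""
    return [Delivery(ingress.state, ingress.data) for _ in range(n_receivers)]

ingress = Delivery(state=999, data=[1, 2, 3])
for i, dlv in enumerate(multicast(ingress, 3), 1):
    print(f"{i}: RECEIVED DISPO={dlv.state} DATA={dlv.data}")
```

With a correct fan-out every receiver sees `DISPO=999 DATA=[1, 2, 3]`; the failure below shows the second and later copies arriving with `DISPO=0 DATA=None` instead.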
When the failure occurs it appears that the first AMQP receiver does in fact see the correct disposition state/data, but the other receivers do not:
1: RECEIVED DISPO=999 DATA=[1, 2, 3]
2: RECEIVED DISPO=0 DATA=None (FAILURE)
IIRC this test was added to verify a fix for a bug reported by someone using an AMQP broker who was trying to propagate dispo state.
@astitcher can you take a look at the above and let me know if this behavior is expected? The test started failing when this proton commit landed: https://github.com/apache/qpid-proton/commit/e4e62f49e05a9f26b5c86f1eb829f2180604bac1
@kgiusti can you point me to the code in the router that duplicates messages across the multicast send path - My guess is the new code screws up this duplication if there is disposition data set.
@astitcher Sure but it's gonna take a bit to page that logic back into my 640K-should-be-enough-for-anybody grey matter.
FWIW (which is probably not much) here are the protocol logs that show one ingress message arriving with state "999[1,2,3]" and then the 3 egress messages. Note: in the trace the 999 value appears in hex as 0x3e7.
The interesting bit is that all three egress messages do contain some sort of state. The first has the expected state, and the last two also seem to carry at least the 999 dispo value, yet the test fails on those two because the dispo state it reads back is zero.
INGRESS
[C3]:FRAME: 5 <- @transfer(20) [handle=0x1, delivery-id=0x2, delivery-tag=b"\x10\x00\x00\x00\x00\x00\x00\x00", message-format=0x0, state=@0x3e7[1, 2, 3]]
MCAST OUT 1:
[C7]:FRAME: 0 -> @transfer(20) [handle=0x0, delivery-id=0x0, delivery-tag=b"\x18\x00\x00\x00\x00\x00\x00\x00", message-format=0x0, state=@0x3e7[1, 2, 3]]
MCAST OUT 2:
[C9]:FRAME: 0 -> @transfer(20) [handle=0x0, delivery-id=0x0, delivery-tag=b"\x1a\x00\x00\x00\x00\x00\x00\x00", message-format=0x0, state=@0x3e7!!]<8!=7>
MCAST OUT 3:
[C8]:FRAME: 0 -> @transfer(20) [handle=0x0, delivery-id=0x0, delivery-tag=b"\x19\x00\x00\x00\x00\x00\x00\x00", message-format=0x0, state=@0x3e7!!]<8!=7>
Is there any significance to the:
<8!=7>
at the end of the transfer decode? Never seen that before....
That 8!=7 indicates a protocol error! So that might be relevant!! It means the encoding claimed the value contained 8 bytes, but the frame only had enough bytes left for 7.
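A minimal sketch of that class of decode error (pure Python, with a hypothetical helper name, not proton's actual decoder): a length-prefixed value whose declared size exceeds the bytes remaining in the frame.

```python
def read_value(frame: bytes, offset: int, declared_len: int) -> bytes:
    """Read a length-prefixed value from a frame body; raise the same
    '<declared!=available>' shape of error seen in the trace when the
    frame runs out of bytes before the declared length is satisfied."""
    available = len(frame) - offset
    if declared_len > available:
        # Same shape as the '8!=7' annotation in the trace above
        raise ValueError(f"{declared_len}!={available}")
    return frame[offset:offset + declared_len]

frame = bytes(7)  # only 7 bytes left in the frame body
try:
    read_value(frame, 0, 8)  # but the encoding claims an 8-byte value
except ValueError as e:
    print(e)  # -> 8!=7
```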
It might (hopefully) be a stupid off-by-one error in the delivery disposition data logic.
Ok that's more promising than what I'm looking at. FWIW here is the code where the dispo (999) is moved to the outgoing delivery:
And here's where that dispo is set in the outgoing proton delivery: https://github.com/skupperproject/skupper-router/blob/266080f2e25a879298d98d8b695bcd0ead685b52/src/adaptors/amqp/amqp_adaptor.c#L2103
I've traced it via the debugger. Those codepaths are pretty generic. I doubt that the propagation of the dispo code is causing the issue.
Each of those three outgoing messages has its own dedicated proton delivery instance; it's funny how they're both being corrupted in the same way.
Ooh, what's the difference between qdr_delivery_t and pn_delivery_t? Is there duplicate processing there?
qdr_delivery_t is the router's abstract delivery state. Its data is read/written from its corresponding pn_delivery_t (one to one relationship). The qdr_delivery_t is visible to the core process and the pn_delivery_t is not (because we don't want the core touching the pn_delivery_t - that is only allowed on the proper proactor thread).
pn_delivery_t is the proton C representation of the AMQP delivery.
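To illustrate that split (a toy Python sketch with made-up names, not the router's real structures): the core manipulates only the abstract side, and a sync step running on the proactor thread pushes that state into the proton delivery.

```python
class PnDelivery:
    """Stand-in for proton's pn_delivery_t; touched only on the proactor thread."""
    def __init__(self):
        self.local_state = 0
        self.local_data = None

class QdrDelivery:
    """Stand-in for the router core's abstract delivery state."""
    def __init__(self, pn_dlv):
        self.disposition = 0
        self.extension_data = None
        self._pn = pn_dlv  # one-to-one link; the core never dereferences it

def sync_to_proton(qdr):
    """Run on the proactor thread: copy core state into the pn_delivery."""
    qdr._pn.local_state = qdr.disposition
    qdr._pn.local_data = qdr.extension_data
```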
Possible clue: if I change the test code to use a proper delivery outcome (range 0x23-0x27) then the dispo propagates fine. If I use any other value (<0x23 or >0x27) then the router fails to propagate that value and we see the 8!=7 in the protocol log.
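For reference, that range lines up with the standard AMQP 1.0 delivery-state descriptor codes from the spec; a quick plain-Python check of whether a code falls in the range observed to propagate cleanly:

```python
# Standard AMQP 1.0 delivery-state descriptor codes (per the AMQP 1.0 spec):
AMQP_STATES = {
    0x23: "received",
    0x24: "accepted",
    0x25: "rejected",
    0x26: "released",
    0x27: "modified",
}

def is_standard_state(code: int) -> bool:
    """True for the descriptor codes the observation above says work."""
    return code in AMQP_STATES

print(is_standard_state(0x24))   # accepted -> True
print(is_standard_state(999))    # 0x3e7, the test's custom code -> False
```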
The router is using the pn_delivery_update() API to set the dispo value.
Yeah, I think that gives me a great place to start digging into the new proton disposition code. Thx
I wonder if there is any way I can get the raw protocol bytes to go along with the frame decodes - that might give me a big clue about what went wrong.
@kgiusti @ganeshmurthy I've now merged a fix for this to qpid-proton main branch. Let me know if this issue is now resolved.
@astitcher @kgiusti I raised a PR with the test uncommented - https://github.com/skupperproject/skupper-router/pull/1587/files
I saw some new failures on the same test, not sure if this is related to the proton commit. Here is the failure - https://github.com/skupperproject/skupper-router/actions/runs/10116788047/job/27980377966?pr=1587#step:34:6818
@astitcher @kgiusti I see the same test failure even on the main branch where the multicast test is commented out. I see the test failure in <tests.system_tests_two_routers.TwoRouterExtensionStateTest testMethod=test_02_closest>
Here is the same failure on main - https://github.com/skupperproject/skupper-router/actions/runs/10149317776/job/28064340945#step:10:5786
https://github.com/skupperproject/skupper-router/actions/runs/9977439739/job/27572204457#step:10:2608