[TCK Challenge] Order testing of JAXRSClientIT.sseBroadcastTest results

jamezp commented 1 year ago

The ee.jakarta.tck.ws.rs.jaxrs21.ee.sse.ssebroadcaster.JAXRSClientIT.sseBroadcastTest expects the results of a client to be in a specific order. Unless I'm missing something, I don't see where there is a requirement that events be received in a specific order. I'm not even sure how the receive order could be guaranteed.

In section 9.1 it states:

SSE is a messaging protocol where the event field corresponds to a topic, and where the id field can be used to validate event order and guarantee continuity.

I've been seeing this fail more often on our CI. Though we're still waiting for #1146 to be merged so we can see the messages, when I see it fail with a SNAPSHOT TCK, the contents are all there just in a different order. My assumption is the client receives the responses in a different order due to network activity.

This test has periodically failed for RESTEasy, but seems to be getting worse on CI. I've never been able to reproduce it locally.

Here are some logs from a failing test:

07-18-2023 12:07:00:  TRACE: [WIRE] - << Connection: keep-alive
07-18-2023 12:07:00:  TRACE: [WIRE] - << Content-Length: 5
07-18-2023 12:07:00:  TRACE: [WIRE] - << Content-Type: application/octet-stream
07-18-2023 12:07:00:  TRACE: [WIRE] - << Date: Tue, 18 Jul 2023 12:07:00 GMT
07-18-2023 12:07:00:  TRACE: [WIRE] - << CLOSE
07-18-2023 12:07:00:Client 0 Received message WELCOME3
07-18-2023 12:07:00:Client 0 Received message some_ServiceUnavailableEndpoint_message0
07-18-2023 12:07:00:Client 0 Received message some_ServiceUnavailableEndpoint_message1
07-18-2023 12:07:00:Client 0 Received message some_ServiceUnavailableEndpoint_message2
07-18-2023 12:07:00:Client 0 Received message some_ServiceUnavailableEndpoint_message3
07-18-2023 12:07:00:Client 0 Received message some_ServiceUnavailableEndpoint_message4
07-18-2023 12:07:00:Client 0 Received message some_ServiceUnavailableEndpoint_message5
07-18-2023 12:07:00:Client 0 Received message some_ServiceUnavailableEndpoint_message6
07-18-2023 12:07:00:Client 1 Received message WELCOME1
07-18-2023 12:07:00:Client 1 Received message some_ServiceUnavailableEndpoint_message0
07-18-2023 12:07:00:Client 1 Received message some_ServiceUnavailableEndpoint_message1
07-18-2023 12:07:00:Client 1 Received message some_ServiceUnavailableEndpoint_message2
07-18-2023 12:07:00:Client 1 Received message some_ServiceUnavailableEndpoint_message3
07-18-2023 12:07:00:Client 1 Received message some_ServiceUnavailableEndpoint_message4
07-18-2023 12:07:00:Client 1 Received message some_ServiceUnavailableEndpoint_message5
07-18-2023 12:07:00:Client 1 Received message some_ServiceUnavailableEndpoint_message6
07-18-2023 12:07:00:Client 2 Received message WELCOME2
07-18-2023 12:07:00:Client 2 Received message some_ServiceUnavailableEndpoint_message1
07-18-2023 12:07:00:Client 2 Received message some_ServiceUnavailableEndpoint_message0
07-18-2023 12:07:00:Client 2 Received message some_ServiceUnavailableEndpoint_message2
07-18-2023 12:07:00:Client 2 Received message some_ServiceUnavailableEndpoint_message3
07-18-2023 12:07:00:Client 2 Received message some_ServiceUnavailableEndpoint_message4
07-18-2023 12:07:00:Client 2 Received message some_ServiceUnavailableEndpoint_message5
07-18-2023 12:07:00:Client 2 Received message some_ServiceUnavailableEndpoint_message6
07-18-2023 12:07:00:Client 3 Received message WELCOME4
07-18-2023 12:07:00:Client 3 Received message some_ServiceUnavailableEndpoint_message1
07-18-2023 12:07:00:Client 3 Received message some_ServiceUnavailableEndpoint_message0
07-18-2023 12:07:00:Client 3 Received message some_ServiceUnavailableEndpoint_message2
07-18-2023 12:07:00:Client 3 Received message some_ServiceUnavailableEndpoint_message3
07-18-2023 12:07:00:Client 3 Received message some_ServiceUnavailableEndpoint_message4
07-18-2023 12:07:00:Client 3 Received message some_ServiceUnavailableEndpoint_message5
07-18-2023 12:07:00:Client 3 Received message some_ServiceUnavailableEndpoint_message6
07-18-2023 12:07:00:Client 4 Received message WELCOME0
07-18-2023 12:07:00:Client 4 Received message some_ServiceUnavailableEndpoint_message0
07-18-2023 12:07:00:Client 4 Received message some_ServiceUnavailableEndpoint_message1
07-18-2023 12:07:00:Client 4 Received message some_ServiceUnavailableEndpoint_message2
07-18-2023 12:07:00:Client 4 Received message some_ServiceUnavailableEndpoint_message3
07-18-2023 12:07:00:Client 4 Received message some_ServiceUnavailableEndpoint_message4
07-18-2023 12:07:00:Client 4 Received message some_ServiceUnavailableEndpoint_message5
07-18-2023 12:07:00:Client 4 Received message some_ServiceUnavailableEndpoint_message6

As you can see the data is all there, just in a different order than the test expects.

spericas commented 1 year ago

@alwin-joseph Where is this test located? @jamezp May help to describe a bit more what the test is doing. Can't tell from the description if this is a bug or not.

spericas commented 1 year ago

@jansupol Any thoughts about this?

jamezp commented 1 year ago

@spericas From what I can tell it just sends some events via an SSE broadcast. It then waits until the first BroadcastClient, effectively just a holder for the events, gets all the requests or 5 seconds has passed. It then processes the results of the client checking that each BroadcastClient, there should be 5, contains all the event messages sent. However, it expects those messages are added in the queue in a specific order.

jansupol commented 1 year ago

Yes, the test sends 7 messages and checks the responses are received, but the order of responses is not guaranteed if sent over the network. I am not sure how this can be reordered on a single machine. But for instance, if the broadcaster would use multiple threads for sending those 7 messages, then yes, the order can be different than the test expects.

The test can be easily fixed, I believe, either by sending those 7 broadcasts only after each response is received or by sorting the responses based on the message id.

The order is expected so that it is easily verified, not because the order is tested; the test has a goal to test broadcaster functionality.

jamezp commented 1 year ago

Yes, I can send a PR up to fix the test. I'm not sure why I didn't just do that. I guess I was ready to leave for my PTO and didn't think about it :)

I think the simplest solution would simply be to check the size, then use the Collection.containsAll() rather than iterate through each one. Or we could just use a working collection and remove the entries from there so we know the diff of what's expected vs what was received.

Like I said too, it only happens on CI for RESTEasy. I've never been able to reproduce this locally. I ran it 100 time the other day and it passed every times.

ivassile commented 1 year ago

@jamezp This issue is currently causing our EAP 8 pipeline not to complete. What's the status of the fix? Thanks!

jamezp commented 1 year ago

@jamezp This issue is currently causing our EAP 8 pipeline not to complete. What's the status of the fix? Thanks!

I've not had a chance to work on a PR. Even if we did, it would require a new TCK release which I'm not sure if there is a plan for or not.

scottmarlow commented 1 year ago

From the comments so far, I think this challenge should be accepted.

As per https://jakarta.ee/committees/specification/tckprocess, if this challenge is accepted, the accepted label should be added to this issue and the issue closed.

If this challenge is accepted, I expect that implementations could exclude the challenged test from the test run or include the failure in the TCK results. Either way, they should link to this accepted TCK challenge if they hit this ordering issue.

A possible solution would be to make the test ordering change and back port the fix to the https://github.com/jakartaee/rest/tree/release-4.0 branch and release https://www.eclipse.org/downloads/download.php?file=/jakartaee/restful-ws/3.1/jakarta-restful-ws-tck-3.1.4.zip with the fix. This should only be done if the REST Specification team feels that the test change is safe in that other implementations will not be impacted by the change.

brideck commented 1 year ago

From the comments so far, I think this challenge should be accepted.

Agreed. This is an intermittent failure we see in testing with Open Liberty as well. I don't think we would be opposed to seeing a fix for this backported to 3.1.x.

jim-krueger commented 1 year ago

I can add this change as a PR against the release-4.0 branch. Is it important to create a 3.1.4 for this as well or can the challenge just be accepted and the test excluded if needed?

jamezp commented 1 year ago

IMO it's only important if we do a new 3.1 release of the TCK. It would be nice to not have the intermittent failures, but not critical IMO. I think if accepted we can ignore the test via the maven-failsafe-plugin, correct?

jim-krueger commented 1 year ago

@jamezp IMO it's only important if we do a new 3.1 release of the TCK

Not sure why this is only important for a new 3.1 TCK release. Certainly the same intermittent failure would occur in 4.0 if not addressed. But I agree that it may not be important enough to warrant an additional 3.1 release of the TCK.

jamezp commented 1 year ago

@jamezp IMO it's only important if we do a new 3.1 release of the TCK

Not sure why this is only important for a new 3.1 TCK release. Certainly the same intermittent failure would occur in 4.0

Sorry yes, that is what I meant. We definitely need it for 4.0, it's more of a nice-to-have if we're doing another 3.1 release. That's just my take on it though :)

jim-krueger commented 1 year ago

I see that PR 1164 is already anticipating the creation of a version 3.1.4 of the TCK so this change might as well go along with that.

Any objections?

alwin-joseph commented 1 year ago

I see that PR 1164 is already anticipating the creation of a version 3.1.4 of the TCK so this change might as well go along with that.

Any objections?

In https://github.com/jakartaee/rest/pull/1164#issuecomment-1655627770 we were discussing to exclude the challenged test in 3.1.4 and fix in master/4.0.

If you agree the fix PR for this challenge can also be directed to master or 4.0 and we can exclude this test in 3.1.4.

jim-krueger commented 1 year ago

This would be the same for my PR (https://github.com/jakartaee/rest/pull/1173). I created this PR against release 3.1.x to match PR 1164 https://github.com/jakartaee/rest/pull/1164 but Santiago questioned this. Should both of these PRs be moved to master instead?

Thanks

On Aug 24, 2023, at 3:23 PM, Alwin Joseph @.***> wrote:

I see that PR 1164 https://github.com/jakartaee/rest/pull/1164 is already anticipating the creation of a version 3.1.4 of the TCK so this change might as well go along with that.

Any objections?

In #1164 (comment) https://github.com/jakartaee/rest/pull/1164#issuecomment-1655627770 we were discussing to exclude the challenged test in 3.1.4 and fix in master/4.0.

If you agree the fix PR for this challenge can also be directed to master or 4.0 and we can exclude this test in 3.1.4.

— Reply to this email directly, view it on GitHub https://github.com/jakartaee/rest/issues/1162#issuecomment-1692356908, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHSLZQF4OKJ3KMIP4VPOMR3XW6Z47ANCNFSM6AAAAAA2OTYEUI. You are receiving this because you commented.

alwin-joseph commented 1 year ago

@spericas Can you please mark this as accepted (As per https://jakarta.ee/committees/specification/tckprocess, if this challenge is accepted, the accepted label should be added to this issue and the issue closed, ref https://github.com/jakartaee/rest/issues/1162#issuecomment-1677759200 ).

PR to exclude this test for 3.1.4 TCK is available at https://github.com/jakartaee/rest/pull/1173

jim-krueger commented 1 year ago

@alwin-joseph labeled and closed.

jakartaee / rest

[TCK Challenge] Order testing of JAXRSClientIT.sseBroadcastTest results #1162