jakartaee / rest

Jakarta RESTful Web Services
Other
353 stars 114 forks source link

TCK Challenge: `!Future.isDone` Issues with `ee.jakarta.tck.ws.rs.jaxrs21.ee.client.executor.rx.JAXRSClientIT` & `ee.jakarta.tck.ws.rs.jaxrs21.ee.client.rxinvoker.JAXRSClientIT` #1123

Closed Pandrex247 closed 1 year ago

Pandrex247 commented 1 year ago

Specification:

Jakarta RESTful Web Services

Challenged Test(s):

TCK Version:

3.1.1

Tested implementation:

Payara 6.2022.1.Alpha3 (RC2 tag)

Description:

When run on certain environments (e.g. an AWS instance with 2 vCPUs), a number of these tests fail sporadically due to "Future cannot be done, yet!". The failures are not consistent, implying some sort of timing issue - they happen frequently enough though that we haven't managed to get a clean run of all the tests in this class.

Looking at the way these tests work, they appear to be a bit optimistic with their timings. There's no coded delay, sleep, wait or equivalent mechanism between submitting an asynchronous task and checking if it hasn't completed yet.

Most of the tests run in a similar way, so I'll take getWithGenericResponseTest as an example. Line 211. This just calls a method with the same name in the parent class here: JAXRSClientIT.java (different package) Line 446

This steps through like so:

With that established, what I see on the AWS instance is that the asynchronous get request sometimes completes and returns successfully before the call to !future.isDone(), meaning the test fails.

  1. The client submits an asynchronous get request to the JAX-RS endpoint
  2. The JAX-RS endpoint is called asynchronously and returns
  3. The client checks whether the response has completed, sees that is has, and determines the test has failed (even though the call did happen asynchronously and did complete successfully)

Putting some log statements to print out the System.nanoTime() in, I get the following results locally (where the test always passes):

So there's roughly 11-12 milliseconds between the task getting submitted and checking if it hasn't completed yet, with the actual asynchronous task completing within around 33-34ms - not much margin for error.

Running the same scenario against our AWS instance with only 2 vCPUs, these are the timings from one of the runs where it fails:

As you can see (if you squint closely), the asynchronous get returned within roughly 6 milliseconds before the !future.isDone() check happens roughly 18 milliseconds later (so 12 milliseconds after the async call has returned), meaning the test "failed" even though functionally it worked. We can even see this if we look at the logs from the test itself:

Before Line 449: 2132680615242
07-15-2022 11:29:13:[Client EXECUTOR SERVICE check]: running from thread JAXRS_TCK_THREAD3
07-15-2022 11:29:13:  TRACE: [WIRE] - >> GET http://localhost:8080/jaxrs_jaxrs21_ee_client_executor_rx_web/resource/get
07-15-2022 11:29:13:  TRACE: [WIRE] - >> 
07-15-2022 11:29:13:  TRACE: [WIRE] - << 200 OK
07-15-2022 11:29:13:  TRACE: [WIRE] - << X-Frame-Options: SAMEORIGIN
07-15-2022 11:29:13:  TRACE: [WIRE] - << Server: Payara Server  6 #badassfish
07-15-2022 11:29:13:  TRACE: [WIRE] - << Content-Length: 3
07-15-2022 11:29:13:  TRACE: [WIRE] - << Content-Type: text/html
07-15-2022 11:29:13:  TRACE: [WIRE] - << X-Powered-By: Servlet/6.0 JSP/3.1 (Payara Server  6 #badassfish Java/Azul Systems, Inc./11)
After Line 449: 2132698293606
Before Line 1909: 2132698580708

Conclusion/ TLDR: the ee.jakarta.tck.ws.rs.jaxrs21.ee.client.rxinvoker.JAXRSClientIT#checkFutureOkResponse and ee.jakarta.tck.ws.rs.jaxrs21.ee.client.rxinvoker.JAXRSClientIT#checkFutureString methods should be adjusted since they're too optimistic, either with sort some of wait (e.g. in the resource methods themselves), or by simply dropping the !future.isDone() since I'm not actually sure what that's meant to be testing ¯\_(ツ)_/¯

jansupol commented 1 year ago

The check is there to verify the asynchronous-ity of the reactive calls, i.e. that the request was not blocking and the future was not done because of the synchronous blocking request. It is possible the computers are too fast these days to make the async request and the tests should be adjusted to check this in a better way.

jansupol commented 1 year ago

Eh, by blocking I mean blocking the main thread since the rx calls are meant to be in separate threads.

jamezp commented 1 year ago

+1 to this. I was never able to get the TCK to pass with RESTEasy on GitHub Actions because of these issues.

brideck commented 1 year ago

Open Liberty also sees failures in these tests quite often in certain test environments.

spericas commented 1 year ago

We should indeed change these tests to avoid checking doneness this way. If the goal is to verify execution in another thread (asynchronously) we should then store thread info in the call and compare it to the main thread's. In any case, the checks for done in this way should be dropped.

brideck commented 1 year ago

Given the most recent comment, what should the outcome of this challenge be? Normally, these tests would be excluded from the 3.1.x TCK and then added back into a future version when the implementation has been updated.

spericas commented 1 year ago

@brideck We will exclude those tests and fix them in a future release