Open barbetb opened 2 years ago
What's the bug? Could you rewrite this as a failing test case?
So I have this retry interceptor:
@Slf4j
@RequiredArgsConstructor
public class RetryOnFailureInterceptor implements Interceptor {

    private final int maxRetryCount;

    @Override
    public Response intercept(final Chain chain) throws IOException {
        final Request request = chain.request();
        int tryCount = 0;
        Response response = null;
        // the first call is the actual call, the following are retries
        while ((response == null || !response.isSuccessful()) && tryCount < this.maxRetryCount) {
            tryCount++;
            try {
                response = chain.proceed(request);
                if (response.isSuccessful()) {
                    return response;
                }
                // close the response before retrying
                response.close();
                log.info("Intercept, request failed, retry_count={}/{} ", tryCount, this.maxRetryCount);
            } catch (final IOException ioException) {
                log.info("Caught exception with message={}, retry_count={}/{} ",
                        ioException.getMessage(), tryCount, this.maxRetryCount);
                if (response != null && response.body() != null) {
                    response.close();
                }
            }
        }
        // the last try should proceed as is
        return chain.proceed(request);
    }
}
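For reference, the control flow of this retry loop can be exercised in isolation, without OkHttp, using a stand-in for the interceptor chain (the `FlakyCall` interface and all names here are hypothetical, purely to illustrate the loop's behavior):

```java
import java.io.IOException;

public class RetryLoopSketch {

    // Hypothetical stand-in for chain.proceed(): may throw, returns true on success.
    interface FlakyCall {
        boolean proceed() throws IOException;
    }

    // Mirrors the interceptor's loop: retry while unsuccessful and under the cap,
    // swallowing IOExceptions, with one final uncaught attempt if all retries fail.
    static int attemptsUntilSuccess(FlakyCall call, int maxRetryCount) throws IOException {
        int tryCount = 0;
        boolean success = false;
        while (!success && tryCount < maxRetryCount) {
            tryCount++;
            try {
                success = call.proceed();
            } catch (IOException ignored) {
                // swallowed, the loop retries
            }
        }
        if (!success) {
            call.proceed(); // "last try should proceed as is" -- may throw
        }
        return tryCount;
    }

    public static void main(String[] args) throws IOException {
        final int[] calls = {0};
        // Fails twice with IOException, then succeeds on the third attempt.
        FlakyCall flaky = () -> {
            calls[0]++;
            if (calls[0] < 3) throw new IOException("simulated timeout");
            return true;
        };
        System.out.println(attemptsUntilSuccess(flaky, 5));
    }
}
```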
When I test it like this:
private MockWebServer mockWebServer;

@BeforeEach
void setUp() throws IOException {
    this.mockWebServer = new MockWebServer();
    this.mockWebServer.start(8080);
}

@AfterEach
void tearDown() throws IOException {
    this.mockWebServer.shutdown();
}
@Test
void withRetryInterceptorsTest() throws IOException {
    // Custom client builder of mine: it sets timeouts and the interceptor, and it sets
    // okhttpclient.retryOnConnectionFailure(false) to avoid duplicate retry mechanisms.
    // Enabling retryOnConnectionFailure does not fix the problem either.
    final OkHttpClient httpClient = new CustomOkHttpClientBuilder(getHttpSettings())
            .withRetryInterceptor(5)
            .build();
    final MockResponse timeout1 = new MockResponse().setBodyDelay(20, TimeUnit.DAYS).setHeadersDelay(20, TimeUnit.DAYS);
    final MockResponse timeout2 = new MockResponse().setSocketPolicy(SocketPolicy.NO_RESPONSE); // another way to time out
    final MockResponse failure = new MockResponse().setResponseCode(400);
    final MockResponse success = new MockResponse().setResponseCode(200);
    this.mockWebServer.enqueue(timeout1);
    this.mockWebServer.enqueue(timeout2);
    this.mockWebServer.enqueue(failure);
    this.mockWebServer.enqueue(success);
    final Request request = new Request.Builder()
            .url("http://localhost:8080")
            .get()
            .build();
    final Response response = httpClient.newCall(request).execute();
    assertEquals(200, response.code());
}
Running the tests will cause this:
21:47:05.337 [main] INFO RetryOnFailureInterceptor - Caught exception with message=Socket closed, retry_count=1/5
21:47:05.341 [main] INFO RetryOnFailureInterceptor - Caught exception with message=Canceled, retry_count=2/5
21:47:05.341 [main] INFO RetryOnFailureInterceptor - Caught exception with message=Canceled, retry_count=3/5
21:47:05.342 [main] INFO RetryOnFailureInterceptor - Caught exception with message=Canceled, retry_count=4/5
21:47:05.342 [main] INFO RetryOnFailureInterceptor - Caught exception with message=Canceled, retry_count=5/5
~That looks like the Interceptor is not concurrent thread safe. It doesn't look like it will handle concurrent requests.~
~Can you fix that first and reproduce?~
Hi, what in the code do you see that is not thread safe? It's an interceptor on the client, and there is no state in it other than the number of retries (which is always fixed). The interceptor works in my integration test against an actual application, just not with MockWebServer.
@barbetb apologies, I'm tired, read the response as a field not a local var. You are correct.
@yschimke no worries, thanks for taking time to review this. Would be nice if we found a solution.
Thanks for the repro. I don't get the same cancellation messages that you do, but I could reproduce a similar failure on my MacBook.
I added a custom EventListener that logs connection events. In my case the socket address is the key: my local DNS returns 127.0.0.1 and then 0:0:0:0:0:0:0:1.
connectStart okhttp3.internal.connection.RealCall@76f7d241 localhost/127.0.0.1:8080
...
connectStart okhttp3.internal.connection.RealCall@76f7d241 localhost/0:0:0:0:0:0:0:1:8080
connectFailed okhttp3.internal.connection.RealCall@76f7d241 localhost/0:0:0:0:0:0:0:1:8080
Can you try with the following DNS override and see what behavior you observe?
.dns(object : Dns {
    override fun lookup(hostname: String): List<InetAddress> {
        return listOf(Inet4Address.getByAddress(byteArrayOf(127, 0, 0, 1)))
    }
})
Yes, we are definitely a step further now, but not all problems are solved. I simplified the test by removing my custom builder, so you can run it as-is (I verified the custom builder is not the problem):
@Test
void withRetryInterceptorTimeoutTest() throws IOException {
    startWebserver(8080);
    final OkHttpClient httpClient = new OkHttpClient.Builder()
            .addInterceptor(new RetryOnFailureInterceptor(4))
            .connectTimeout(1000, TimeUnit.MILLISECONDS)
            .retryOnConnectionFailure(false)
            .dns(new Dns() {
                @Override
                @SneakyThrows
                public List<InetAddress> lookup(final String hostname) {
                    final byte[] localhost = {127, 0, 0, 1};
                    return List.of(Inet4Address.getByAddress(localhost));
                }
            })
            .build();
    final MockResponse timeout1 = new MockResponse().setBodyDelay(20, TimeUnit.DAYS).setHeadersDelay(20, TimeUnit.DAYS);
    final MockResponse timeout2 = new MockResponse().setSocketPolicy(SocketPolicy.NO_RESPONSE);
    final MockResponse timeout3 = new MockResponse().setBody("somelong body to cause timeout").throttleBody(1, 50, TimeUnit.SECONDS); // another way to time out
    final MockResponse success = new MockResponse().setResponseCode(200);
    this.mockWebServer.enqueue(timeout1);
    this.mockWebServer.enqueue(timeout2);
    this.mockWebServer.enqueue(timeout3);
    this.mockWebServer.enqueue(success); // won't be reached
    final Request request = getRequest();
    final Response response = httpClient.newCall(request).execute();
    assertEquals(200, response.code());
}
As you can see, I created three ways to trigger a timeout. Your DNS addition fixed the problem for timeout2 (.setSocketPolicy(SocketPolicy.NO_RESPONSE)): if you enqueue only that timeout, the test succeeds!
Timeout 1 and timeout 3 are still problematic. They cause this exception:
20:47:16.804 [main] INFO nl.....okhttp.interceptor.RetryOnFailureInterceptor - Caught exception with message=timeout, retry_count=1/4
20:47:26.815 [main] INFO nl.....okhttp.interceptor.RetryOnFailureInterceptor - Caught exception with message=timeout, retry_count=2/4
Jan 05, 2022 8:47:26 PM okhttp3.mockwebserver.MockWebServer$SocketHandler handle
WARNING: MockWebServer[8080] connection from /127.0.0.1 didn't make a request
java.io.IOException: Gave up waiting for queue to shut down
Unfortunately I discovered something else: if I add
    .connectTimeout(1000, TimeUnit.MILLISECONDS)
    .callTimeout(300, TimeUnit.MILLISECONDS)
to the client, the test fails again, even for the .setSocketPolicy(SocketPolicy.NO_RESPONSE) scenario:
21:07:08.481 [main] INFO nl.company.online.appl.okhttp.interceptor.RetryOnFailureInterceptor - Caught exception with message=Socket closed, retry_count=1/4
21:07:08.486 [main] INFO nl.company.online.appl.okhttp.interceptor.RetryOnFailureInterceptor - Caught exception with message=Canceled, retry_count=2/4
21:07:08.486 [main] INFO nl.company.online.appl.okhttp.interceptor.RetryOnFailureInterceptor - Caught exception with message=Canceled, retry_count=3/4
21:07:08.486 [main] INFO nl.company.online.appl.okhttp.interceptor.RetryOnFailureInterceptor - Caught exception with message=Canceled, retry_count=4/4
Jan 05, 2022 9:07:08 PM okhttp3.mockwebserver.MockWebServer$SocketHandler handle
WARNING: MockWebServer[8080] connection from /127.0.0.1 didn't make a request
java.io.InterruptedIOException: timeout
So I can't see anything wrong with the request processing, or with the timeouts as observed by the client.
this.mockWebServer.enqueue(timeout3);
this.mockWebServer.enqueue(success); //won't be reached
timeout3 is expected to complete the client request: its 200 status line arrives immediately (only the body is throttled), the interceptor returns that result as a success, and so the fourth enqueued response is never needed.
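The throttle arithmetic backs this up: throttleBody only slows the body bytes, which are sent after the status line, so the interceptor's isSuccessful() check passes as soon as the headers arrive. A quick sketch of the numbers (assuming the 30-byte body from the test above and OkHttp's default 10-second read timeout):

```java
public class ThrottleMath {
    public static void main(String[] args) {
        long bodyBytes = "somelong body to cause timeout".length(); // 30 bytes
        long bytesPerPeriod = 1;
        long periodSeconds = 50; // throttleBody(1, 50, TimeUnit.SECONDS)
        // Time to stream the whole body at the throttled rate:
        long fullBodySeconds = (bodyBytes / bytesPerPeriod) * periodSeconds;
        System.out.println(fullBodySeconds);
        // A body reader would time out, since one throttle period already
        // exceeds OkHttp's default readTimeout of 10 seconds:
        long readTimeoutSeconds = 10;
        System.out.println(readTimeoutSeconds < periodSeconds);
    }
}
```

But since nothing in the test reads the body, no read timeout ever fires on the client side.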
But the remaining issue seems to be that these timeouts get MockWebServer into an unhealthy state where the awaitIdleTask doesn't complete ~even though nothing is in front of it~ because the request processing is blocked. The thread presumably terminated for another reason earlier.
java.io.IOException: Gave up waiting for queues to shut down: [Q10001 false MockWebServer /127.0.0.1:56334:[OkHttp awaitIdle], Q10003 false MockWebServer /127.0.0.1:56388:[]]
at mockwebserver3.MockWebServer.shutdown(MockWebServer.kt:464)
If we can work out why that is happening and fix it, it should solve this problem.
Yes, it reports 'waiting for queues to shut down'. So my initial interpretation is that there is a timeout on MockWebServer's queue of responses, and if one takes too long (longer than the response's own timeout), it shuts itself down. Unfortunately I'm totally unfamiliar with Kotlin, so I can't help much more with debugging.
I believe the root cause is that the delay operations don't allow MockWebServer to react to the socket being closed.
In the failing cases the threads are waiting at
Thread[MockWebServer /127.0.0.1:60302,5,main]
java.base@11.0.11/java.lang.Thread.sleep(Native Method)
app//mockwebserver3.MockWebServer.sleepIfDelayed(MockWebServer.kt:860)
app//mockwebserver3.MockWebServer.writeHttpResponse(MockWebServer.kt:831)
app//mockwebserver3.MockWebServer.access$writeHttpResponse(MockWebServer.kt:100)
app//mockwebserver3.MockWebServer$SocketHandler.processOneRequest(MockWebServer.kt:638)
app//mockwebserver3.MockWebServer$SocketHandler.handle(MockWebServer.kt:564)
app//mockwebserver3.MockWebServer$serveConnection$1.invoke(MockWebServer.kt:481)
app//mockwebserver3.MockWebServer$serveConnection$1.invoke(MockWebServer.kt:479)
app//okhttp3.internal.concurrent.TaskQueue$execute$1.runOnce(TaskQueue.kt:102)
Thread[MockWebServer /127.0.0.1:60351,5,main]
java.base@11.0.11/java.lang.Thread.sleep(Native Method)
app//mockwebserver3.MockWebServer.sleepIfDelayed(MockWebServer.kt:860)
app//mockwebserver3.MockWebServer.throttledTransfer(MockWebServer.kt:917)
app//mockwebserver3.MockWebServer.writeHttpResponse(MockWebServer.kt:839)
app//mockwebserver3.MockWebServer.access$writeHttpResponse(MockWebServer.kt:100)
app//mockwebserver3.MockWebServer$SocketHandler.processOneRequest(MockWebServer.kt:638)
app//mockwebserver3.MockWebServer$SocketHandler.handle(MockWebServer.kt:564)
app//mockwebserver3.MockWebServer$serveConnection$1.invoke(MockWebServer.kt:481)
app//mockwebserver3.MockWebServer$serveConnection$1.invoke(MockWebServer.kt:479)
app//okhttp3.internal.concurrent.TaskQueue$execute$1.runOnce(TaskQueue.kt:102)
The client has no open connections, but the server doesn't react within the shutdown timeout.
A hacky fix in the internals like
private fun sleepIfDelayed(socket: Socket, delayMs: Long) {
  if (delayMs > 0L) {
    val until = System.currentTimeMillis() + delayMs
    var now = System.currentTimeMillis()
    while (now < until) {
      // sleep in short slices so we can notice the peer closing the socket
      Thread.sleep((until - now).coerceAtMost(1000))
      if (socket.isClosed) {
        return
      }
      now = System.currentTimeMillis()
    }
  }
}
Makes it work.
As a workaround until we agree on the correct fix (or discover something already supported that we're missing), add this to the end of your test to ensure cleanup:
Thread.getAllStackTraces().filterKeys { it.name.startsWith("MockWebServer /127.0.0.1") }.keys.forEach {
    it.interrupt()
}
Also, in your test you should close the response; the eventual fix will probably rely on the client having correctly disposed of requests.
I'll have to wait for a release with the fix, as I cannot use unreleased packages in my project. I'll just have to disable this test.
If anyone wants the cleanup code in Java:
Thread.getAllStackTraces().keySet().stream()
        .filter(thread -> thread.getName().contains("MockWebServer /127.0.0.1"))
        .forEach(Thread::interrupt);
Are you sure I need to close the response manually in the test? The interceptor I wrote always closes the response on failure, and a success response will always end up closed, I think.
You aren't consuming the response body (which would close it), and if you tried, it's the throttled third response, so it would take a long time. Ideally you should close responses cleanly.
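Since OkHttp's Response implements Closeable, the usual way to guarantee that cleanup in a test is try-with-resources, which closes the response even when the body is never consumed or an assertion fails. A minimal sketch of the pattern, using a stand-in class (a real Response would require a live call, so `FakeResponse` here is purely illustrative):

```java
public class CloseResponseSketch {

    // Stand-in for an OkHttp Response, which implements Closeable.
    static class FakeResponse implements AutoCloseable {
        boolean closed = false;
        int code() { return 200; }
        @Override public void close() { closed = true; }
    }

    public static void main(String[] args) {
        FakeResponse response = new FakeResponse();
        // try-with-resources closes the response even if the body is never read
        try (FakeResponse r = response) {
            System.out.println(r.code());
        }
        System.out.println(response.closed);
    }
}
```

In the actual test this would be `try (Response response = httpClient.newCall(request).execute()) { assertEquals(200, response.code()); }`.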
The workaround with cleanup via thread interrupts goes in your test; it doesn't depend on an unreleased version. The sleepIfDelayed kludge is an alternative to the thread interrupts.
Sorry, I don't understand how your sleepIfDelayed method works with my test. Where do I call it?
And is this the correct Java equivalent?
@SneakyThrows
private void sleepIfDelayed(final Socket socket, final long delayMs) {
    if (delayMs > 0L) {
        final long until = System.currentTimeMillis() + delayMs;
        long now = System.currentTimeMillis();
        while (now < until) {
            Thread.sleep(Math.min(until - now, 1000));
            if (socket.isClosed()) {
                return;
            }
            now = System.currentTimeMillis();
        }
    }
}
Don't use sleepIfDelayed (it's an internal change). You should be able to just interrupt the threads. Sorry, I don't always word good.
I know it's an old issue, but is there a fix in an upcoming release? I just reran the tests (I had disabled them so far), but the issues still seem to be there.
Additionally:
final MockResponse timeout3 = new MockResponse().setBody("somelong body to cause timeout").throttleBody(1, 50, TimeUnit.SECONDS);
does not generate a timeout. It doesn't seem to do anything.
I think I encountered a bug.
I wrote a custom retry handler and wanted to write a test for it. For the retry handler I followed some ideas from here: basically I catch the IOException and retry.
However, if I write a test like this, the client closes without retrying.
Exception:
After that exception, all requests are cancelled, so retrying is not possible.