square / okhttp

Square’s meticulous HTTP client for the JVM, Android, and GraalVM.
https://square.github.io/okhttp/
Apache License 2.0
45.77k stars 9.15k forks source link

Network calls hang after coming back online after being offline #3278

Open arstk8 opened 7 years ago

arstk8 commented 7 years ago

I have an interesting issue I'm seeing with okhttp, but so far I can only reproduce it on Android emulators (default Android emulator and Genymotion)

If I make a bunch of requests to the same server (enough to saturate the connection pool) and then go offline (airplane mode) and then come back online, any subsequent requests I make to that server hang until one of the connections in the connection pool is removed after 5 minutes of inactivity.

I'm not sure if the emulator thing is a timing issue (emulators execute faster in my experience) or if it's a quirk with the networking configuration vs a real device. The fact that it happens on more than one type of emulator seems significant to me.

I have a sample app which reproduces it here: https://github.com/arstk8/TestEsriNetworkCall/

I know you're preferred way to reproduce is via MockWebserver test case, but I can't figure out a way to reproduce it that way.

swankjesse commented 7 years ago

Thanks for putting together the sample app. If you have any insight into how OkHttp could reliably detect this situation, that’d be handy.

arstk8 commented 7 years ago

I don't, unfortunately. The only thing I can think of is some kind of reachability test, but that seems expensive.

kenumir commented 7 years ago

Basing on this example i make some test and:

This is my observation, I hope this helps

ekstro commented 7 years ago

Setting "keepAliveDuration" also has an impact on this.

yschimke commented 7 years ago

Had a quick look at this with OkHttp 3.9.0

A few comments:

1) Still happening with 3.9.0 2) The EventListener API makes it a lot easier to understand what okhttp is attempting e.g. when it reuses connections from before airplane mode etc. 3) I can't get http/2 to work to that server - I don't think the server supports ALPN 4) Can reproduce with other Http/2 servers, the connection is reestablished after a couple of requests [edit]

One option might be to support platform reachability services, specifically on Android, and test connections after reachability events before handing them out.

yschimke commented 7 years ago

Tagging as enhancement - I think since this is so easily reproducible even on Android 8 with OkHttp 3.9.0, that detecting this cleanly and making it work will be relatively high impact. n.b. I couldn't get into the permanent http/2 failure state that @kenumir saw, so maybe that is fixed.

tucomel commented 6 years ago

not fixed to me, Im using okhttp3.10 with emulator on android api 27 and just after go back online (turn airplane mode off) at first request the the timeout error come out, but after first request everything works

minobilis commented 5 years ago

I had the same issue on emulator and some samsung devices. Forcing okhttp client to use HTTP/1.1 helped me to avoid such exceptions.

new OkHttpClient.Builder() .protocols(Arrays.asList(Protocol.HTTP_1_1)) .build();

It might be due to how HTTP/2 connections established and maintained at OS level on these devices.

heltonandreazza commented 5 years ago

Hey guys,

I thought this was supposed to be fixed here https://github.com/square/okhttp/issues/3261, but I ran @arstk8's sample (https://github.com/arstk8/TestEsriNetworkCall/) with okhttp 3.12.3 and sdk 26, the issue seems to persist.

@minobilis I've also tried your workaround by changing on MainActivity.java:34 new OkHttpClient.Builder().build() to new OkHttpClient.Builder().protocols(Arrays.asList(Protocol.HTTP_1_1)).build() it doesn't fix the problem either.

Apparently the issue presents itself only on Android emulators.

Is there any fix on sight that could be done? Is there any difference between making HTTP requests on emulators and physical devices?

Thanks in advance!!

yschimke commented 4 years ago

Probably worth retesting after https://github.com/square/okhttp/pull/5920 which is in 4.5, I'm hopeful that fixes a ton of connectivity issues with dodgy Android networking.

lyind commented 3 years ago

I think I am still seeing this with OkHttp 4.9.1 on Android 9/10. It is critical for me.

This is a thread dump of the connection thread in question:

OkHttp bla.example.com ( 571 )
        com.android.org.conscrypt.NativeCrypto.SSL_read(Native Method)
        com.android.org.conscrypt.NativeSsl.read(NativeSsl.java:407)
        com.android.org.conscrypt.ConscryptFileDescriptorSocket$SSLInputStream.read(ConscryptFileDescriptorSocket.java:579)
        okio.InputStreamSource.read(JvmOkio.kt:90)  <-- Nothing to read here, socket probably gone, HTTP/2 keep-alive anyone?
        okio.AsyncTimeout$source$1.read(AsyncTimeout.kt:129)
        okio.RealBufferedSource.request(RealBufferedSource.kt:206)
        okio.RealBufferedSource.require(RealBufferedSource.kt:199)
        okhttp3.internal.http2.Http2Reader.nextFrame(Http2Reader.kt:89)
        okhttp3.internal.http2.Http2Connection$ReaderRunnable.invoke(Http2Connection.kt:618)
        okhttp3.internal.http2.Http2Connection$ReaderRunnable.invoke(Http2Connection.kt:609)
        okhttp3.internal.concurrent.TaskQueue$execute$1.runOnce(TaskQueue.kt:98)
        okhttp3.internal.concurrent.TaskRunner.runTask(TaskRunner.kt:116)
        okhttp3.internal.concurrent.TaskRunner.access$runTask(TaskRunner.kt:42)
        okhttp3.internal.concurrent.TaskRunner$runnable$1.run(TaskRunner.kt:65)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
        java.lang.Thread.run(Thread.java:764)

Is that the same issue or a different one? Now trying to limit the protocol used to HTTP/1.1.

I have the client configured like this:

        this.client = builder
                .addInterceptor(new DefaultHeaderInterceptor(userAgent))
                .retryOnConnectionFailure(true)
                .pingInterval(8, TimeUnit.SECONDS)
                .connectTimeout(9, TimeUnit.SECONDS)
                .readTimeout(7, TimeUnit.SECONDS)
                .writeTimeout(7, TimeUnit.SECONDS)
                .connectionPool(new ConnectionPool(1, 25, TimeUnit.SECONDS))
                ...
                .build();
lyind commented 3 years ago

Limiting Protocol to HTTP/1.1 did not help much. The exact same issues still occurs.

This bug is especially easy to notice when restricting the ConnectionPool to 1 idle connection (as then all HTTP requests will hang when the issue occurs).

Ideas: At least for HTTP2 it should be possible to walk the ConnectionPool's IDLE connections and check if there was any Pong received recently (if timeout enabled) and forcefully close() or interrupt() the connection/thread. That should fix the issue. For HTTP/1.1 maybe one could use some "last response received" timestamp together with the connection timeout?

korilin commented 1 year ago

When I use EventListener to collect network time statistics, I find some connection timeout longer than OkHttpClient setting.

callStart(302619159)->proxySelectStart(302619209)->withRequestStart(302619210)->proxySelectEnd(302619210)->dnsStart(302619210)->proxySelectStart(302619221)->withRequestStart(302619221)->proxySelectEnd(302619222)->dnsStart(302619222)->dnsEnd(302619234)->connectStart(302619237)->connectFailed(302639262)->connectStart(302639264)->connectFailed(302647219)->proxySelectStart(302647264)->withRequestStart(302647264)->proxySelectEnd(302647265)->dnsStart(302647265)->proxySelectStart(302647275)->withRequestStart(302647275)->proxySelectEnd(302647275)->dnsStart(302647275)->dnsEnd(302647284)->connectStart(302647285)->connectFailed(304721621)->canceled(304721622)->withRequestEnd(304721622)->callFailed(304721639)->withRequestEnd(304721639)

The problem fragment is connectStart(302647285)->connectFailed(304721621)->canceled(304721622). It took 2074336 ms. The data of this abnormal time is more than 3% of the reported data in our project.

OKHttpClient Setting:

Other Info:

Is this the same problem mentioned in this issue?

lyind commented 1 year ago

Is this the same problem mentioned in this issue?

It sounds at least somewhat related.

It's been some time, but IIRC the problem is that in certain situations the HTTP library really can't do much as it doesn't receive a timely response from the OS. Some active watcher thread/timer and more complicated logic would be required.

yschimke commented 1 year ago

The other approach which we haven't wanted to get into, is to subscribe to android network changes.

https://github.com/square/okhttp/issues/7470

ln-12 commented 4 months ago

We still have this issue with version 4.12.0 while using it with connect. What I noticed is that is occurs only in streaming requests. However, non-streaming requests seem to be blocked until the streaming call times out. Also, when I toggle the network connection, only every second streaming request fails. When I do it again, it then works. Doing it another time, it fails again.

I tried to listen to the network state and call dispatcher.cancelAll() on my custom dispatcher when the connection is lost but this also did not work. Are there any other ideas for a workaround?

The exception I am getting after the call timeout is:

Unhandled exception
    okhttp3.internal.http2.StreamResetException: stream was reset: CANCEL
        at okhttp3.internal.http2.Http2Stream.takeHeaders(Http2Stream.kt:148)
        at okhttp3.internal.http2.Http2ExchangeCodec.readResponseHeaders(Http2ExchangeCodec.kt:97)
        at okhttp3.internal.connection.Exchange.readResponseHeaders(Exchange.kt:110)
        at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.kt:93)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
        at com.connectrpc.okhttp.ConnectOkHttpClient$applyNetworkInterceptor$1.intercept(ConnectOkHttpClient.kt:61)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:34)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
        at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
        at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:76)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
        at com.example.network.ChainExceptionInterceptor.intercept(ChainExceptionInterceptor.kt:15)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
        at okhttp3.logging.HttpLoggingInterceptor.intercept(HttpLoggingInterceptor.kt:154)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
        at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
        at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:644)
        at java.lang.Thread.run(Thread.java:1012)
ln-12 commented 4 months ago

I think that I found out what is actually happening on emulators (but not only on them)!

After trying out many things, I added a delay between reconnecting to the network and sending the first request. I ended up at ~2 seconds. Then I noticed that this is the time in which the emulator usually connects to the wifi (!) network after airplane mode. With the delay, it worked perfectly fine. Now let's have a look at what is happening without the delay (make sure cellular network and wifi are enabled, this is the default on emulators):

I assume that the network request was (sometimes) fired exactly in the timeframe between 3G being connected and wifi being connected. After the call was fired to the 3G network connection it just timed out as the emulator killed that interface (?) when switching over to wifi. The response from the server was sent but never reached the client through the 3G connection as is might only be ready to receive data over wifi.

This never happened on my Pixel phone during testing as wifi is still turned on in airplane mode on these devices. Samsung for example behaves the same as the emulator and turns on cellular first before it switched to wifi to this issue is also reproducible on these devices.

@swankjesse @yschimke I hope this helps to understand the actual issue here. Is there something that you could change inside OkHttp in this case?