square / okhttp

Square’s meticulous HTTP client for the JVM, Android, and GraalVM.
https://square.github.io/okhttp/
Apache License 2.0

Bug: HTTP/2 may send duplicate requests on Android devices. #3312

Closed rrfeng closed 7 years ago

rrfeng commented 7 years ago

We have an app on both Android and iOS; the Android version uses okhttp/3.6. We recently enabled HTTP/2 support on our API server (Nginx) that the apps connect to.

Everything was mostly fine and most requests used HTTP/2, but something weird happened: in the nginx access_log I found some requests with a HUGE request_time, sometimes more than 900 or 1000 seconds, and all of these requests returned http_status_408.

When I checked these users (or devices), all of these requests were sent after the user had stopped using the app, and most of the time they are duplicates.

I cannot reproduce this on my own devices or on any test devices in our company, but it looks like a bug because at the same time the iOS client was working fine with HTTP/2.

Here is the weird access_log: we can see that the user stopped using the app at 21:50:20, yet more requests were sent, and after roughly 950 seconds nginx logged them at 22:01:27.

Time            request_time    response_time   http2   status      request  
April 25th 2017, 22:23:25.000   0.021   0.021   h2  200 /messenger/v2/conversations/unread
April 25th 2017, 22:01:27.000   956.083 0.022   h2  408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 22:01:27.000   952.644 0.081   h2  408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 22:01:27.000   940.531 0.019   h2  408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 22:01:27.000   936.136 0.01    h2  408 /v1.1/upgrade/check
April 25th 2017, 22:01:27.000   942.604 0.019   h2  408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 22:01:27.000   945.977 0.023   h2  408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 22:01:27.000   952.787 0.001   h2  408 /search/v3/config/tabs
April 25th 2017, 22:01:27.000   942.729 0.024   h2  408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 22:01:27.000   960.572 0.021   h2  408 /training/v2/running/facade/config?type=3x
April 25th 2017, 22:01:27.000   936.136 0.017   h2  408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 22:01:27.000   950.519 0.033   h2  408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 22:01:27.000   946.369 0.017   h2  408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 22:01:27.000   956.083 0.023   h2  408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 21:50:20.000   0.001   0.001   h2  200 /v1.1/log/client?format=json&priority=normal
April 25th 2017, 21:45:28.000   0.004   0.004       200 /config/v2/basic
April 25th 2017, 21:45:21.000   0.022   0.022   h2  200 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 21:45:21.000   0.012   0.012   h2  200 /messenger/v2/conversations/unread
April 25th 2017, 21:45:21.000   0.019   0.019   h2  200 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 21:45:20.000   0.008   0.008   h2  200 /messenger/v2/conversations/unread
April 25th 2017, 21:45:20.000   0.071   0.071   h2  200 /account/v2/dashboard
April 25th 2017, 21:45:20.000   0.017   0.017   h2  200 /training/v2/event/outdoor
April 25th 2017, 21:45:20.000   0.003   0.003       204 /v1.1/ads/splash?type=android
April 25th 2017, 21:45:20.000   0.015   0.015   h2  200 /social/v2/rankinglist/brief?date=20170425
April 25th 2017, 21:45:20.000   0.009   0.009   h2  200 /v1.1/home/dashboard/statistics
swankjesse commented 7 years ago

Interesting analysis. Would love some help figuring out the root cause.

rrfeng commented 7 years ago

@swankjesse Thanks. I don't know how to investigate this further. One more thing: all of these requests have a different $request_id, which is generated by Nginx.
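
Since $request_id is generated by Nginx for each request it receives, it cannot tell an app-level duplicate call apart from a transport-level retransmission. A client-generated correlation header can. The sketch below is only a rough illustration (the X-Client-Request-Id header name and class name are made up): it uses an OkHttp application interceptor, which runs once per logical call, so duplicates that reach nginx with the same ID would point at the transport layer rather than at the app.

```java
import java.io.IOException;
import java.util.UUID;
import okhttp3.Interceptor;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

public final class RequestIdInterceptor implements Interceptor {
  @Override public Response intercept(Chain chain) throws IOException {
    // Stamp each logical call with a client-generated ID (the header name is made up).
    Request tagged = chain.request().newBuilder()
        .header("X-Client-Request-Id", UUID.randomUUID().toString())
        .build();
    return chain.proceed(tagged);
  }

  public static OkHttpClient newClient() {
    // As an application interceptor this runs once per call, so a transparent retry
    // inside OkHttp would reach the server with the same ID, while separate calls
    // made by the app would carry different IDs.
    return new OkHttpClient.Builder()
        .addInterceptor(new RequestIdInterceptor())
        .build();
  }
}
```

On the nginx side such a header would show up in the access_log via $http_x_client_request_id.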

EricEdens commented 7 years ago

I suspect this isn't a problem with OkHttp, but rather a case where some timeout tuning would help to deal with flaky connections.

Notice that the client is trying to get the same resource, and the requests are staggered by a few seconds. In a generic scenario, the client sends a request, doesn't hear back from the server, and then sends another. All the while, nginx is waiting for the client's request. It'd be good to check whether the logs you posted are the 408 responses. It seems fishy that they all happen at the same moment, each with a different elapsed time. The event at 22:01:27 may be where the client tears down its TCP connections.

Some ideas to investigate on your side:

  1. How long are your read timeouts on your nginx server? [1]
  2. What's your retry policy on the client?
  3. Can you update your nginx logs to include the http response code and the listening port?

[1] http://nginx.org/en/docs/http/ngx_http_core_module.html#client_body_timeout
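
For reference, the client-side counterparts to these questions live on OkHttpClient.Builder. A minimal sketch (the timeout values are placeholders, not recommendations) that keeps the client's timeouts below nginx's client_body_timeout and turns off OkHttp's transparent retry while investigating:

```java
import java.util.concurrent.TimeUnit;
import okhttp3.OkHttpClient;

public final class ClientConfig {
  public static OkHttpClient build() {
    return new OkHttpClient.Builder()
        .connectTimeout(10, TimeUnit.SECONDS)
        .readTimeout(30, TimeUnit.SECONDS)
        .writeTimeout(30, TimeUnit.SECONDS)
        .retryOnConnectionFailure(false) // disable OkHttp's own retry while debugging
        .build();
  }
}
```

With retryOnConnectionFailure(false), duplicates that still appear in the access log should help rule OkHttp's own retry logic in or out.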

rrfeng commented 7 years ago

@EricEdens Thanks first.

  1. client_body_timeout is at the default value: 60s.
  2. There is no retry policy for these APIs; for important APIs (such as data upload) we do have retry policies.
  3. I updated the logs to include the HTTP response code, and the request /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20 at 21:45:21 (twice, actually) got http_status_200. Also, why does the listening port matter? It is listen 80; listen 443 ssl http2; in the nginx configuration.
EricEdens commented 7 years ago

Thanks for the extra info @rrfeng.

Nix the request for the listening port; I was trying to tease out which TCP connection was used for which request, and I was actually thinking of the client's port, which wouldn't be available in the server logs. I don't know what I was thinking last night :)

If you could come up with a test that reproduces this, that'd be great!

rrfeng commented 7 years ago

I don't know how to log the client port in Nginx, but $connection should be useful for this purpose. I will add this variable to the access_log and comment later on whether the connection is the same for these requests. I still cannot reproduce this problem manually, but HTTP/2 is running in production now and I can find this kind of log entry every 10 minutes.

http://nginx.org/en/docs/http/ngx_http_core_module.html#variables

rrfeng commented 7 years ago

Interesting, I will check this user's operations.

Time    request_time    connection      request  
April 28th 2017, 11:11:59.000   1,057.841   3255947367  /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 11:11:59.000   1,057.841   3255947367  /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:54:20.000   0.019   3255947367  /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:54:20.000   0.02    3255947367  /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:53:54.000   0.021   3255947366  /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:53:51.000   0.024   3255947364  /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:53:51.000   0.025   3255947360  /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:53:50.000   0.02    3255947367  /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:53:48.000   0.022   3255947363  /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
rrfeng commented 7 years ago

We have found a bug in our app: the /search/v3/global API may be called several times at once. But even apart from that, the problem still exists. I suspect network changes (such as wifi -> cellular), concurrency, or the app sleeping in the background may be related.
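
Independent of the HTTP/2 question, the app-level duplication described here (the same search fired several times at once) can be reduced by cancelling the in-flight call before enqueueing a new one. A rough sketch, with made-up class and method names:

```java
import okhttp3.Call;
import okhttp3.Callback;
import okhttp3.OkHttpClient;
import okhttp3.Request;

public final class SearchRequester {
  private final OkHttpClient client = new OkHttpClient();
  private Call inFlight; // the current /search/v3/global call, if any

  public synchronized void search(String url, Callback callback) {
    // Cancel the previous search before starting a new one, so rapid repeated
    // triggers in the UI produce at most one outstanding request.
    if (inFlight != null) {
      inFlight.cancel();
    }
    inFlight = client.newCall(new Request.Builder().url(url).build());
    inFlight.enqueue(callback);
  }
}
```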

EricEdens commented 7 years ago

Thanks for the update. It sounds like we can close this issue?


EricEdens commented 7 years ago

(In the sense that the socket is changing under us. Or do you think OkHttp should be doing something different here?)

rrfeng commented 7 years ago

It may be a bug in Nginx, I don't know. If you would like to close this issue, go ahead. I will reopen it if I find proof that OkHttp is at fault.

swankjesse commented 7 years ago

Yeah, please do investigate. OkHttp does retries on some calls, but this doesn’t seem like the culprit here.

If you see anything we can fix, please report back!

qiniuhang commented 6 years ago

@rrfeng, have you found the reason for the 408 status and how to solve it? Thanks.

siddharthg commented 6 years ago

I am facing the same issue: the server used to send a 408 after a request read timeout of 30 seconds, and this resulted in multiple repeated calls.

After increasing the read timeout to 290s, the issue changed to:

[22/Oct/2018:12:07:27 +0000]  "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="455" resp="0" ua="okhttp/3.11.0"   rt="290.000" uct="-" uht="-" urt="-" "15878207"
[22/Oct/2018:12:07:27 +0000]  "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="453" resp="0" ua="okhttp/3.11.0"   rt="290.000" uct="-" uht="-" urt="-" "15878209"
[22/Oct/2018:12:07:27 +0000]  "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="453" resp="0" ua="okhttp/3.11.0"   rt="290.000" uct="-" uht="-" urt="-" "15878148"
[22/Oct/2018:12:07:27 +0000]  "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="455" resp="0" ua="okhttp/3.11.0"   rt="290.000" uct="-" uht="-" urt="-" "15878246"
[22/Oct/2018:12:07:27 +0000]  "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="454" resp="0" ua="okhttp/3.11.0"   rt="290.000" uct="-" uht="-" urt="-" "15878247"
[22/Oct/2018:12:07:27 +0000]  "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="454" resp="0" ua="okhttp/3.11.0"   rt="290.000" uct="-" uht="-" urt="-" "15878229"
[22/Oct/2018:12:07:27 +0000]  "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="455" resp="0" ua="okhttp/3.11.0"   rt="290.000" uct="-" uht="-" urt="-" "15878249"
[22/Oct/2018:12:07:27 +0000]  "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="454" resp="0" ua="okhttp/3.11.0"   rt="290.000" uct="-" uht="-" urt="-" "15878248"
[22/Oct/2018:12:07:27 +0000]  "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="454" resp="0" ua="okhttp/3.11.0"   rt="290.000" uct="-" uht="-" urt="-" "15878250"
[22/Oct/2018:12:07:27 +0000]  "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="454" resp="0" ua="okhttp/3.11.0"   rt="290.000" uct="-" uht="-" urt="-" "15878251"

All of them have the same request time and the same timeout, but a different connection ID.

pacuum commented 6 years ago

I am also having the same issue with my react-native Android app, which uses okhttp 3.8.1 when talking to nginx + HTTP/2. In our case, with probability less than 0.1%, it sends multiple requests to the server. When it happens, we see 128 duplicate requests in our access log within a second. The number is always 128 because our nginx HTTP/2 configuration rejects more than 128 concurrent requests.

At the same time, we see a gigantic number of log lines like the ones below in our nginx error log. In the worst case, it records a million lines of this error from the same IP within a few minutes. I'm afraid this means the user is sending a very large number of requests.

2018/11/15 07:06:04 [info] 8638#0: *1049430794 concurrent streams exceeded 128 while processing HTTP/2 connection, client: XX.XX.XX.XX, server: 0.0.0.0:443
2018/11/15 07:06:04 [info] 8638#0: *1049430794 concurrent streams exceeded 128 while processing HTTP/2 connection, client: XX.XX.XX.XX, server: 0.0.0.0:443
2018/11/15 07:06:04 [info] 8638#0: *1049430794 concurrent streams exceeded 128 while processing HTTP/2 connection, client: XX.XX.XX.XX, server: 0.0.0.0:443

After disabling HTTP/2, the problem disappeared. The strange thing is that it only happens with POST and DELETE, even though our app sends more GET and OPTIONS requests than POST and DELETE. It seems more likely to happen with slow APIs (image uploading, for example).
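
The burst size of 128 lines up with nginx's http2_max_concurrent_streams, which defaults to 128. On the client side, OkHttp's Dispatcher can cap how many asynchronous calls are in flight at once; this does not explain the duplicates, but it bounds a runaway burst. A minimal sketch with assumed limit values:

```java
import okhttp3.Dispatcher;
import okhttp3.OkHttpClient;

public final class BoundedClient {
  public static OkHttpClient build() {
    // These limits apply to calls made with enqueue(); the values below are
    // illustrative, chosen to stay well under nginx's 128-stream limit.
    Dispatcher dispatcher = new Dispatcher();
    dispatcher.setMaxRequests(32);        // total concurrent calls
    dispatcher.setMaxRequestsPerHost(8);  // concurrent calls per host
    return new OkHttpClient.Builder()
        .dispatcher(dispatcher)
        .build();
  }
}
```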

swankjesse commented 6 years ago

@pacuum oooh, that’s helpful for debugging this. Perhaps we’re timing out client side and retrying on the same HTTP/2 connection?
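
One way to probe this hypothesis from the client side is an EventListener (available in OkHttp releases newer than the 3.8.1 used above) that logs when a call starts, acquires a connection, and fails: a single logical call that keeps re-acquiring the same HTTP/2 connection after timeouts would support the retry theory. A rough sketch:

```java
import java.io.IOException;
import okhttp3.Call;
import okhttp3.Connection;
import okhttp3.EventListener;
import okhttp3.OkHttpClient;

public final class DiagnosticListener extends EventListener {
  @Override public void callStart(Call call) {
    log(call, "callStart");
  }

  @Override public void connectionAcquired(Call call, Connection connection) {
    // If one logical call acquires the same HTTP/2 connection repeatedly after
    // timeouts, that would match the "retry on the same connection" theory.
    log(call, "connectionAcquired " + connection);
  }

  @Override public void callFailed(Call call, IOException ioe) {
    log(call, "callFailed: " + ioe);
  }

  private static void log(Call call, String event) {
    System.out.println(call.request().url() + " " + event);
  }

  public static OkHttpClient build() {
    return new OkHttpClient.Builder()
        .eventListener(new DiagnosticListener())
        .build();
  }
}
```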

pacuum commented 6 years ago

@swankjesse I picked some samples and checked the access log just before the problem happened, and it seems you're right. I saw a pattern: before the repeated requests happen, there is often (more than 80% of the time) a 60-second gap in the access log from the same IP, which looks like a timeout. But there are still a few cases with no such gap.

XX.XX.XX.XX - - [15/Nov/2018:12:17:26 +0700] "GET /whats_hot?paging_model[page_size]=40&paging_model[page_index]=1&platform=android HTTP/2.0" 200 24276 "-" "okhttp/3.8.1" "-" 0.618
XX.XX.XX.XX - - [15/Nov/2018:12:18:26 +0700] "POST /tutorials?platform=android HTTP/2.0" 408 0 "-" "okhttp/3.8.1" "-" -
XX.XX.XX.XX - - [15/Nov/2018:12:18:26 +0700] "POST /tutorials?platform=android HTTP/2.0" 408 0 "-" "okhttp/3.8.1" "-" -
XX.XX.XX.XX - - [15/Nov/2018:12:18:26 +0700] "POST /tutorials?platform=android HTTP/2.0" 408 0 "-" "okhttp/3.8.1" "-" -
duxinxiao commented 5 years ago

@pacuum Exactly the same as you. After we enabled HTTP/2, we found that some Android users send around 128 requests in a second. We had to go back to HTTP/1.1.
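
If falling back is the chosen workaround, it can also be done per app instead of server-wide by restricting OkHttp's protocol list to HTTP/1.1. A minimal sketch:

```java
import java.util.Arrays;
import okhttp3.OkHttpClient;
import okhttp3.Protocol;

public final class Http1OnlyClient {
  // Restricting the client to HTTP/1.1 has the same effect as turning HTTP/2
  // off in nginx, but only for this app, so the server keeps h2 for other clients.
  public static OkHttpClient build() {
    return new OkHttpClient.Builder()
        .protocols(Arrays.asList(Protocol.HTTP_1_1))
        .build();
  }
}
```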

swankjesse commented 5 years ago

Yikes. What HTTP/2 server are you using? Any interesting interceptors? Any other information you provide could be quite helpful in finding the root cause here.

duxinxiao commented 5 years ago

@swankjesse nginx version: nginx/1.10.3 (Ubuntu). For all of these requests, $request_time is 60, $status is 408, $request_method is POST, and $body_bytes_sent is 0.

Serginho commented 5 years ago

I have the same problem in JavaScript with Chrome on Android and nginx using HTTP/2. We downgraded to HTTP/1.1 and everything works well again.

I would say the problem is with HTTP/2 on Android rather than with nginx.