Closed: rrfeng closed this issue 7 years ago.
Interesting analysis. Would love some help figuring out the root cause.
@swankjesse Thanks. I have no idea how to investigate this further.
And one more thing: all of these requests have different $request_id values, which are generated by Nginx.
I suspect this isn't a problem with OkHttp, but rather a case where some timeout tuning would help to deal with flaky connections.
Notice that the client is trying to get the same resource, and the requests are staggered by a few seconds. In a generic scenario, the client sends a request, doesn't hear back from the server, and then sends another. All the while, nginx is waiting for the client's request. It'd be good to check whether the logs that you posted are the 408 responses. It seems fishy that they all happen at the same moment, each with a different elapsed time. The event at 22:01:27 may be where the client tears down its TCP connections.
Some ideas to investigate on your side:
[1] http://nginx.org/en/docs/http/ngx_http_core_module.html#client_body_timeout
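On the OkHttp side, one low-effort experiment (a sketch only; the timeout values below are placeholders, not a recommendation from this thread) is to set the client's connect/read/write timeouts explicitly so they can be compared against nginx's client_body_timeout:

```java
import java.util.concurrent.TimeUnit;

import okhttp3.OkHttpClient;

public final class TimeoutTuningExample {
  public static void main(String[] args) {
    // Placeholder values: the point is only to make the client-side timeouts
    // explicit so they can be compared with nginx's client_body_timeout
    // (60s by default).
    OkHttpClient client = new OkHttpClient.Builder()
        .connectTimeout(10, TimeUnit.SECONDS)
        .writeTimeout(30, TimeUnit.SECONDS)
        .readTimeout(30, TimeUnit.SECONDS)
        .build();

    System.out.println("read timeout ms: " + client.readTimeoutMillis());
  }
}
```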
@EricEdens Thanks first. client_body_timeout is at its default value (60s). The request /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20 at 21:45:21 (twice, actually) got http_status_200. And what's the matter of the listening port? It is listen 80; listen 443 ssl http2; in the nginx configuration.
Thanks for the extra info @rrfeng.
Nix the request for the listening port; I was trying to tease out which TCP connection was used for which request, and I was actually thinking of the client's port, which wouldn't be available in the server logs. I don't know what I was thinking last night :)
If you could come up with a test that reproduces this, that'd be great!
I don't know how to log the client port in Nginx, but $connection should be useful for this purpose. I will add this variable to the access_log and comment later on whether the connection is the same for these requests.
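The same correlation can also be logged from the client side. This is a hypothetical debugging listener (names are mine, not from this issue), assuming OkHttp 3.9+ where EventListener is available:

```java
import okhttp3.Call;
import okhttp3.Connection;
import okhttp3.EventListener;
import okhttp3.OkHttpClient;

// Hypothetical debugging listener: prints the local port and protocol of the
// pooled connection each call runs on, so client-side logs can be matched
// against nginx's $connection and the client port.
public final class ConnectionLoggingListener extends EventListener {
  @Override public void connectionAcquired(Call call, Connection connection) {
    System.out.println(call.request().url()
        + " -> local port " + connection.socket().getLocalPort()
        + ", protocol " + connection.protocol());
  }

  public static OkHttpClient newClient() {
    return new OkHttpClient.Builder()
        .eventListener(new ConnectionLoggingListener())
        .build();
  }
}
```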
I still could not reproduce this problem manually, but HTTP/2 is running now and I can find this kind of log entry every 10 minutes.
http://nginx.org/en/docs/http/ngx_http_core_module.html#variables
Interesting, I will check this user's operations.
Time request_time connection request
April 28th 2017, 11:11:59.000 1,057.841 3255947367 /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 11:11:59.000 1,057.841 3255947367 /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:54:20.000 0.019 3255947367 /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:54:20.000 0.02 3255947367 /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:53:54.000 0.021 3255947366 /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:53:51.000 0.024 3255947364 /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:53:51.000 0.025 3255947360 /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:53:50.000 0.02 3255947367 /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:53:48.000 0.022 3255947363 /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
We have found a bug in our apps: the /search/v3/global API may be called several times at once. But apart from that situation, the problem still exists. I also think that network changes (such as wifi -> cellular), concurrency, and app background sleep may be related.
Thanks for the update. It sounds like we can close this issue?
(In the sense that the socket is changing under us. Or do you think OkHttp should be doing something different here?)
It may be a bug in Nginx, I don't know. If you'd like to close this issue, go ahead; I will reopen it if I find proof that OkHttp is at fault.
Yeah, please do investigate. OkHttp does retries on some calls, but this doesn’t seem like the culprit here.
If you see anything we can fix, please report back!
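A quick way to rule those automatic retries in or out (a sketch only, not a confirmed fix) is to build a client with them disabled and check whether the duplicate 408s still show up in nginx:

```java
import okhttp3.OkHttpClient;

public final class NoRetryClient {
  public static OkHttpClient create() {
    // If the duplicate 408s still appear in nginx with retries disabled, the
    // extra requests are coming from somewhere other than OkHttp's retry logic.
    return new OkHttpClient.Builder()
        .retryOnConnectionFailure(false)
        .build();
  }
}
```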
@rrfeng, have you found the reason for the 408 status, and how to solve it? Thanks.
I am facing the same issue: the server used to send a 408 after a request read timeout of 30 seconds, and this resulted in multiple repeated calls.
After increasing the read timeout to 290s, the issue changed to:
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="455" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878207"
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="453" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878209"
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="453" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878148"
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="455" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878246"
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="454" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878247"
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="454" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878229"
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="455" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878249"
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="454" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878248"
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="454" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878250"
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="454" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878251"
All of them have the same request time and the same timeout, but different connection IDs.
I am also having the same issue with my react-native Android app, which uses okhttp 3.8.1, when talking to nginx + HTTP/2. In our case, with probability less than 0.1%, it sends multiple duplicate requests to the server. When it happens, we see 128 duplicate requests in our access log within a second. The number is always 128 because our nginx HTTP/2 setup rejects more than 128 concurrent streams.
However, at the same time, we see a huge number of entries like the ones below in our nginx error log. In the worst case, it records a million lines of this error from the same IP within a few minutes. I'm afraid it means the user is sending a very large number of requests.
2018/11/15 07:06:04 [info] 8638#0: *1049430794 concurrent streams exceeded 128 while processing HTTP/2 connection, client: XX.XX.XX.XX, server: 0.0.0.0:443
2018/11/15 07:06:04 [info] 8638#0: *1049430794 concurrent streams exceeded 128 while processing HTTP/2 connection, client: XX.XX.XX.XX, server: 0.0.0.0:443
2018/11/15 07:06:04 [info] 8638#0: *1049430794 concurrent streams exceeded 128 while processing HTTP/2 connection, client: XX.XX.XX.XX, server: 0.0.0.0:443
After we disabled HTTP/2, the problem disappeared. The strange thing is that it only happens with POST and DELETE, even though our app sends more GET and OPTIONS requests than POST and DELETE. It seems more likely to happen with slow APIs (image uploading, for example).
@pacuum oooh, that’s helpful for debugging this. Perhaps we’re timing out client side and retrying on the same HTTP/2 connection?
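If that's what's happening, client-side timeouts should be visible just before the duplicates. A hypothetical listener along these lines (names and output format are mine, not from this thread) could be installed via OkHttpClient.Builder.eventListener(...) to line the client's timestamps up with the nginx log:

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

import okhttp3.Call;
import okhttp3.EventListener;

// Hypothetical debugging aid: flag calls that fail with a client-side timeout so
// their timestamps can be lined up with the 60-second gaps in the nginx log.
public final class TimeoutReportingListener extends EventListener {
  @Override public void callStart(Call call) {
    System.out.println("start   " + call.request().url());
  }

  @Override public void callFailed(Call call, IOException ioe) {
    String kind = (ioe instanceof SocketTimeoutException) ? "timeout" : "failure";
    System.out.println(kind + " " + call.request().url() + ": " + ioe.getMessage());
  }
}
```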
@swankjesse I picked some samples and checked the access log just before the problem happened, and it seems you're right. I saw a pattern: before the repeated requests happen, there is often (more than 80% of the time) a 60-second gap in the access log from the same IP, which looks like a timeout. But there are still a few cases with no such gap.
XX.XX.XX.XX - - [15/Nov/2018:12:17:26 +0700] "GET /whats_hot?paging_model[page_size]=40&paging_model[page_index]=1&platform=android HTTP/2.0" 200 24276 "-" "okhttp/3.8.1" "-" 0.618
XX.XX.XX.XX - - [15/Nov/2018:12:18:26 +0700] "POST /tutorials?platform=android HTTP/2.0" 408 0 "-" "okhttp/3.8.1" "-" -
XX.XX.XX.XX - - [15/Nov/2018:12:18:26 +0700] "POST /tutorials?platform=android HTTP/2.0" 408 0 "-" "okhttp/3.8.1" "-" -
XX.XX.XX.XX - - [15/Nov/2018:12:18:26 +0700] "POST /tutorials?platform=android HTTP/2.0" 408 0 "-" "okhttp/3.8.1" "-" -
@pacuum Exactly the same for us. After we enabled HTTP/2, we found some Android users sending around 128 requests in a second. We had to go back to HTTP/1.1.
Yikes. What HTTP/2 server are you using? Any interesting interceptors? Any other information you provide could be quite helpful in finding the root cause here.
@swankjesse nginx version: nginx/1.10.3 (Ubuntu). For all of these requests, $request_time is 60, $status is 408, $request_method is POST, and $body_bytes_sent is 0.
I have the same problem in JavaScript with Chrome on Android and nginx using HTTP/2. We downgraded to HTTP/1.1 and everything works well again.
I would say the problem is HTTP/2 on Android rather than nginx.
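For apps that cannot turn HTTP/2 off on the server, the equivalent client-side workaround is to restrict OkHttp to HTTP/1.1. A minimal sketch, assuming OkHttp 3.x:

```java
import java.util.Collections;

import okhttp3.OkHttpClient;
import okhttp3.Protocol;

public final class Http1OnlyClient {
  public static OkHttpClient create() {
    // Restricts this client to HTTP/1.1 so it never negotiates h2 with nginx,
    // mirroring the server-side "disable HTTP/2" workaround described above.
    return new OkHttpClient.Builder()
        .protocols(Collections.singletonList(Protocol.HTTP_1_1))
        .build();
  }
}
```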
We have an app on both Android and iOS; the Android app uses okhttp/3.6. We tried enabling HTTP/2 support on our API server (Nginx), which the apps connect to.
Everything was OK and most requests used HTTP/2, but something weird happened: in the nginx access_log I found some requests with a HUGE request_time, which could be more than 900s or 1000s, and all of these requests returned http_status_408. When I checked these users (or devices), all of these requests were sent after the users had stopped using the app, and most of the time these requests are duplicated.
I cannot reproduce this on my own devices or on any test devices in our company. But this should be a bug, because at the same time the iOS client was working fine with HTTP/2.
Here is the weird access_log: we can see that the user stopped using the app at 21:50:20 but more requests were sent; after 950 seconds, nginx logged these requests at 22:01:27.