lpereira / lwan

Experimental, scalable, high performance HTTP server
https://lwan.ws
GNU General Public License v2.0

Responds very slowly when POST body is larger than 1024 bytes #287

Closed fangxinmiao closed 4 years ago

fangxinmiao commented 4 years ago

For example, the following command is very fast:

curl -i -X POST "http://172.30.8.82:8080/world" -d "appname=foo&appkey=bar"

But if I increase the POST body size beyond 1024 bytes, it is very slow. For example:

curl -i -X POST "http://localhost:8080/world" -d "appname=foo&appkey=bar&algo=1234567890123456789012345678901234567890&prikey=MIICdwIBADANBgkqhkiG9w0BAQEFAASCAmEwggJdAgEAAoGBAMci8EEQm6kRyIXmD3/ckp98JZjS4QT2CXAVygr4i6EaZhm6pFQlWSujAuuG9QnbskGf3bcKjdysPs622wmtrJR6Bl330JxmJixS+FJyi18J7G6x99pn4NiExiaNlQXM7WMKh1308AYRK4qqfkwTLgYaGo7GgqVoXIS2tgYw+nWPAgMBAAECgYBWPuuPQoIoV2thkqJj3kQijsI5zhtm6qhrwwQefEb9H1HZEtzZFNPp8k/xciHugxhRE8wS2p9WU8NgBBIZvv5fDGaLbHAO2cEtJoO06S4mQKw6ZfRvrUJaNVp5mNasjwA8QfXhawmr3zF13MK4D5qwmc5UqrnU03OAI2YQ+F/tAQJBAOIe+rWQV+Zh2Fv3415WqMe0rstwIohpHYjefGn+R3fM+d6haRxodt6ZTfNWpwkgf4xPKWq8jwCSvr8WJkxZy50CQQDhcyZ6kka4MMjIq+lVsYrVPm6i3GAegutB2fcGSsodTxCMIas9u5d8xNfcc/B9nh533po8wuq0e/WrqcrWnywbAkA02DCXRgiE2tEJjUEhxvbi1SMUgDxRzN8OIZDJ5JcnNEwg72/1J6TAGOvStqmKxIGKdOWlADQbWjAh0DdbGJExAkEA2kKXkvyAGJuqFpRP7lVWugGaLHM7Xu6SAYuZ2e6X424C8Bc8aCRvkSwkatmXjpWAm73XBlIR1S9Y9JsJs2Qf1wJBAIirbU5JXVedimYaahxGRNdOFlj0sdUOrXa8sEYrGrTcRZjxpcP+ph0XdpzhexN2fQc2vllP/wDDT5EowN7fZjo=&data=MTIzNDU2Nzg5MDEyMzQ1Njc4OTAxMjM0NTY3ODkwMTIzNDU2Nzg5MDEyMzQ1Njc4OTAxMjM0NTY3ODkwMTIzNDU2Nzg5MGFiY2RlZmc="

I tried modifying DEFAULT_BUFFER_SIZE and CORO_BUMP_PTR_ALLOC_SIZE, but it didn't help.

lpereira commented 4 years ago

Can't reproduce here.

I'm testing with testrunner's /hello endpoint, with ?dump_vars=1 as the query string to force it to parse the request body, and it's pretty much instantaneous. Looking at strace, it also looks fine: it reads the request up to and including DEFAULT_BUFFER_SIZE bytes, then immediately makes another call to read() for the remainder of the request body; it then proceeds to send the response without skipping a beat. Running curl under time with a short body or a long body doesn't seem to make much difference, if any.
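
For illustration, a minimal sketch of the read pattern described above, with hypothetical names and a blocking fd for brevity (Lwan's real code is coroutine-based and non-blocking):

#include <unistd.h>

/* Hypothetical helper: read the first segment, then keep calling read()
 * until the expected total (headers + Content-Length) has arrived. */
static ssize_t read_full_request(int fd, char *buf, size_t cap, size_t expected)
{
    size_t total = 0;

    while (total < expected && total < cap) {
        ssize_t r = read(fd, buf + total, cap - total);
        if (r <= 0)
            return -1; /* error, or peer closed early */
        total += (size_t)r;
    }
    return (ssize_t)total;
}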

Since I can't reproduce here, I would appreciate you running your test application, built in debug mode, under strace with at least -Tfkyy as parameters, and posting the logs from when the request reaches Lwan up to and including when it's finally serviced. This should help me find at least where things are taking more time than they should. If you need help with strace, just ask.

Also, are you using a recent enough Lwan? There have been some changes in that general vicinity that might have affected this. If it's a recent Lwan and you can't reproduce with an older Lwan (say, from a few months ago), I'd suggest bisecting to pinpoint the problematic commit.

fangxinmiao commented 4 years ago

I tried several clients: curl, Java HttpClient, Python, Postman, etc.; only curl has this problem. When the POST body is larger than 1024 bytes, it calls coro_yield(request->conn->coro, CONN_CORO_WANT_READ); in the client_read() function in lwan-request.c. I checked out some versions from a few months ago, and they have the same problem.
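
For reference, a simplified sketch of that yield-on-EAGAIN pattern (not Lwan's actual code; it assumes lwan.h provides struct lwan_request, coro_yield(), and CONN_CORO_WANT_READ, and it omits buffer management):

#include <errno.h>
#include <unistd.h>
#include "lwan.h" /* assumed: struct lwan_request, coro_yield(), CONN_CORO_WANT_READ */

static ssize_t read_body(struct lwan_request *request, int fd, char *buf, size_t want)
{
    size_t have = 0;

    while (have < want) {
        ssize_t n = read(fd, buf + have, want - have);

        if (n < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                /* No data queued yet: suspend this coroutine until the
                 * event loop sees the fd become readable again. */
                coro_yield(request->conn->coro, CONN_CORO_WANT_READ);
                continue;
            }
            return -1; /* real error */
        }
        if (n == 0)
            break; /* peer closed the connection */

        have += (size_t)n;
    }
    return (ssize_t)have;
}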

lpereira commented 4 years ago

Can you run lwan with "strace -o/tmp/lwan.log -f /path/to/lwan" and post the last few hundred lines of the log? There are multiple occasions where it might yield and I'd like to know why.

lpereira commented 4 years ago

I think I found the issue and committed a tentative fix; could you please retry with the latest commit and report back if it actually addresses the issue?

fangxinmiao commented 4 years ago

The fix didn't solve the problem.

lpereira commented 4 years ago

I'm stumped, then. Since I can't reproduce here, I'll need the strace logs with the timing information as requested above. I'll probably need more things later if that ends up not helping.

lpereira commented 4 years ago

If there was a previous version of Lwan that didn't show this problem, I'd appreciate knowing which revision it was, too. Even better if you could help me bisect this (using git bisect).

fangxinmiao commented 4 years ago

I am not sure which commit introduced this issue; I'm using the version at commit 03e51492605d97ffb8dffd70c600f8c540a9ef19. I've put the demo source here: https://dl.fangxinmiao.work/lwan-demo.tar.gz. You can run run-curl.sh to verify it. In my environment (CentOS 7.3, GCC 4.8.5, curl 7.29.0), the result is:

fangxm# ./run-curl.sh 
used(ms): 12
used(ms): 13
used(ms): 13
used(ms): 12
used(ms): 13
used(ms): 12
used(ms): 13
used(ms): 13
used(ms): 13
used(ms): 13

used(ms): 1014
used(ms): 1015
used(ms): 1014
used(ms): 1015
used(ms): 1014
used(ms): 1015
used(ms): 1015
used(ms): 1015
used(ms): 1015
used(ms): 1015

lpereira commented 4 years ago

03e5149260 doesn't include the change I mentioned here; did you apply a07f9491 before testing?

(I'll try installing this CentOS on a VM over the weekend to investigate the issue and see if I can reproduce it here. Thanks for providing the script to reproduce the issue)

lpereira commented 4 years ago

I tried your reproducer with a slightly newer CentOS (7.8, as 7.3 isn't easily available), and Lwan doesn't even build with a GCC that old; it requires some simple patches to build and run on a kernel that old. Even then, I still could not reproduce the issue, testing both over localhost and over the network.

Since I can't test locally, could you please provide the strace log? strace -fkTttyy -o/tmp/lwan.log /path/to/lwan should give me some insight as to what is happening.

fangxinmiao commented 4 years ago

I've uploaded the log files here: https://dl.fangxinmiao.work/lwan-strace-log.tar.gz. lwan-fast.log is the log for the fast replies, and lwan-slow.log is the log for the very slow replies. I sent a total of 10 requests.

lpereira commented 4 years ago

Looking at the syscall trace: although you're indeed correct that POST requests sometimes take a long time to complete, Lwan seems to be working as designed here, and this might be a problem in the client code.

First, it reads the request (177 bytes) and determines that it needs to read the request body because it wasn't sent in the same segment, which it then immediately tries to do:

13:24:14.243416 read(10<TCP:[127.0.0.1:8080->127.0.0.1:57612]>, "POST /hello HTTP/1.1\r\nUser-Agent"..., 4095) = 177 <0.000012>
13:24:14.243541 read(10<TCP:[127.0.0.1:8080->127.0.0.1:57612]>, 0x7f1d7c00bed0, 1025) = -1 EAGAIN (Resource temporarily unavailable) <0.000008>

But there's nothing in the socket yet (-EAGAIN), so it yields... until epoll reports there's something to read, ~1s later:

13:24:14.243665 epoll_wait(5<anon_inode:[eventpoll]>, [], 1024, 971) = 0 <0.972041>
13:24:15.216022 epoll_wait(5<anon_inode:[eventpoll]>, [], 1024, 12) = 0 <0.012116>
13:24:15.228497 epoll_wait(5<anon_inode:[eventpoll]>, [{EPOLLIN, {u32=10337152, u64=10337152}}], 1024, 946) = 1 <0.014310>

The first two calls to epoll_wait() with short timeouts are due to how timers are processed in Lwan. The important thing to focus on here is the 3rd call: the events array contains one EPOLLIN event, which is promptly processed by Lwan, reading 1025 bytes in a single call:

13:24:15.243133 read(10<TCP:[127.0.0.1:8080->127.0.0.1:57612]>, "appname=foo&appkey=bar&algo=sha1"..., 1025) = 1025 <0.000008>

And then it'll immediately proceed to send out the response:

13:24:15.243265 sendto(10<TCP:[127.0.0.1:8080->127.0.0.1:57612]>, "HTTP/1.1 200 OK\r\nContent-Length:"..., 184, 0, NULL, 0) = 184 <0.000021>

You might want to capture an strace log from Curl as well, with the timing information, so we can understand what's going on here. Curl 7.29 is from 2013, so there's a good chance that if this was a bug in it, it's been fixed already (you might want to try building a newer version and testing with it to confirm the theory).

When you compare with the "fast" requests, the picture is different:

13:20:53.984328 read(10<TCP:[127.0.0.1:8080->127.0.0.1:57574]>, "POST /hello HTTP/1.1\r\nUser-Agent"..., 4095) = 1179 <0.000034>
13:20:53.984484 sendto(10<TCP:[127.0.0.1:8080->127.0.0.1:57574]>, "HTTP/1.1 200 OK\r\nContent-Length:"..., 184, 0, NULL, 0) = 184 <0.000039>

This clearly shows that Curl was able to send the whole request in a single segment, so Lwan was able to parse it, determine everything was all right, and then send out the response immediately, without having to wait for the socket to become readable again.
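
For context, a generic sketch of the event-loop shape implied by the trace: the epoll_wait() timeout is derived from pending timers, which is why short-timeout calls appear before the one that returns the EPOLLIN event. All helper names here are hypothetical, not Lwan's actual scheduler:

#include <sys/epoll.h>

/* Hypothetical helpers, declared for the sketch only. */
int next_timer_expiry_ms(void);
void process_expired_timers(void);
void resume_connection(void *conn);

void event_loop(int epoll_fd)
{
    struct epoll_event events[1024];

    for (;;) {
        /* Sleep only until the next timer is due, not indefinitely;
         * this produces epoll_wait() calls with short timeouts even
         * when no fd events arrive. */
        int n = epoll_wait(epoll_fd, events, 1024, next_timer_expiry_ms());

        process_expired_timers(); /* runs even when n == 0 */

        for (int i = 0; i < n; i++) {
            /* Resume whatever was suspended waiting on this fd; in the
             * trace above, this is where the pending read() succeeds. */
            resume_connection(events[i].data.ptr);
        }
    }
}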

lpereira commented 4 years ago

I was able to reproduce this, and, indeed, that version of Curl is sending the request body in a second send(), ~1s after it sends the request headers:

1677  22:15:57.820888 sendto(4, "POST /hello\\?dump_vars=1 HTTP/1."..., 190, MSG_NOSIGNAL, NULL, 0) = 190 <0.000052>
1677  22:15:57.820985 poll([{fd=4, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout) <0.000007>
1677  22:15:57.821021 poll([{fd=4, events=POLLIN}], 1, 1000) = 0 (Timeout) <1.001159>
1677  22:15:58.822327 poll([{fd=4, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout) <0.000028>
1677  22:15:58.822487 poll([{fd=4, events=POLLIN}, {fd=4, events=POLLOUT}], 2, 1000) = 1 ([{fd=4, revents=POLLOUT}]) <0.000027>
1677  22:15:58.822617 poll([{fd=4, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}, {fd=4, events=POLLOUT|POLLWRNORM}], 2, 0) = 1 ([{fd=4, revents=POLLOUT|POLLWRNORM}]) <0.000025>
1677  22:15:58.822755 sendto(4, "a=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 2002, MSG_NOSIGNAL, NULL, 0) = 2002 <0.000179>

It only did that because the last call to poll() right before the sendto() call had a POLLOUT revent.

A more recent version of Curl seems to avoid this by making a single sendto() call with the whole request:

274032 19:35:18 sendto(5, "POST /beacon HTTP/1.1\r\nHost: 192"..., 4619, MSG_NOSIGNAL, NULL, 0) = 4619 <0.000036>

So it doesn't block waiting on poll() like the version from CentOS 7 does. I haven't tested with larger requests.

At this point, I don't know if this is the fault of the old version of Curl, Lwan, or the old, heavily patched kernel.
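
One plausible explanation, though the thread doesn't confirm it: old curl sends an Expect: 100-continue header for POST bodies larger than 1024 bytes and waits up to a second for the server's interim response before sending the body, which would match both the 1024-byte threshold and the ~1s stall seen above. Under that assumption, a libcurl client can suppress the handshake by sending an empty Expect header (the command-line equivalent is curl -H "Expect:"); this is a hypothesis-driven sketch, not a confirmed fix:

#include <curl/curl.h>

/* Build with: cc client.c -lcurl */
int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);

    CURL *h = curl_easy_init();
    if (!h)
        return 1;

    /* Assumption: the stall is the Expect: 100-continue wait. An empty
     * value removes the header, so headers and body go out back to back. */
    struct curl_slist *hdrs = curl_slist_append(NULL, "Expect:");

    curl_easy_setopt(h, CURLOPT_URL, "http://localhost:8080/world");
    curl_easy_setopt(h, CURLOPT_POSTFIELDS, "appname=foo&appkey=bar" /* ... */);
    curl_easy_setopt(h, CURLOPT_HTTPHEADER, hdrs);

    CURLcode rc = curl_easy_perform(h);

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(h);
    curl_global_cleanup();
    return rc != CURLE_OK;
}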

lpereira commented 4 years ago

As this doesn't seem to be a bug in Lwan, I'm closing this issue. Please feel free to reopen if you end up finding evidence to the contrary (or open another issue).