Ryan-A-B opened this issue 3 years ago (status: open).
Interesting test case. I'm not sure what is going on here. I can see that `BenchmarkFileWithContentLength` uses the `sendfile` system call while `BenchmarkFileWithoutContentLength` does not, but I don't know why it would lead to such a significant difference in benchmark time.

If I use a much larger file, rather than the `main_test.go` file, then `BenchmarkFileWithContentLength` is faster. In that case `BenchmarkFileWithoutContentLength` will be making more system calls.

Perhaps `sendfile` is not a great choice for short files, but I don't know why the difference would be so dramatic.
The problem is in `chunkWriter.writeHeader`: when a `Content-Length` header is detected, `chunkWriter.chunking` won't be set. That causes `response.ReadFrom` to use `net.Conn.ReadFrom` instead of wrapping the response with `writerOnly`.

The fix could be to enable chunking even when a `Content-Length` header is detected, but I'm not sure about it.

Any ideas @bradfitz @odeke-em?
On macOS I am getting both benchmarks close to each other:

```
goos: darwin
goarch: amd64
BenchmarkFileWithoutContentLength-8   11030   106178 ns/op
BenchmarkFileWithContentLength-8      10000   106606 ns/op
PASS
```
@darkLord19 That's because macOS doesn't have `sendfile`.

@cuonglm Thanks for spotting where the behavior changes. I don't think it would be right to always enable chunking because, as noted above, using `sendfile` is actually faster when using a larger file. I think we need to understand why `sendfile` is so much slower for short files.

CC @acln0 for any thoughts.
When I tried to reproduce on my machine, I got this:

```
forge:http-sendfile-bench $ go test -bench .
goos: linux
goarch: amd64
pkg: acln.ro/http-sendfile-bench
BenchmarkFileWithoutContentLength-4   16884   70872 ns/op
BenchmarkFileWithContentLength-4      18878   62345 ns/op
PASS
ok      acln.ro/http-sendfile-bench     3.743s
```

... which looks quite reasonable. I don't know what is going on here.
I have a theory: I notice that the benchmark calls `client.Get`, but never closes `response.Body`. It is my understanding that this can cause a file descriptor leak. Perhaps the first round of benchmarks leaves just enough file descriptors hanging around that by the time we get to the second one, the process is reaching a file descriptor limit, and the server enters this portion of its internal `Accept` loop: https://github.com/golang/go/blob/617f2c3e35cdc8483b950aa3ef18d92965d63197/src/net/http/server.go#L2962-L2976
The server sleeping for extended periods of time may account for the sendfile code path looking really really slow. I cannot imagine sendfile needing ~50ms to send a small file otherwise.
I have not been able to confirm my hypothesis, but I would kindly ask @Ryan-A-B to change the inner loop of the benchmark to this:
```go
for i := 0; i < b.N; i++ {
	response, err := client.Get(server.URL)
	if err != nil {
		log.Fatal(err.Error())
	}
	_, err = ioutil.ReadAll(response.Body)
	if err != nil {
		log.Fatal(err.Error())
	}
	response.Body.Close()
}
```
and try again. Perhaps with this change, the benchmark would have more sensible results.
@acln0 adding `response.Body.Close()` made no difference; however, pointing at a smaller file did. It seems that files over a certain size experience the slowdown when stating `Content-Length`. I think that makes sense based on @cuonglm's findings.
I don't mean we should always enable chunking, but that we can still enable chunking when a `Content-Length` header is detected. The current code that enables chunking runs only when `hasCL` is `false`. But after that code there is another check, which seems useless, since chunking is only set when no `Content-Length` header was present. So I propose:
```diff
diff --git a/src/net/http/server.go b/src/net/http/server.go
index 25fab288f2..909adeb67f 100644
--- a/src/net/http/server.go
+++ b/src/net/http/server.go
@@ -1429,17 +1429,17 @@ func (cw *chunkWriter) writeHeader(p []byte) {
 	if w.req.Method == "HEAD" || !bodyAllowedForStatus(code) {
 		// do nothing
-	} else if code == StatusNoContent {
+	} else if code == StatusNoContent || hasCL {
 		delHeader("Transfer-Encoding")
-	} else if hasCL {
-		delHeader("Transfer-Encoding")
-	} else if w.req.ProtoAtLeast(1, 1) {
+	}
+
+	if w.req.ProtoAtLeast(1, 1) {
 		// HTTP/1.1 or greater: Transfer-Encoding has been set to identity, and no
 		// content-length has been provided. The connection must be closed after the
 		// reply is written, and no chunking is to be done. This is the setup
 		// recommended in the Server-Sent Events candidate recommendation 11,
 		// section 8.
-		if hasTE && te == "identity" {
+		if !hasCL && hasTE && te == "identity" {
 			cw.chunking = false
 			w.closeAfterReply = true
 		} else {
```
Note: the slow performance comes from this line: https://github.com/golang/go/blob/23573d0ea225d4b93ccd2b946b1de121c3a6cee5/src/net/http/server.go#L616
@Ryan-A-B Just to make sure we are on the same page, I want to clarify that what I see is that larger files are faster in `BenchmarkFileWithContentLength`. Smaller files are faster in `BenchmarkFileWithoutContentLength`.

Running `strace -ftt`, it seems to me that it takes an unreasonably long time for `epoll_pwait` to return after `sendfile` is used to send data on the socket. Relevant lines from the `strace -ftt` output:
```
[pid 1686887] 15:12:57.528572 sendfile(8, 9, NULL, 4194304 <unfinished ...>
[pid 1686887] 15:12:57.528647 <... sendfile resumed> ) = 1080
[pid 1686887] 15:12:57.529057 close(9 <unfinished ...>
[pid 1686887] 15:12:57.529131 <... close resumed> ) = 0
[pid 1686894] 15:12:57.529327 epoll_pwait(5,  <unfinished ...>
[pid 1686894] 15:12:57.736463 <... epoll_pwait resumed> [{EPOLLIN|EPOLLOUT, {u32=3680355512, u64=139847815513272}}], 128, -1, NULL, 0) = 1
```
Note the huge jump from `57.529327` to `57.736463` before `epoll_pwait` returns. During that interval strace reports nothing of interest. Why does it take `epoll_pwait` so long to return?

My Linux kernel version is 5.7.17.
Stracing the client and server separately shows there is a 200ms delay between `sendfile` and `recvfrom`.

Client:

```
07:22:16.141964 connect(3, {sa_family=AF_INET, sin_port=htons(8081), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
07:22:16.142104 poll([{fd=3, events=POLLOUT|POLLWRNORM}], 1, 0) = 1 ([{fd=3, revents=POLLOUT|POLLWRNORM}])
07:22:16.142381 sendto(3, "GET / HTTP/1.1\r\nHost: 127.0.0.1:"..., 78, MSG_NOSIGNAL, NULL, 0) = 78
07:22:16.142458 poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
07:22:16.142511 poll([{fd=3, events=POLLIN}], 1, 199) = 1 ([{fd=3, revents=POLLIN}])
07:22:16.142964 poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1 ([{fd=3, revents=POLLIN|POLLRDNORM}])
07:22:16.143005 recvfrom(3, "HTTP/1.1 200 OK\r\nContent-Length:"..., 102400, 0, NULL, NULL) = 631
07:22:16.143215 poll([{fd=3, events=POLLIN}], 1, 198) = 0 (Timeout)
07:22:16.342150 poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
07:22:16.342346 poll([{fd=3, events=POLLIN}], 1, 0) = 0 (Timeout)
07:22:16.342487 poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
07:22:16.342583 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLIN}])
07:22:16.348691 poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1 ([{fd=3, revents=POLLIN|POLLRDNORM}])
07:22:16.348754 recvfrom(3, "llback func(response http.Respon"..., 959, 0, NULL, NULL) = 959
```

Server:

```
[pid  7073] 07:22:16.142734 write(4, "HTTP/1.1 200 OK\r\nContent-Length:"..., 631) = 631
[pid  7073] 07:22:16.142825 sendfile(4, 6, NULL, 4194304) = 959
[pid  7073] 07:22:16.142864 sendfile(4, 6, NULL, 4194304) = 0
```
FYI, I got this on my Windows PC:

```
goos: windows
goarch: amd64
pkg: github.com/mattn/http_bench
BenchmarkFileWithoutContentLength-4   679   1651679 ns/op
BenchmarkFileWithContentLength-4      727   1602470 ns/op
PASS
```
@ianlancetaylor up to 512 bytes they're equal; from 513 bytes up, `BenchmarkFileWithContentLength` starts out as the slow version and becomes the fast version, crossing over at about 1.2MB:
| File size | BenchmarkFileWithoutContentLength-8 | BenchmarkFileWithContentLength-8 |
| --- | --- | --- |
| 256 B | 20000 × 104532 ns/op | 20000 × 90245 ns/op |
| 512 B | 20000 × 94685 ns/op | 20000 × 86804 ns/op |
| 513 B | 20000 × 96399 ns/op | 20 × 52140028 ns/op |
| 1024 B | 20000 × 97876 ns/op | 20 × 52145628 ns/op |
| 1.2 MB | 1000 × 1242620 ns/op | 2000 × 1389453 ns/op |
| 3.6 MB | 500 × 2606823 ns/op | 1000 × 1653781 ns/op |
Content-Type detection sniffs the first 512 bytes, so files of 512 bytes or less will not invoke `sendfile`.
My macOS benchmark:

```
goos: darwin
goarch: amd64
pkg: test/ut
BenchmarkFileWithoutContentLength-12   13285   88998 ns/op
BenchmarkFileWithContentLength-12      13695   89779 ns/op
PASS
```
Change https://golang.org/cl/305229 mentions this issue: net: use the size be sent as the count param of sendfile
I tried the proposed diff above, and it breaks the HTTP server for range GETs, etc. I have seen similar behavior with `epoll_wait` under a combination of HEAD/GET requests, where random 200ms delay spikes appear. All I can see from some extensive logging is that `pconn.Read()` regularly gets blocked, which seems to block and delay subsequent calls as well. This increases overall response times, with unexpected latencies.

My OS is CentOS Linux release 8.3.2011.
cc @neild
Hey, what is going on here? Has this problem been solved?
Rediscovered as https://github.com/golang/go/issues/61530
What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

What did you do?

What did you expect to see?

Both benchmarks to take the same amount of time.

What did you see instead?