vapor / http

🚀 Non-blocking, event-driven HTTP built on Swift NIO.
MIT License
239 stars · 65 forks

2.0.3 Performance Regression #153

Closed: jdmcd closed this issue 6 years ago

jdmcd commented 7 years ago

Please see https://github.com/vapor/vapor/issues/1101

tanner0101 commented 7 years ago

Just to confirm, this does not appear to be happening on macOS Sierra.

Engine: 2.1.3 Vapor: 2.1.2

~$ wrk -t 4 -c 128 -d 10 http://localhost:8080/hello
Running 10s test @ http://localhost:8080/hello
  4 threads and 128 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.90ms  130.49us  13.97ms   92.81%
    Req/Sec    17.66k     3.60k   23.53k    54.46%
  710091 requests in 10.10s, 56.89MB read
Requests/sec:  70303.63
Transfer/sec:      5.63MB
~$ 

But I'm able to recreate it on Ubuntu 16.04 Desktop w/ the same dep versions.

parallels@ubuntu:~/Desktop/Speedtest$ wrk -t 4 -c 128 -d 10 http://localhost:8080/hello
Running 10s test @ http://localhost:8080/hello
  4 threads and 128 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    40.72ms   23.94ms   1.00s    99.82%
    Req/Sec    81.64     38.22   151.00     59.50%
  1631 requests in 10.03s, 194.08KB read
  Socket errors: connect 0, read 0, write 0, timeout 9
Requests/sec:    162.64
Transfer/sec:     19.35KB
jdmcd commented 7 years ago

Looks right to me

tanner0101 commented 7 years ago

As far as I can tell, this appears to be an issue w/ wrk not properly handling multiple writes to the stream. @vzsg @vi4m were you able to recreate this issue on tools other than wrk?

jdmcd commented 7 years ago

Hm, I think @vzsg did it with multiple versions of Engine though? I'm not sure why different versions would cause different results with wrk in particular

jdmcd commented 7 years ago

But it's totally possible I'm missing something too.

tanner0101 commented 7 years ago

@mcdappdev Prior to Engine 2.0.3, if the response was less than 2048 bytes in length, it would be sent entirely in one call to libc.send(). With 2.0.3, the body is always sent as a separate write call.
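
The difference can be sketched like this (a simplified model, not the actual Engine code; the `send` closure stands in for `libc.send()`):

```swift
// Simplified model of the change in Engine 2.0.3 (not the actual code).
// Before 2.0.3, a response under 2048 bytes went out in a single send();
// from 2.0.3 on, the head and body are always separate send() calls.
func writeResponse(head: [UInt8], body: [UInt8], coalesce: Bool,
                   send: ([UInt8]) -> Void) {
    if coalesce && head.count + body.count < 2048 {
        send(head + body)   // one syscall; typically one TCP segment
    } else {
        send(head)          // two syscalls: the peer may see the head
        send(body)          // and body arrive as separate segments
    }
}
```

A client that stops reading after its first recv() would then see only the head (or a truncated response) in the two-call case.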

tanner0101 commented 7 years ago

Running wrk on macOS sierra against Vapor running in Linux VM:

~$ wrk -t 4 -c 128 -d 10 http://10.211.55.3:8080/hello
Running 10s test @ http://10.211.55.3:8080/hello
  4 threads and 128 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   109.77us   34.20us   5.30ms   95.09%
    Req/Sec    16.18k     3.62k   18.64k    82.79%
  196749 requests in 10.10s, 22.70MB read
  Socket errors: connect 0, read 0, write 0, timeout 1
Requests/sec:  19480.57
Transfer/sec:      2.25MB

Running wrk from inside the Linux VM Vapor is running in:

parallels@ubuntu:~/Desktop/Speedtest$ wrk -t 4 -c 128 -d 5 http://localhost:8080/hello
Running 5s test @ http://localhost:8080/hello
  4 threads and 128 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    41.91ms   43.22ms   1.00s    99.80%
    Req/Sec   100.38     35.86   161.00     60.00%
  501 requests in 5.03s, 59.99KB read
  Socket errors: connect 0, read 0, write 0, timeout 4
Requests/sec:     99.70
Transfer/sec:     11.94KB

Seems likely that the implementation of wrk on Linux is not continuing to read from the socket until it gets the entire response.
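
For illustration, the rule a well-behaved client has to follow can be sketched as below (my own simplified completeness check, assuming a plain Content-Length response and no chunked encoding):

```swift
import Foundation

// A response may arrive across several reads, so a benchmark tool must keep
// reading until it has the full head plus Content-Length body bytes.
// Simplified sketch: Content-Length only, no chunked transfer encoding.
func responseIsComplete(_ buffer: String) -> Bool {
    // The head ends at the first blank line.
    guard let headEnd = buffer.range(of: "\r\n\r\n") else { return false }
    let head = String(buffer[..<headEnd.lowerBound])

    // Find the Content-Length header (default 0 if absent).
    var contentLength = 0
    for line in head.components(separatedBy: "\r\n") {
        let parts = line.split(separator: ":", maxSplits: 1)
        if parts.count == 2,
           parts[0].lowercased() == "content-length",
           let n = Int(String(parts[1]).trimmingCharacters(in: .whitespaces)) {
            contentLength = n
        }
    }

    // Complete only once the whole body has arrived.
    return buffer[headEnd.upperBound...].utf8.count >= contentLength
}
```

A client that treats the first short read as the whole response would report a timeout or an error exactly as wrk does above.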

tanner0101 commented 7 years ago

Closing this issue since every indication I can find points to this being a bug in wrk. I'd attempt to find a workaround so that Vapor's wrk scores on Linux still look good, but that seems like a waste of time considering the Swift Server APIs could soon replace our HTTP server implementation.

vzsg commented 7 years ago

I was running wrk on macOS in all tests, never inside the container, or perhaps I'm misunderstanding something about the experiments above.

(Those Swift Server APIs cannot arrive soon enough.)

vzsg commented 7 years ago

I just tried benchmarking with Apache Benchmark (ab), and the results got even weirder. It seems that ab gets stuck after each response, waiting 30 seconds until timeout, rendering this benchmark pretty useless.

What's really strange is that Node doesn't show this symptom with ab... Anyway, I'll look for another benchmark tool.

vzsg commented 7 years ago

However, another benchmark tool, vegeta, works fine and shows the slowdown between 2.0.2 and 2.1.3:

Latencies     [mean, 50, 95, 99, max]  2.173735ms, 2.313166ms, 2.78579ms, 2.878385ms, 5.671501ms  // 2.0.2
...
Latencies     [mean, 50, 95, 99, max]  33.34679ms, 50.587758ms, 50.835336ms, 50.937763ms, 61.981379ms  // 2.1.3
vi4m commented 7 years ago

Just to clarify @tanner0101: I ran my tests using the wrk client on macOS, not Linux, same as @vzsg.

Also, I see that you're running your Vapor server on Parallels. Could you try with an Ubuntu Docker image too? Our production servers run on Docker, so for me it's the only way to scale in production.

BrettRToomey commented 7 years ago

@tanner0101 I think the issue is related to the fact that you don't initialize your socket with TCP_CORK, nor call send with MSG_MORE.

tcp(7)

TCP_CORK (since Linux 2.2)
              If set, don't send out partial frames.  All queued partial
              frames are sent when the option is cleared again.  This is
              useful for prepending headers before calling sendfile(2), or
              for throughput optimization.

send(2)

MSG_MORE (since Linux 2.4.4)
              The caller has more data to send.  This flag is used with TCP
              sockets to obtain the same effect as the TCP_CORK socket
              option (see tcp(7)), with the difference that this flag can be
              set on a per-call basis.

If this is the case, I recommend adding an optional flags parameter to this function and, on the first call that writes out the headers, passing in MSG_MORE.
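
A sketch of what that could look like at the syscall level (hypothetical helpers, not Engine's actual API; MSG_MORE is Linux-only, so this falls back to no flag elsewhere):

```swift
#if os(Linux)
import Glibc
let headFlags = Int32(MSG_MORE)  // hold the partial frame until more data follows
#else
import Darwin
let headFlags: Int32 = 0         // MSG_MORE has no direct Darwin equivalent
#endif

// Hypothetical helpers: write the head with MSG_MORE so the kernel coalesces
// it with the body, then write the body with no flag to flush the frame.
func sendHead(_ fd: Int32, _ bytes: [UInt8]) -> Int {
    return bytes.withUnsafeBytes { send(fd, $0.baseAddress, $0.count, headFlags) }
}

func sendBody(_ fd: Int32, _ bytes: [UInt8]) -> Int {
    return bytes.withUnsafeBytes { send(fd, $0.baseAddress, $0.count, 0) }
}
```

Omitting the flag on the final call is what flushes the coalesced frame, matching the per-call behavior described in send(2) above.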

tanner0101 commented 7 years ago

@vzsg seems like ab also expects the entirety of the response to arrive in one read call.

@vzsg @vi4m wrk works fine on macOS for me (I got 70303.63 req/s in https://github.com/vapor/engine/issues/153#issuecomment-320771021). The only problem I had with wrk was running the Linux compiled version. Are you saying you're getting bad benchmark results with wrk for macOS?

@BrettRToomey I haven't been able to find anywhere that sending an HTTP req/res in multiple write calls is not allowed (in fact, it's required for stuff like SSE). Does anyone know of any documentation about what behavior should be expected in terms of the number of calls to send? Perhaps even though it's not specifically mentioned in the HTTP spec, it's become a de facto standard because of how much simpler it is to implement an HTTP parser that doesn't need to handle that edge case.

I'll take a look at implementing socket flushing using the MSG_MORE option and see if that fixes it.

vzsg commented 7 years ago

Yes, I've been continuously testing with wrk (and ab and vegeta) running on macOS, while the Vapor helloworld was running inside Docker.

Joannis commented 7 years ago

wrk seems to be buggy on macOS High Sierra here. I read the HTTP docs recently and didn't see anything about specific requirements for sending data on a socket.

robertjpayne commented 7 years ago

TCP_NODELAY may be of use too; it sounds like Nagle's algorithm is reducing throughput by delaying small packets.
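
For reference, a minimal sketch of what disabling Nagle's algorithm looks like (a hypothetical helper on a raw file descriptor, not Engine's API; the IPPROTO_TCP level is written as its numeric value 6 to keep the sketch portable across the Swift importers):

```swift
#if os(Linux)
import Glibc
let sockStream = Int32(SOCK_STREAM.rawValue)
#else
import Darwin
let sockStream = SOCK_STREAM
#endif

// Hypothetical helper: disable Nagle's algorithm so small writes (like a
// response head followed by a tiny body) go out immediately instead of
// being delayed for coalescing with later data.
func disableNagle(_ fd: Int32) -> Bool {
    var flag: Int32 = 1
    let ipprotoTCP: Int32 = 6  // IPPROTO_TCP
    return setsockopt(fd, ipprotoTCP, TCP_NODELAY,
                      &flag, socklen_t(MemoryLayout<Int32>.size)) == 0
}
```

Frameworks typically set this right after accept() on each connection socket.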

BrettRToomey commented 7 years ago

This isn't an HTTP issue. If it's blocking on small packets, it's a TCP issue, a layer below.

BrettRToomey commented 7 years ago

@robertjpayne Yeah, seems something like that. It may be easier to check in Wireshark or something and see the actual packets being sent out.

BrettRToomey commented 7 years ago

[screenshot: virtualbox_ubuntu testing_14_08_2017_16_09_49]

I ended up running it in VirtualBox myself. It looks like the response goes out as two TCP packets (as expected), gets reassembled into a single segment, and is then interpreted as HTTP correctly. I'll try running this again with wrk and try to filter through the noise.

Edit: Wow... It looks like when spamming requests, the response time for HTTP packets goes up a lot! It smells a lot like you're getting hit by Nagle's algorithm. ~If this is true, hypothetically, I should be able to increase the payload size and the performance should go up.~ Let me go try that.

Edit edit: I think I successfully disabled Nagle's algorithm system-wide on Linux, and I'm still seeing poor performance under wrk. With cURL, though, all of the statistics, including latency, look perfectly fine.

[screenshot: virtualbox_ubuntu testing_14_08_2017_16_37_17]

robertjpayne commented 7 years ago

@vi4m I think Docker has something to do with it too. Running Vapor inside Docker on macOS (regular Docker.app install), a simple "Hello World" plain-text benchmark caps at about 200 req/s.

This is with docker set to use 10GB of ram and 6vCPU since it does run inside a VM.

tanner0101 commented 7 years ago

I'd be interested to see if https://github.com/vapor/engine/pull/168 fixes the issue (for anyone who can recreate it). That PR modifies the basic server in Engine to send the entire response in one libc.send() call (unless it's chunked, of course).

vzsg commented 7 years ago

Doesn't compile for me :(

/code/.build/checkouts/engine.git-3458089043233217718/Sources/HTTP/Server/BasicServer.swift:156:44: error: cannot convert value of type 'UnsafeRawBufferPointer' to expected argument type 'UnsafeBufferPointer<UInt8>'
    static let empty = DispatchData(bytes: UnsafeRawBufferPointer(start: nil, count: 0))
                                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/code/.build/checkouts/engine.git-3458089043233217718/Sources/HTTP/Server/BasicServer.swift:110:48: error: cannot convert value of type 'UnsafeRawBufferPointer' to expected argument type 'UnsafeBufferPointer<UInt8>'
                let data = DispatchData(bytes: buffer)
                                               ^~~~~~
/code/.build/checkouts/engine.git-3458089043233217718/Sources/HTTP/Server/BasicServer.swift:122:48: error: cannot convert value of type 'UnsafeRawBufferPointer' to expected argument type 'UnsafeBufferPointer<UInt8>'
                let data = DispatchData(bytes: buffer)
                                               ^~~~~~
tanner0101 commented 7 years ago

Ah, looks like a Swift 4 change.
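
For reference, Swift 4 changed `DispatchData`'s byte initializer from `UnsafeBufferPointer<UInt8>` to `UnsafeRawBufferPointer`, which is exactly what the errors above complain about. A source-compatible sketch (my own helper, not the PR's actual fix) could look like:

```swift
import Dispatch

// Hypothetical helper showing a source-compatible way to build DispatchData
// across the Swift 3 -> 4 initializer change.
func makeDispatchData(_ bytes: [UInt8]) -> DispatchData {
    return bytes.withUnsafeBytes { raw -> DispatchData in
        #if swift(>=4.0)
        return DispatchData(bytes: raw)  // Swift 4: takes UnsafeRawBufferPointer
        #else
        return DispatchData(bytes: raw.bindMemory(to: UInt8.self))  // Swift 3
        #endif
    }
}
```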