fukamachi opened this issue 10 years ago
Thanks for testing/setup. I loaded up the SBCL profiler and I'm seeing almost the exact same thing. I went through and made some changes, although mostly unrelated to writing: the `*input-buffer*` handling (before, it was still copying out data byte by byte via `cffi:mem-aref`) and `(deref-pointer-data)`/`(get-callbacks)` are now more efficient.

So far, I'm not really sure why uv_write is so expensive. It's a direct wrap of the underlying C function, nothing funny going on under the hood. Maybe libuv is just slow when writing? When I get a chance I'll see how it compares to `write-to-evbuffer` (the `write-to-uvstream` equivalent from the master branch).
If you get a chance, do a pull on cl-libuv and cl-async and give it another shot. I'm interested if your throughput goes up (and interested to see another profile result).
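For reference, here is a minimal sketch of how a report like the ones in this thread can be collected with SBCL's deterministic profiler. The package names are the ones that appear in the reports; adjust as needed.

```lisp
;; String arguments to sb-profile:profile name whole packages.
(sb-profile:profile "WOO" "WOO.RESPONSE" "CL-ASYNC" "CL-ASYNC-BASE" "LIBUV")

;; ... start the server, drive it with wrk for a while, then stop it ...

(sb-profile:report)      ; prints the seconds/gc/consed/calls table
(sb-profile:unprofile)   ; remove the instrumentation afterwards
```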
> If you get a chance, do a pull on cl-libuv and cl-async and give it another shot. I'm interested if your throughput goes up (and interested to see another profile result).
I just tried with the latest cl-libuv and cl-async, and the throughput went up (Requests/sec: 6599.93 -> 10162.96, about 1.5x better).
Here's the profiling report. (Note: there are 1.5 times as many request/response cycles as in the last report.)
seconds | gc | consed | calls | sec/call | name
-------------------------------------------------------------
1.597 | 0.000 | 144 | 233,616 | 0.000007 | LIBUV:UV-WRITE
0.702 | 0.000 | 208 | 233,616 | 0.000003 | CL-ASYNC::WRITE-TO-UVSTREAM
0.208 | 0.000 | 32,864 | 233,658 | 0.000001 | CL-ASYNC::CHECK-SOCKET-OPEN
0.187 | 0.000 | 65,280 | 233,616 | 0.000001 | CL-ASYNC::TCP-WRITE-CB
0.151 | 0.000 | 17,388,160 | 38,936 | 0.000004 | WOO::HANDLE-REQUEST
0.150 | 0.000 | 0 | 233,616 | 0.000001 | LIBUV:ALLOC-REQ
0.148 | 0.004 | 65,545,440 | 38,952 | 0.000004 | WOO::SETUP-PARSER
0.147 | 0.004 | 15,232,896 | 38,936 | 0.000004 | WOO::PARSE-HOST-HEADER
0.080 | 0.000 | 0 | 233,616 | 0.000000 | LIBUV:FREE-UV-BUF
0.077 | 0.055 | 230,200,256 | 38,936 | 0.000002 | WOO.RESPONSE:WRITE-RESPONSE-HEADERS
0.064 | 0.000 | 48 | 272,561 | 0.000000 | LIBUV:ALLOC-UV-BUF
0.049 | 0.000 | 0 | 233,632 | 0.000000 | CL-ASYNC:SOCKET-C
0.031 | 0.000 | 0 | 233,616 | 0.000000 | CL-ASYNC::SOCKET-CONNECTED
0.031 | 0.000 | 48 | 233,658 | 0.000000 | CL-ASYNC:SOCKET-CLOSED
0.028 | 0.000 | 0 | 38,945 | 0.000001 | CL-ASYNC::TCP-ALLOC-CB
0.026 | 0.000 | 16 | 233,616 | 0.000000 | LIBUV:FREE-REQ
0.023 | 0.000 | 2,749,536 | 42 | 0.000547 | WOO::EVENT-CB
0.017 | 0.000 | 0 | 272,578 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-CATCH-APP-ERRORS
0.017 | 0.013 | 392,096 | 234 | 0.000072 | LIBUV:ERRVAL
0.013 | 0.000 | 0 | 38,945 | 0.000000 | CL-ASYNC::SOCKET-DRAIN-READ-BUFFER
0.009 | 0.000 | 0 | 38,936 | 0.000000 | WOO::HTTP-VERSION-KEYWORD
0.008 | 0.000 | 0 | 38,952 | 0.000000 | (SETF CL-ASYNC:SOCKET-DATA)
0.007 | 0.000 | 1,592,368 | 1 | 0.006992 | CL-ASYNC:TCP-SERVER
0.007 | 0.000 | 326,672 | 42 | 0.000158 | CL-ASYNC::EVENT-HANDLER
0.005 | 0.000 | 837,296 | 16 | 0.000305 | CL-ASYNC::INIT-INCOMING-SOCKET
0.005 | 0.000 | 914,784 | 42 | 0.000116 | CL-ASYNC:CLOSE-SOCKET
0.002 | 0.000 | 583,424 | 1 | 0.001993 | CL-ASYNC:DELAY
0.001 | 0.000 | 0 | 1 | 0.001000 | LIBUV:UV-LOOP-INIT
0.001 | 0.000 | 0 | 2 | 0.000500 | CL-ASYNC::TCP-SERVER-CLOSED
0.001 | 0.000 | 0 | 2 | 0.000500 | CL-ASYNC:EVENT-FREED
0.001 | 0.000 | 0 | 16 | 0.000062 | CL-ASYNC::SOCKET-DIRECTION
0.001 | 0.000 | 0 | 16 | 0.000062 | CL-ASYNC::TCP-SERVER-STREAM
0.001 | 0.000 | 0 | 18 | 0.000055 | LIBUV:UV-CLOSE
0.001 | 0.000 | 0 | 16 | 0.000061 | WOO::CONNECT-CB
0.000 | 0.000 | 0 | 38,936 | 0.000000 | WOO::HANDLE-RESPONSE
0.000 | 0.000 | 32,768 | 38,936 | 0.000000 | WOO::HANDLE-NORMAL-RESPONSE
0.000 | 0.000 | 34,299,264 | 38,936 | 0.000000 | WOO::READ-CB
0.000 | 0.000 | 0 | 1 | 0.000000 | WOO:RUN
0.000 | 0.000 | 0 | 18 | 0.000000 | LIBUV::HANDLE-TO-VAL
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV::HANDLE-FROM-VAL
0.000 | 0.000 | 0 | 2 | 0.000000 | LIBUV:UV-LOOP-CLOSE
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:HANDLE-TYPE
0.000 | 0.000 | 0 | 17 | 0.000000 | LIBUV:UV-READ-STOP
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-IP-4-ADDR
0.000 | 0.000 | 0 | 18 | 0.000000 | LIBUV:FREE-HANDLE
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-TIMER-START
0.000 | 0.000 | 33,136 | 42 | 0.000000 | LIBUV:UV-ERR-NAME
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-STOP
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-WALK
0.000 | 0.000 | 0 | 18 | 0.000000 | LIBUV:ALLOC-HANDLE
0.000 | 0.000 | 0 | 16 | 0.000000 | LIBUV:UV-ACCEPT
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-TIMER-INIT
0.000 | 0.000 | 0 | 16 | 0.000000 | LIBUV:UV-READ-START
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-LISTEN
0.000 | 0.000 | 6,358,960 | 3 | 0.000000 | LIBUV:UV-RUN
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-TCP-BIND
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-LOOP-SIZE
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-TIMER-STOP
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-IS-CLOSING
0.000 | 0.000 | 0 | 17 | 0.000000 | LIBUV:UV-TCP-INIT
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC::PROCESS-EVENT-LOOP-EXIT-CALLBACKS
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC::CHECK-EVENT-UNFREED
0.000 | 0.000 | 0 | 17 | 0.000000 | CL-ASYNC::DO-CLOSE-TCP
0.000 | 0.000 | 0 | 1 | 0.000000 | (SETF CL-ASYNC::TCP-SERVER-CLOSED)
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC::TIMER-CB
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC::LOOP-EXIT-WALK-CB
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC::DO-CLOSE-LOOP
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC::CHECK-EVENT-LOOP-RUNNING
0.000 | 0.000 | 0 | 17 | 0.000000 | CL-ASYNC::TCP-CLOSE-CB
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC::TCP-SERVER-C
0.000 | 0.000 | 32,512 | 38,945 | 0.000000 | CL-ASYNC::TCP-READ-CB
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC::TIMER-CLOSE-CB
0.000 | 0.000 | 0 | 16 | 0.000000 | CL-ASYNC::TCP-ACCEPT-CB
0.000 | 0.000 | 0 | 1 | 0.000000 | (SETF CL-ASYNC:EVENT-FREED)
0.000 | 0.000 | 32,768 | 1 | 0.000000 | CL-ASYNC:START-EVENT-LOOP
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC:ADD-EVENT
0.000 | 0.000 | 0 | 16 | 0.000000 | (SETF CL-ASYNC:SOCKET-CLOSED)
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC:CLOSE-TCP-SERVER
0.000 | 0.000 | 0 | 38,936 | 0.000000 | CL-ASYNC:SOCKET-DATA
0.000 | 0.000 | 352 | 233,616 | 0.000000 | CL-ASYNC:WRITE-SOCKET-DATA
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC:ADD-EVENT-LOOP-EXIT-CALLBACK
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC:EVENT-C
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC:FREE-EVENT
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC:EVENT-FREED-P
0.000 | 0.000 | 0 | 16 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-NUM-CONNECTIONS-IN
0.000 | 0.000 | 0 | 16 | 0.000000 | (SETF CL-ASYNC-BASE:EVENT-BASE-NUM-CONNECTIONS-IN)
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-DEFAULT-EVENT-HANDLER
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-ID
0.000 | 0.000 | 0 | 24 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-C
0.000 | 0.000 | 0 | 3 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-EXIT-FUNCTIONS
0.000 | 0.000 | 0 | 3 | 0.000000 | (SETF CL-ASYNC-BASE:EVENT-BASE-EXIT-FUNCTIONS)
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-DATA-REGISTRY
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-FUNCTION-REGISTRY
-------------------------------------------------------------
3.795 | 0.076 | 376,651,296 | 3,622,026 | | Total
estimated total profiling overhead: 3.89 seconds
overhead estimation parameters:
8.000001e-9s/call, 1.074e-6s total profiling, 4.66e-7s internal profiling
> So far, I'm not really sure why uv_write is so expensive.
Yeah, right. It's weird.
JFYI, here's the report when running with the master branch (libevent2). Although `as:write-to-evbuffer` is at the top of it, it's less expensive than `uv-write`.
seconds | gc | consed | calls | sec/call | name
-------------------------------------------------------------
0.324 | 0.000 | 8,320,544 | 349,674 | 0.000001 | CL-ASYNC::WRITE-TO-EVBUFFER
0.224 | 0.000 | 304 | 58,279 | 0.000004 | CL-ASYNC::TCP-WRITE-CB
0.214 | 0.000 | 99,552,208 | 58,279 | 0.000004 | WOO::HANDLE-REQUEST
0.200 | 0.006 | 4,643,040 | 58,279 | 0.000003 | WOO::PARSE-HOST-HEADER
0.161 | 0.000 | 208 | 407,985 | 0.000000 | CL-ASYNC::CHECK-SOCKET-OPEN
0.142 | 0.061 | 309,842,720 | 58,279 | 0.000002 | WOO.RESPONSE:WRITE-RESPONSE-HEADERS
0.123 | 0.005 | 105,812,432 | 58,295 | 0.000002 | WOO::SETUP-PARSER
0.080 | 0.000 | 0 | 349,674 | 0.000000 | LIBEVENT2:EVBUFFER-ADD
0.063 | 0.000 | 0 | 349,690 | 0.000000 | LIBEVENT2:BUFFEREVENT-ENABLE
0.062 | 0.000 | 0 | 116,558 | 0.000001 | LIBEVENT2:EVBUFFER-REMOVE
0.040 | 0.000 | 0 | 233,352 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-DATA-REGISTRY
0.035 | 0.000 | 0 | 233,290 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-FUNCTION-REGISTRY
0.034 | 0.000 | 1,343,488 | 349,674 | 0.000000 | LIBEVENT2:BUFFEREVENT-GET-OUTPUT
0.026 | 0.000 | 0 | 58,295 | 0.000000 | (SETF CL-ASYNC:SOCKET-DATA)
0.019 | 0.000 | 0 | 407,969 | 0.000000 | CL-ASYNC:SOCKET-C
0.016 | 0.000 | 3,349,088 | 1 | 0.015993 | CL-ASYNC:TCP-SERVER
0.011 | 0.000 | 0 | 116,591 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-CATCH-APP-ERRORS
0.005 | 0.000 | 0 | 58,279 | 0.000000 | WOO::HTTP-VERSION-KEYWORD
0.005 | 0.000 | 717,472 | 16 | 0.000302 | CL-ASYNC::INIT-INCOMING-SOCKET
0.004 | 0.000 | 0 | 407,969 | 0.000000 | CL-ASYNC:SOCKET-CLOSED
0.003 | 0.000 | 0 | 58,279 | 0.000000 | CL-ASYNC::SOCKET-DRAIN-READ-BUFFER
0.003 | 0.000 | 612,400 | 1 | 0.002991 | CL-ASYNC:DELAY
0.002 | 0.000 | 163,616 | 16 | 0.000108 | CL-ASYNC::TCP-EVENT-CB
0.001 | 0.000 | 0 | 58,279 | 0.000000 | LIBEVENT2:BUFFEREVENT-GET-INPUT
0.001 | 0.000 | 0 | 1 | 0.001000 | LIBEVENT2:EVENT-BASE-NEW
0.001 | 0.000 | 0 | 1 | 0.001000 | CL-ASYNC::TCP-SERVER-C
0.001 | 0.000 | 0 | 16 | 0.000062 | LIBEVENT2:BUFFEREVENT-FREE
0.001 | 0.000 | 0 | 16 | 0.000062 | CL-ASYNC::TCP-SERVER-STREAM
0.000 | 0.000 | 0 | 58,279 | 0.000000 | WOO::HANDLE-RESPONSE
0.000 | 0.005 | 32,912 | 58,279 | 0.000000 | WOO::HANDLE-NORMAL-RESPONSE
0.000 | 0.000 | 2,713,488 | 58,279 | 0.000000 | WOO::READ-CB
0.000 | 0.000 | 0 | 16 | 0.000000 | WOO::CONNECT-CB
0.000 | 0.000 | 0 | 16 | 0.000000 | WOO::EVENT-CB
0.000 | 0.000 | 0 | 1 | 0.000000 | WOO:RUN
0.000 | 0.000 | 0 | 16 | 0.000000 | LIBEVENT2:BUFFEREVENT-DISABLE
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBEVENT2:EVENT-ACTIVE
0.000 | 0.000 | 0 | 16 | 0.000000 | LIBEVENT2:BUFFEREVENT-GETFD
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBEVENT2:EVENT-BASE-FREE
0.000 | 0.000 | 0 | 16 | 0.000000 | LIBEVENT2:BUFFEREVENT-SETCB
0.000 | 0.000 | 0 | 16 | 0.000000 | LIBEVENT2:EVUTIL-MAKE-SOCKET-NONBLOCKING
0.000 | 0.000 | 2,052,096 | 1 | 0.000000 | LIBEVENT2:EVENT-BASE-DISPATCH
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBEVENT2:EVENT-FREE
0.000 | 0.000 | 0 | 16 | 0.000000 | LIBEVENT2:BUFFEREVENT-SOCKET-NEW
0.000 | 0.000 | 0 | 16 | 0.000000 | LIBEVENT2:BUFFEREVENT-SET-TIMEOUTS
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBEVENT2:EVENT-NEW
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBEVENT2:EVCONNLISTENER-FREE
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBEVENT2:EVCONNLISTENER-NEW-BIND
0.000 | 0.000 | 0 | 16 | 0.000000 | LIBEVENT2:EVCONNLISTENER-GET-BASE
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBEVENT2:EVCONNLISTENER-SET-ERROR-CB
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC::PROCESS-EVENT-LOOP-EXIT-CALLBACKS
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC::CHECK-EVENT-UNFREED
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC::TCP-SERVER-CLOSED
0.000 | 0.000 | 0 | 1 | 0.000000 | (SETF CL-ASYNC::TCP-SERVER-CLOSED)
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC::TIMER-CB
0.000 | 0.000 | 0 | 32 | 0.000000 | CL-ASYNC::SOCKET-DIRECTION
0.000 | 0.005 | 328,160 | 58,279 | 0.000000 | CL-ASYNC::READ-SOCKET-DATA
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC::EVENT-FREE-CALLBACK
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC::CHECK-EVENT-LOOP-RUNNING
0.000 | 0.000 | 0 | 16 | 0.000000 | CL-ASYNC::TCP-ACCEPT-CB
0.000 | 0.000 | 160 | 58,279 | 0.000000 | CL-ASYNC::TCP-READ-CB
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC:ADD-EVENT
0.000 | 0.000 | 0 | 58,279 | 0.000000 | CL-ASYNC:SOCKET-DATA
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC:EVENT-FREED
0.000 | 0.000 | 0 | 1 | 0.000000 | (SETF CL-ASYNC:EVENT-FREED)
0.000 | 0.000 | 78,096 | 1 | 0.000000 | CL-ASYNC:START-EVENT-LOOP
0.000 | 0.000 | 416 | 349,674 | 0.000000 | CL-ASYNC:WRITE-SOCKET-DATA
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC:CLOSE-TCP-SERVER
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC:FREE-EVENT
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC:EVENT-C
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC:ADD-EVENT-LOOP-EXIT-CALLBACK
0.000 | 0.000 | 0 | 16 | 0.000000 | CL-ASYNC:SET-SOCKET-TIMEOUTS
0.000 | 0.000 | 0 | 16 | 0.000000 | CL-ASYNC:CLOSE-SOCKET
0.000 | 0.000 | 0 | 16 | 0.000000 | (SETF CL-ASYNC:SOCKET-CLOSED)
0.000 | 0.000 | 0 | 32 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-NUM-CONNECTIONS-IN
0.000 | 0.000 | 0 | 32 | 0.000000 | (SETF CL-ASYNC-BASE:EVENT-BASE-NUM-CONNECTIONS-IN)
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-DEFAULT-EVENT-HANDLER
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-ID
0.000 | 0.000 | 0 | 8 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-C
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-EXIT-FUNCTIONS
0.000 | 0.000 | 0 | 2 | 0.000000 | (SETF CL-ASYNC-BASE:EVENT-BASE-EXIT-FUNCTIONS)
0.000 | 0.000 | 0 | 1 | 0.000000 | (SETF CL-ASYNC-BASE:EVENT-BASE-FUNCTION-REGISTRY)
0.000 | 0.000 | 0 | 1 | 0.000000 | (SETF CL-ASYNC-BASE:EVENT-BASE-DATA-REGISTRY)
-------------------------------------------------------------
1.800 | 0.082 | 539,562,848 | 4,546,734 | | Total
I almost wonder if the problem could be worked around by buffering data in Lisp to cut down on the number of calls to uv_write: the data would be stored inside the socket object and sent either a) on the next pass of the event loop or b) once the socket's output buffer is full. This way, numerous calls to `write-to-uvstream` in the same pass of the loop would result in at most one (or a few, if a lot of data is sent) calls to uv_write. However, uv_write might just end up being linearly as slow with the amount of data being pumped into it, in which case there's not a lot we can do about it. Thoughts?
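Roughly what I mean, as a sketch (hypothetical names, not cl-async's actual internals): queue outgoing data per socket and flush it with a single write call, either when the buffer fills or once per pass of the event loop.

```lisp
(defconstant +out-buffer-size+ 65536)

(defstruct wsocket
  uvstream                                         ; underlying uv stream handle
  (out-buffer (make-array +out-buffer-size+
                          :element-type '(unsigned-byte 8)))
  (out-fill 0))

(defun queue-write (socket data write-fn)
  "Queue DATA on SOCKET; WRITE-FN is whatever ends up calling uv_write."
  (let ((len (length data)))
    ;; flush first if this chunk wouldn't fit
    (when (> (+ (wsocket-out-fill socket) len) +out-buffer-size+)
      (flush-writes socket write-fn))
    (if (> len +out-buffer-size+)
        ;; oversized chunk: send it directly rather than buffering
        (funcall write-fn (wsocket-uvstream socket) data 0 len)
        (progn
          (replace (wsocket-out-buffer socket) data
                   :start1 (wsocket-out-fill socket))
          (incf (wsocket-out-fill socket) len)))))

(defun flush-writes (socket write-fn)
  "Send everything queued so far in one call (e.g. once per loop pass)."
  (when (plusp (wsocket-out-fill socket))
    (funcall write-fn (wsocket-uvstream socket)
             (wsocket-out-buffer socket) 0 (wsocket-out-fill socket))
    (setf (wsocket-out-fill socket) 0)))
```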
Woo calls `as:write-socket-data` for each header line. That might be a problem.
I tried changing it to reduce the number of calls by first building a byte vector of the whole headers, and interestingly, the score went up (10162.96 -> 13898.52 req/sec).
Since that strategy also works with the libevent2-backed cl-async, and the libevent2 version still records a better score (16417.39 req/sec), it's not a bottleneck specific to the uv branch, though.
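The later profile shows a `WOO.RESPONSE:RESPONSE-HEADERS-BYTES` function; a simplified sketch of the idea (not Woo's actual code, and assuming babel for the string-to-octets conversion) is to render the whole header block into one octet vector and write it once:

```lisp
(defun headers->octets (status headers)
  "Render a status line plus HEADERS (an alist) as a single octet vector."
  (let ((crlf (format nil "~C~C" #\Return #\Linefeed)))
    (babel:string-to-octets
     (with-output-to-string (s)
       (format s "HTTP/1.1 ~A~A" status crlf)
       (loop for (name . value) in headers
             do (format s "~A: ~A~A" name value crlf))
       (write-string crlf s)))))

;; one write for the whole header block, instead of one write per line:
;; (as:write-socket-data socket (headers->octets "200 OK" '(("Content-Length" . "5"))))
```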
Odd that the libevent2 branch is faster, especially because it doesn't have the static-vectors optimization, which I assumed would speed up reads/writes significantly. The only thing I can think is...libevent is a lot faster than libuv. I'm having trouble finding low-level benchmarks comparing the two. I always just assumed libuv was faster, but it might not be.
I wonder: if the static-vectors changes were done on master, would a Wookie/Woo app be able to beat the performance of Node? I can try this tonight. I'm not married to using libuv, especially because the lack of built-in SSL support is going to cause me some late nights over the next week or so. However, the libevent SSL implementation is lacking and may need to be redone by hand anyway, and libuv is much more actively maintained.
Either way, if a compiled language wrapping a C library can't beat a VM language wrapping a C library, then I must have done something wrong along the way. Maybe it makes sense to stick with libuv and focus on beating node instead of beating the libevent branch.
The write buffering looks like it should help a bit from your testing, so I'll give that a try tonight as well if I get a chance.
Thanks for all the testing and benchmarking you've done!
> The only thing I can think is...libevent is a lot faster than libuv.

I never thought so, but if that's the case, the profiling result makes sense.

> I wonder: if the static-vectors changes were done on master, would a Wookie/Woo app be able to beat the performance of Node?

Definitely.
Actually, Woo outperforms Node a little after reducing the calls to `as:write-socket-data`. I'm sure Wookie can also beat it, since the major differences between Woo and Wookie are only the libraries they depend on (fast-http vs. http-parse and quri vs. puri).
This is good news. I'll write that buffering code for `write-to-uvstream` to reduce calls to `uv_write`, and hopefully that will speed things up a good amount. I noticed some really nasty data overlap bugs popping up when using Wookie on the new UV branch when serving some real app data. So I need to figure that stuff out before the libuv branch can be used for anything other than benchmarks (it's probably a one-line fix I need to make somewhere, but might take a few hours to track down).
Also, do you know when fast-http will have multipart parsing? Once that gets in, I'll happily switch Wookie over to fast-http/quri.
Oh, I can't wait to try it!
> Also, do you know when fast-http will have multipart parsing? Once that gets in, I'll happily switch Wookie over to fast-http/quri.

The multipart parser is already a part of fast-http.
Although the `:multipart-callback` option of `make-parser` has been removed, you can do the same thing by using `:body-callback` and `make-multipart-parser`. I think it's cleaner.
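A rough, untested sketch of that combination (the callback lambda-lists here are assumptions and may differ between fast-http versions):

```lisp
(let* ((http (fast-http:make-http-request))
       ;; the multipart parser is fed raw body chunks and invokes its
       ;; callback once per form field
       (multipart (fast-http:make-multipart-parser
                   "multipart/form-data; boundary=AaB03x"
                   (lambda (field-name headers field-meta body)
                     (declare (ignore headers field-meta))
                     (format t "~&got field ~A (~A bytes)~%"
                             field-name (length body)))))
       (parser (fast-http:make-parser
                http
                :body-callback (lambda (data start end)
                                 (funcall multipart (subseq data start end))))))
  ;; REQUEST-OCTETS stands in for the raw bytes read off the socket
  (funcall parser request-octets))
```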
Another major difference is the lack of the `:store-body` and `:force-stream` options, because fast-http always forces streaming.
I can't recommend using `:store-body T` naively without thinking seriously about large-POST attacks.
Since I currently see no good way to protect against such attacks, I leave storing the body to application users.
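For what it's worth, application-side storage with a size cap could look roughly like this (illustrative only; the body-callback lambda-list is an assumption):

```lisp
(defparameter *max-body-size* (* 1024 1024))   ; reject bodies over 1 MB

(defun make-capped-body-collector ()
  "Return two closures: a chunk collector (for a :body-callback) and a finisher."
  (let ((chunks '())
        (total 0))
    (values
     (lambda (data start end)
       (incf total (- end start))
       (when (> total *max-body-size*)
         (error "Request body too large"))
       (push (subseq data start end) chunks))
     (lambda ()
       (apply #'concatenate '(vector (unsigned-byte 8)) (nreverse chunks))))))
```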
Oh good, I didn't know the multipart parser was done!
> Another major difference is the lack of the `:store-body` and `:force-stream` options, because fast-http always forces streaming. I can't recommend using `:store-body T` naively without thinking seriously about large-POST attacks.
Agreed, which is why this was the default in http-parse (to not store the body unless explicitly asked for), but I also think removing it completely and forcing the app to deal with storage is a good move. I will probably have to re-wire some of the POST stuff in Wookie, but that's not really a big issue. I agree with forcing streaming as well.
BTW just implemented the socket write buffering, and also replaced some other inefficient buffering code as well. I'm currently working on the issue where multiple requests coming in at once get the responses mixed up. I'm working on creating a reproducible test case and seeing if I can find the problem.
So you can use the current changes for benchmarking, but you may run into weird request/response data overlap if you use the uv branch for anything else.
Data corruption issues are fixed. Turns out it wasn't cross-socket pollution, it was a much simpler problem where responses > 64K were getting the same data sent for each chunk (as opposed to the chunks progressing over the length of the full write).
An odd thing happened.
It seems that after I changed Woo to reduce the calls to `as:write-socket-data`, the `uv` branch's score was almost the same as `master`'s.
|              | uv       | master   |
|--------------|----------|----------|
| Requests/sec | 16294.04 | 16231.17 |
| Transfer/sec | 2.05MB   | 2.04MB   |
I never noticed that.
Then I pulled the latest cl-async and benchmarked again, and, interestingly, the score went down.
|              | uv       | master   |
|--------------|----------|----------|
| Requests/sec | 12405.08 | 16231.17 |
| Transfer/sec | 1.56MB   | 2.04MB   |
It seems the recent changes don't suit the current Woo strategy.
I also benchmarked with a slightly older Woo (which calls `as:write-socket-data` naively).
|              | uv       | master   |
|--------------|----------|----------|
| Requests/sec | 10365.12 | 15315.03 |
| Transfer/sec | 1.13MB   | 1.67MB   |

|              | uv       | master   |
|--------------|----------|----------|
| Requests/sec | 11629.93 | 15315.03 |
| Transfer/sec | 1.26MB   | 1.67MB   |
Wow, the recent changes work on this Woo.
I changed Woo a little to call `as:write-socket-data` twice, for the headers and the body separately (not once), and then the score went back to almost the same as before. Interesting.
No, it didn't... sorry. I fixed the above benchmark results a little.
I'm still trying things, but e7df0e44944d0c98a555509ccb6fab4d9f653b76 is probably better than the latest uv branch.
Interesting, I thought buffering would have helped, but maybe it's best to let libuv handle it. I have no problem reverting those changes; however, it almost seems there's an optimal amount of buffering somewhere between buffering all calls in one loop pass and calling `uv_write` directly for everything. I wonder if there's a sweet spot here. This seems like a pretty aggressive optimization though, so for now I'll revert to the non-buffering `write-socket-data` and we can revisit later if we need to.
I also wonder if replacing flexi-streams with fast-io will increase buffering speed to the point of making a significant difference. I wanted to try this last night but ran out of time. I will see if I can find time tonight.
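For context, the fast-io style of buffering looks roughly like this (generic usage of fast-io's exported API, not cl-async's internal code):

```lisp
(ql:quickload '(:fast-io :babel))

(defun build-response (header-lines body-octets)
  "Accumulate output in a fast-io buffer and return one octet vector."
  (fast-io:with-fast-output (buf)
    (dolist (line header-lines)
      (fast-io:fast-write-sequence (babel:string-to-octets line) buf))
    (fast-io:fast-write-sequence body-octets buf)))
```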
A few changes. I made write buffering switchable via `*buffer-writes*` (`nil` by default). This should allow easy benchmarking.
I replaced all buffering in sockets/streams with fast-io (make sure you pull fast-io from master for the new exported functions). I hope this will speed things up a bit more.
I'm also exporting `*buffer-size*`, which allows setting the size of the input and output buffers (which are static vectors). However, I noticed that when it's set above the recommended 65535 bytes, calls to uv_write seem to have really odd data crossover between sockets. This is tested for in the bundled `no-overlap` test. I spent a while trying to figure out what's going on, and my conclusion is that it's a libuv bug: the data being passed to uv_write seems to not match up with the data received on the socket. This has only been tested with `*buffer-size* == (* 1024 128)` using local connections in the same event loop. For now, I'd leave `*buffer-size*` as is.
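Assuming the two switches are exported from the `cl-async` package as described above, an A/B benchmark run would just toggle them before starting the loop, something like:

```lisp
;; both names are taken from the comment above; defaults are NIL and 65535
(setf cl-async:*buffer-writes* t)         ; turn write buffering on for this run
;; (setf cl-async:*buffer-size* 65535)    ; size of the static input/output buffers

(as:start-event-loop
 (lambda ()
   ;; ... start the server being benchmarked ...
   ))
```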
I benchmarked again with the latest cl-async and fast-io. Here's the report:
|              | Before   | After    |
|--------------|----------|----------|
| Requests/sec | 12293.87 | 16029.19 |
| Transfer/sec | 1.55MB   | 2.02MB   |
Hey, it's much better than before!
Here's the latest profiling report:
seconds | gc | consed | calls | sec/call | name
-------------------------------------------------------------
1.909 | 0.047 | 436,468,864 | 99,788 | 0.000019 | WOO.RESPONSE:RESPONSE-HEADERS-BYTES
0.793 | 0.000 | 64 | 99,788 | 0.000008 | LIBUV:UV-WRITE
0.563 | 0.012 | 43,499,152 | 99,788 | 0.000006 | WOO::HANDLE-REQUEST
0.403 | 0.006 | 49,854,528 | 99,788 | 0.000004 | WOO::PARSE-HOST-HEADER
0.318 | 0.000 | 0 | 99,788 | 0.000003 | CL-ASYNC::WRITE-TO-UVSTREAM
0.127 | 0.000 | 288 | 99,788 | 0.000001 | CL-ASYNC::TCP-WRITE-CB
0.081 | 0.000 | 0 | 99,788 | 0.000001 | LIBUV:ALLOC-REQ
0.065 | 0.000 | 0 | 99,788 | 0.000001 | LIBUV:FREE-UV-BUF
0.063 | 0.000 | 0 | 99,804 | 0.000001 | CL-ASYNC::TCP-ALLOC-CB
0.045 | 0.000 | 0 | 99,804 | 0.000000 | CL-ASYNC::CHECK-SOCKET-OPEN
0.034 | 0.000 | 0 | 199,592 | 0.000000 | LIBUV:ALLOC-UV-BUF
0.032 | 0.000 | 0 | 199,609 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-CATCH-APP-ERRORS
0.030 | 0.000 | 0 | 99,804 | 0.000000 | CL-ASYNC::SOCKET-DRAIN-READ-BUFFER
0.028 | 0.000 | 0 | 99,788 | 0.000000 | LIBUV:FREE-REQ
0.026 | 0.015 | 1,599,936 | 1 | 0.025992 | CL-ASYNC:TCP-SERVER
0.026 | 0.000 | 0 | 99,788 | 0.000000 | WOO::HTTP-VERSION-KEYWORD
Most of the slow, heavily-consing parts belong to Woo now. Good job!
The current Woo implementation creates a response octet vector every time and throws it at `as:write-socket-data`.
I can probably avoid the data copying by writing into cl-async's `*output-buffer*` directly.
This is great news! I'm glad we got the numbers looking this good.
Was this latest benchmark with `*buffer-writes*` set to `t` or `nil`? I'm curious to see if buffering the writes with fast-io was more or less performant than passing the writes directly through to uv_write.
Oh, I actually never pushed the `*buffer-writes*` stuff. Oops. Should be up there now.
@orthecreedence @fukamachi Did either of you run a side-by-side strace or perf? I would expect libevent and libuv to have roughly similar performance because normally the real limiting factor is the operating system and the network.
If you can give me step-by-step instructions on how to reproduce your setup, I'll be happy to take a look.
I think it would make sense to exclude the HTTP stuff (or at least the parsing aspect) and just return hard-coded responses; that way we could easily benchmark just the TCP aspect (which is what we care about) instead of TCP + HTTP + routing, etc.
I can work on whipping up a simple test script tonight that does this. I also uploaded a pre-merge-uv tag that can be used to switch between libuv and libevent.
To test, all that is required is SBCL (with quicklisp installed), libuv-v1.x/libevent-2.0.2x installed, and cl-async downloaded via git to `~/quicklisp/local-projects/cl-async`.
I'll try to get that benchmark script out sooner rather than later, w/ running instructions.
@bnoordhuis do you have any guidelines/best practices on benchmarking? You've probably done it a bit more than me.
Benchmark script here: https://gist.github.com/orthecreedence/b8a1796aed8023fe8a57
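The gist itself isn't inlined in this thread; a minimal sketch of a server in the same spirit (hard-coded response, no HTTP parsing) might look like the following. One caveat: cl-async's `tcp-server` took `event-cb` as a positional argument in older versions and as a keyword later, so the exact call may need adjusting.

```lisp
(defparameter +response+
  (babel:string-to-octets
   (format nil "HTTP/1.1 200 OK~C~CContent-Length: 5~C~CConnection: keep-alive~C~C~C~Chello"
           #\Return #\Linefeed #\Return #\Linefeed
           #\Return #\Linefeed #\Return #\Linefeed)))

(defun start-bench-server (&key (port 8008))
  ;; assumes each request arrives in a single read, which is fine for wrk
  (as:tcp-server nil port
                 (lambda (socket data)
                   (declare (ignore data))
                   (as:write-socket-data socket +response+))
                 :event-cb (lambda (ev) (declare (ignore ev)))))

(as:start-event-loop (lambda () (start-bench-server)))
```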
Setup:
$ wget http://beta.quicklisp.org/quicklisp.lisp
$ /path/to/sbcl/run-sbcl.sh --load quicklisp.lisp
* (quicklisp-quickstart:install)
* (ql:add-to-init-file)
* (quit)
$ mkdir -p ~/quicklisp/local-projects
$ cd ~/quicklisp/local-projects
$ git clone https://github.com/orthecreedence/cl-async
$ git clone https://github.com/rpav/fast-io
$ cd cl-async/
$ git checkout -b libevent pre-merge-uv # create a libevent branch
$ git checkout master # back to libuv
Running:
/path/to/sbcl/run-sbcl.sh --load /path/to/cl-async-bench.lisp
...
Listening on 8008
Benchmarking (using wrk):
wrk -c 25 -t 1 -d 10 http://127.0.0.1:8008/
My results (libuv):
Running 10s test @ http://127.0.0.1:8008/
1 threads and 25 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 603.62us 482.57us 6.89ms 89.66%
Req/Sec 39.75k 8.69k 59.44k 65.64%
373468 requests in 10.00s, 35.62MB read
Requests/sec: 37355.37
Transfer/sec: 3.56MB
libevent:
Running 10s test @ http://127.0.0.1:8008/
1 threads and 25 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 586.37us 1.07ms 21.85ms 96.11%
Req/Sec 41.60k 9.18k 65.56k 66.17%
391896 requests in 10.00s, 37.37MB read
Requests/sec: 39193.57
Transfer/sec: 3.74MB
Although the spread is much smaller in this test, libevent is still consistently faster.
Just updated the benchmark script to split the HTTP response into two writes, and the differences are now extreme (this can be turned off by passing `:split-writes nil` in the `(start-server ...)` call).
libuv:
Running 10s test @ http://127.0.0.1:8008/
1 threads and 25 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 40.25ms 1.08ms 55.30ms 94.48%
Req/Sec 623.26 22.56 716.00 87.07%
6179 requests in 10.00s, 605.37KB read
Requests/sec: 617.76
Transfer/sec: 60.52KB
libevent:
Running 10s test @ http://127.0.0.1:8008/
1 threads and 25 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 570.89us 743.13us 19.27ms 94.95%
Req/Sec 40.88k 8.95k 64.33k 66.67%
383393 requests in 10.00s, 36.56MB read
Requests/sec: 38342.94
Transfer/sec: 3.66MB
It seems that splitting the HTTP response into two calls instead of one slows things down by about 60x =[. This seems really steep, so I can't discount this being a problem in cl-async.
Hey @orthecreedence, thanks for the instructions! I did some black box testing and I can confirm your findings. Below are my own, tested with sbcl 1.2.4-1.fc21 on x86_64 linux:
Everything so far points to memory management issues, not event loop or I/O latency. cl-libuv seems to consistently use a little more memory - about 145 MB vs 135 MB max - but I don't know how significant that is.
At any rate, once memory pressure is taken out of the equation, I expect cl-libuv to take a slight lead over the libevent backend. As to how easy that is to fix, you probably know that better than I do. :-) Hope that helps!
> cl-libuv seems to consistently use a little more memory - about 145 MB vs 135 MB max - but I don't know how significant that is.
Probably insignificant. I just tried sbcl HEAD and now it's 127 MB vs 135 MB in favor of cl-libuv but everything else is still the same.
I also built a custom sbcl that mmaps memory eagerly with MAP_POPULATE, but that didn't have much impact. I didn't really expect it to, either; sbcl is too advanced for a simple trick like that. I also tried an sbcl build with `--dynamic-space-size=1.5G`, but that only made things worse.
I looked into why cl-libuv sees so much more context switches and most happen inside the epoll_wait() system call. Context switching inside epoll_wait() normally means that the process is blocking for I/O but I'm not sure what could make that happen here. Maybe my memory pressure hypothesis was wrong and this is a latency issue after all.
On my machine the benchmark script gives consistently about 625 requests/s with both `:split-writes nil` and `:split-writes t`. SBCL 1.2.6, current cl-async master.
What actually helps is doing `as:close-socket` on the incoming connection after `as:write-socket-data`:
Running 10s test @ http://127.0.0.1:8009/
1 threads and 25 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 563.81us 274.98us 8.57ms 98.12%
Req/Sec 32.30k 2.94k 35.33k 95.15%
303383 requests in 10.00s, 28.93MB read
Socket errors: connect 0, read 303378, write 0, timeout 0
Requests/sec: 30338.82
Transfer/sec: 2.89MB
So what's actually happening here is the event loop being overloaded with open sockets. I don't know why libevent does better, though, or why you had better results without splitting writes.
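In sketch form (same caveat about `tcp-server`'s argument conventions as in the earlier sketch), the change is simply to close the incoming socket right after the write:

```lisp
;; run inside as:start-event-loop
(as:tcp-server nil 8009
               (lambda (socket data)
                 (declare (ignore data))
                 (as:write-socket-data
                  socket
                  (babel:string-to-octets
                   (format nil "HTTP/1.1 200 OK~C~CContent-Length: 0~C~C~C~C"
                           #\Return #\Linefeed #\Return #\Linefeed
                           #\Return #\Linefeed)))
                 ;; closing here is what keeps the loop from filling up
                 ;; with open sockets
                 (as:close-socket socket))
               :event-cb (lambda (ev) (declare (ignore ev))))
```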
@ivan4th That's curious. With `:split-writes nil`, I get numbers comparable to libevent.
I did figure out why `:split-writes t` performs so atrociously. cl-async doesn't coalesce writes, so the extra write creates an extra TCP round-trip. By default on Linux, that introduces 40 ms of latency with loopback traffic because of the delayed ACK.
This is great feedback! So what I'm thinking is it would make sense overall to do write buffering on TCP sockets. Is this correct?
@orthecreedence If you can find a sweet spot between latency and throughput, then yes, buffering should help a great deal. Alternatively, you could set `uv_tcp_nodelay(handle, 1)` to disable Nagle's algorithm, but that may have an adverse effect when there are many small writes.
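Sketched at the cl-async/cl-libuv level, that suggestion would look something like the following (this assumes cl-libuv exposes an auto-generated `libuv:uv-tcp-nodelay` binding for the C function `uv_tcp_nodelay(handle, 1)`):

```lisp
(defun disable-nagle (socket)
  "Turn off Nagle's algorithm for a cl-async SOCKET, e.g. from a connect callback.
SOCKET-C returns the underlying libuv handle."
  (libuv:uv-tcp-nodelay (as:socket-c socket) 1))
```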
Actually, this is not an issue, just feedback. You can close this whenever you want.
I ran Woo with libuv-backed cl-async and profiled it. The steps were as follows (a rough reconstruction of the commands is sketched at the end of this comment):

1. Loading Woo (note: Woo requires the latest fast-http)
2. Profiling
3. Starting a server (it will listen at http://127.0.0.1:5000/)
4. wrk (send HTTP requests for 10 seconds with wrk)
5. See the report
Here's the report. Is the writing part the current bottleneck?
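The original code blocks for the steps above aren't preserved in this thread; a rough reconstruction (the hello-world app is illustrative, not the exact one used) would be:

```lisp
;; 1. Loading Woo (needs the latest fast-http)
(ql:quickload :woo)

;; 2. Profiling: instrument the relevant packages with SBCL's profiler
(sb-profile:profile "WOO" "WOO.RESPONSE" "CL-ASYNC" "CL-ASYNC-BASE" "LIBUV")

;; 3. Starting a server; Woo listens on http://127.0.0.1:5000/ by default
(woo:run (lambda (env)
           (declare (ignore env))
           '(200 (:content-type "text/plain") ("Hello, World"))))

;; 4. From another shell, send requests for 10 seconds with wrk, e.g.
;;    wrk -d 10 http://127.0.0.1:5000/   (exact options aren't shown here)

;; 5. After stopping the server, print the report
(sb-profile:report)
```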