fukamachi opened this issue 10 years ago
Thanks for testing/setup. I loaded up the SBCL profiler and I'm seeing almost the exact same thing. I went through and made some changes, although mostly unrelated to writing: the `*input-buffer*` handling (before, it was still copying out data byte by byte via `cffi:mem-aref`) and `(deref-pointer-data)`/`(get-callbacks)` are now more efficient.

So far, I'm not really sure why uv_write is so expensive. It's a direct wrap of the underlying C function, nothing funny going on under the hood. Maybe libuv is just slow when writing? When I get a chance I'll see how it compares to `write-to-evbuffer` (the `write-to-uvstream` equivalent from the master branch).
If you get a chance, do a pull on cl-libuv and cl-async and give it another shot. I'm interested if your throughput goes up (and interested to see another profile result).
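For reference, here is a minimal sketch of how a report like the ones in this thread can be collected with SBCL's deterministic profiler. The package names are the ones that appear in the reports; adjust as needed.

```lisp
;; String arguments to sb-profile:profile name whole packages.
(sb-profile:profile "WOO" "WOO.RESPONSE" "CL-ASYNC" "CL-ASYNC-BASE" "LIBUV")

;; ... start the server, drive it with wrk for a while, then stop it ...

(sb-profile:report)      ; prints the seconds/gc/consed/calls table
(sb-profile:unprofile)   ; remove the instrumentation afterwards
```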
> If you get a chance, do a pull on cl-libuv and cl-async and give it another shot. I'm interested if your throughput goes up (and interested to see another profile result).
I just tried with the latest cl-libuv and cl-async, and the throughput went up (Requests/sec: 6599.93 -> 10162.96, about 1.5x better).
Here's the profiling report. (Note: there are 1.5 times as many request/response cycles as in the last report.)
seconds | gc | consed | calls | sec/call | name
-------------------------------------------------------------
1.597 | 0.000 | 144 | 233,616 | 0.000007 | LIBUV:UV-WRITE
0.702 | 0.000 | 208 | 233,616 | 0.000003 | CL-ASYNC::WRITE-TO-UVSTREAM
0.208 | 0.000 | 32,864 | 233,658 | 0.000001 | CL-ASYNC::CHECK-SOCKET-OPEN
0.187 | 0.000 | 65,280 | 233,616 | 0.000001 | CL-ASYNC::TCP-WRITE-CB
0.151 | 0.000 | 17,388,160 | 38,936 | 0.000004 | WOO::HANDLE-REQUEST
0.150 | 0.000 | 0 | 233,616 | 0.000001 | LIBUV:ALLOC-REQ
0.148 | 0.004 | 65,545,440 | 38,952 | 0.000004 | WOO::SETUP-PARSER
0.147 | 0.004 | 15,232,896 | 38,936 | 0.000004 | WOO::PARSE-HOST-HEADER
0.080 | 0.000 | 0 | 233,616 | 0.000000 | LIBUV:FREE-UV-BUF
0.077 | 0.055 | 230,200,256 | 38,936 | 0.000002 | WOO.RESPONSE:WRITE-RESPONSE-HEADERS
0.064 | 0.000 | 48 | 272,561 | 0.000000 | LIBUV:ALLOC-UV-BUF
0.049 | 0.000 | 0 | 233,632 | 0.000000 | CL-ASYNC:SOCKET-C
0.031 | 0.000 | 0 | 233,616 | 0.000000 | CL-ASYNC::SOCKET-CONNECTED
0.031 | 0.000 | 48 | 233,658 | 0.000000 | CL-ASYNC:SOCKET-CLOSED
0.028 | 0.000 | 0 | 38,945 | 0.000001 | CL-ASYNC::TCP-ALLOC-CB
0.026 | 0.000 | 16 | 233,616 | 0.000000 | LIBUV:FREE-REQ
0.023 | 0.000 | 2,749,536 | 42 | 0.000547 | WOO::EVENT-CB
0.017 | 0.000 | 0 | 272,578 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-CATCH-APP-ERRORS
0.017 | 0.013 | 392,096 | 234 | 0.000072 | LIBUV:ERRVAL
0.013 | 0.000 | 0 | 38,945 | 0.000000 | CL-ASYNC::SOCKET-DRAIN-READ-BUFFER
0.009 | 0.000 | 0 | 38,936 | 0.000000 | WOO::HTTP-VERSION-KEYWORD
0.008 | 0.000 | 0 | 38,952 | 0.000000 | (SETF CL-ASYNC:SOCKET-DATA)
0.007 | 0.000 | 1,592,368 | 1 | 0.006992 | CL-ASYNC:TCP-SERVER
0.007 | 0.000 | 326,672 | 42 | 0.000158 | CL-ASYNC::EVENT-HANDLER
0.005 | 0.000 | 837,296 | 16 | 0.000305 | CL-ASYNC::INIT-INCOMING-SOCKET
0.005 | 0.000 | 914,784 | 42 | 0.000116 | CL-ASYNC:CLOSE-SOCKET
0.002 | 0.000 | 583,424 | 1 | 0.001993 | CL-ASYNC:DELAY
0.001 | 0.000 | 0 | 1 | 0.001000 | LIBUV:UV-LOOP-INIT
0.001 | 0.000 | 0 | 2 | 0.000500 | CL-ASYNC::TCP-SERVER-CLOSED
0.001 | 0.000 | 0 | 2 | 0.000500 | CL-ASYNC:EVENT-FREED
0.001 | 0.000 | 0 | 16 | 0.000062 | CL-ASYNC::SOCKET-DIRECTION
0.001 | 0.000 | 0 | 16 | 0.000062 | CL-ASYNC::TCP-SERVER-STREAM
0.001 | 0.000 | 0 | 18 | 0.000055 | LIBUV:UV-CLOSE
0.001 | 0.000 | 0 | 16 | 0.000061 | WOO::CONNECT-CB
0.000 | 0.000 | 0 | 38,936 | 0.000000 | WOO::HANDLE-RESPONSE
0.000 | 0.000 | 32,768 | 38,936 | 0.000000 | WOO::HANDLE-NORMAL-RESPONSE
0.000 | 0.000 | 34,299,264 | 38,936 | 0.000000 | WOO::READ-CB
0.000 | 0.000 | 0 | 1 | 0.000000 | WOO:RUN
0.000 | 0.000 | 0 | 18 | 0.000000 | LIBUV::HANDLE-TO-VAL
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV::HANDLE-FROM-VAL
0.000 | 0.000 | 0 | 2 | 0.000000 | LIBUV:UV-LOOP-CLOSE
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:HANDLE-TYPE
0.000 | 0.000 | 0 | 17 | 0.000000 | LIBUV:UV-READ-STOP
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-IP-4-ADDR
0.000 | 0.000 | 0 | 18 | 0.000000 | LIBUV:FREE-HANDLE
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-TIMER-START
0.000 | 0.000 | 33,136 | 42 | 0.000000 | LIBUV:UV-ERR-NAME
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-STOP
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-WALK
0.000 | 0.000 | 0 | 18 | 0.000000 | LIBUV:ALLOC-HANDLE
0.000 | 0.000 | 0 | 16 | 0.000000 | LIBUV:UV-ACCEPT
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-TIMER-INIT
0.000 | 0.000 | 0 | 16 | 0.000000 | LIBUV:UV-READ-START
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-LISTEN
0.000 | 0.000 | 6,358,960 | 3 | 0.000000 | LIBUV:UV-RUN
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-TCP-BIND
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-LOOP-SIZE
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-TIMER-STOP
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBUV:UV-IS-CLOSING
0.000 | 0.000 | 0 | 17 | 0.000000 | LIBUV:UV-TCP-INIT
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC::PROCESS-EVENT-LOOP-EXIT-CALLBACKS
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC::CHECK-EVENT-UNFREED
0.000 | 0.000 | 0 | 17 | 0.000000 | CL-ASYNC::DO-CLOSE-TCP
0.000 | 0.000 | 0 | 1 | 0.000000 | (SETF CL-ASYNC::TCP-SERVER-CLOSED)
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC::TIMER-CB
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC::LOOP-EXIT-WALK-CB
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC::DO-CLOSE-LOOP
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC::CHECK-EVENT-LOOP-RUNNING
0.000 | 0.000 | 0 | 17 | 0.000000 | CL-ASYNC::TCP-CLOSE-CB
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC::TCP-SERVER-C
0.000 | 0.000 | 32,512 | 38,945 | 0.000000 | CL-ASYNC::TCP-READ-CB
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC::TIMER-CLOSE-CB
0.000 | 0.000 | 0 | 16 | 0.000000 | CL-ASYNC::TCP-ACCEPT-CB
0.000 | 0.000 | 0 | 1 | 0.000000 | (SETF CL-ASYNC:EVENT-FREED)
0.000 | 0.000 | 32,768 | 1 | 0.000000 | CL-ASYNC:START-EVENT-LOOP
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC:ADD-EVENT
0.000 | 0.000 | 0 | 16 | 0.000000 | (SETF CL-ASYNC:SOCKET-CLOSED)
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC:CLOSE-TCP-SERVER
0.000 | 0.000 | 0 | 38,936 | 0.000000 | CL-ASYNC:SOCKET-DATA
0.000 | 0.000 | 352 | 233,616 | 0.000000 | CL-ASYNC:WRITE-SOCKET-DATA
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC:ADD-EVENT-LOOP-EXIT-CALLBACK
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC:EVENT-C
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC:FREE-EVENT
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC:EVENT-FREED-P
0.000 | 0.000 | 0 | 16 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-NUM-CONNECTIONS-IN
0.000 | 0.000 | 0 | 16 | 0.000000 | (SETF CL-ASYNC-BASE:EVENT-BASE-NUM-CONNECTIONS-IN)
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-DEFAULT-EVENT-HANDLER
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-ID
0.000 | 0.000 | 0 | 24 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-C
0.000 | 0.000 | 0 | 3 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-EXIT-FUNCTIONS
0.000 | 0.000 | 0 | 3 | 0.000000 | (SETF CL-ASYNC-BASE:EVENT-BASE-EXIT-FUNCTIONS)
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-DATA-REGISTRY
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-FUNCTION-REGISTRY
-------------------------------------------------------------
3.795 | 0.076 | 376,651,296 | 3,622,026 | | Total
estimated total profiling overhead: 3.89 seconds
overhead estimation parameters:
8.000001e-9s/call, 1.074e-6s total profiling, 4.66e-7s internal profiling
> So far, I'm not really sure why uv_write is so expensive.
Yeah, right. It's weird.
JFYI, here's the report when running with the master branch (libevent2). Although `as:write-to-evbuffer` is at the top of it, it's less expensive than `uv-write`.
seconds | gc | consed | calls | sec/call | name
-------------------------------------------------------------
0.324 | 0.000 | 8,320,544 | 349,674 | 0.000001 | CL-ASYNC::WRITE-TO-EVBUFFER
0.224 | 0.000 | 304 | 58,279 | 0.000004 | CL-ASYNC::TCP-WRITE-CB
0.214 | 0.000 | 99,552,208 | 58,279 | 0.000004 | WOO::HANDLE-REQUEST
0.200 | 0.006 | 4,643,040 | 58,279 | 0.000003 | WOO::PARSE-HOST-HEADER
0.161 | 0.000 | 208 | 407,985 | 0.000000 | CL-ASYNC::CHECK-SOCKET-OPEN
0.142 | 0.061 | 309,842,720 | 58,279 | 0.000002 | WOO.RESPONSE:WRITE-RESPONSE-HEADERS
0.123 | 0.005 | 105,812,432 | 58,295 | 0.000002 | WOO::SETUP-PARSER
0.080 | 0.000 | 0 | 349,674 | 0.000000 | LIBEVENT2:EVBUFFER-ADD
0.063 | 0.000 | 0 | 349,690 | 0.000000 | LIBEVENT2:BUFFEREVENT-ENABLE
0.062 | 0.000 | 0 | 116,558 | 0.000001 | LIBEVENT2:EVBUFFER-REMOVE
0.040 | 0.000 | 0 | 233,352 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-DATA-REGISTRY
0.035 | 0.000 | 0 | 233,290 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-FUNCTION-REGISTRY
0.034 | 0.000 | 1,343,488 | 349,674 | 0.000000 | LIBEVENT2:BUFFEREVENT-GET-OUTPUT
0.026 | 0.000 | 0 | 58,295 | 0.000000 | (SETF CL-ASYNC:SOCKET-DATA)
0.019 | 0.000 | 0 | 407,969 | 0.000000 | CL-ASYNC:SOCKET-C
0.016 | 0.000 | 3,349,088 | 1 | 0.015993 | CL-ASYNC:TCP-SERVER
0.011 | 0.000 | 0 | 116,591 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-CATCH-APP-ERRORS
0.005 | 0.000 | 0 | 58,279 | 0.000000 | WOO::HTTP-VERSION-KEYWORD
0.005 | 0.000 | 717,472 | 16 | 0.000302 | CL-ASYNC::INIT-INCOMING-SOCKET
0.004 | 0.000 | 0 | 407,969 | 0.000000 | CL-ASYNC:SOCKET-CLOSED
0.003 | 0.000 | 0 | 58,279 | 0.000000 | CL-ASYNC::SOCKET-DRAIN-READ-BUFFER
0.003 | 0.000 | 612,400 | 1 | 0.002991 | CL-ASYNC:DELAY
0.002 | 0.000 | 163,616 | 16 | 0.000108 | CL-ASYNC::TCP-EVENT-CB
0.001 | 0.000 | 0 | 58,279 | 0.000000 | LIBEVENT2:BUFFEREVENT-GET-INPUT
0.001 | 0.000 | 0 | 1 | 0.001000 | LIBEVENT2:EVENT-BASE-NEW
0.001 | 0.000 | 0 | 1 | 0.001000 | CL-ASYNC::TCP-SERVER-C
0.001 | 0.000 | 0 | 16 | 0.000062 | LIBEVENT2:BUFFEREVENT-FREE
0.001 | 0.000 | 0 | 16 | 0.000062 | CL-ASYNC::TCP-SERVER-STREAM
0.000 | 0.000 | 0 | 58,279 | 0.000000 | WOO::HANDLE-RESPONSE
0.000 | 0.005 | 32,912 | 58,279 | 0.000000 | WOO::HANDLE-NORMAL-RESPONSE
0.000 | 0.000 | 2,713,488 | 58,279 | 0.000000 | WOO::READ-CB
0.000 | 0.000 | 0 | 16 | 0.000000 | WOO::CONNECT-CB
0.000 | 0.000 | 0 | 16 | 0.000000 | WOO::EVENT-CB
0.000 | 0.000 | 0 | 1 | 0.000000 | WOO:RUN
0.000 | 0.000 | 0 | 16 | 0.000000 | LIBEVENT2:BUFFEREVENT-DISABLE
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBEVENT2:EVENT-ACTIVE
0.000 | 0.000 | 0 | 16 | 0.000000 | LIBEVENT2:BUFFEREVENT-GETFD
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBEVENT2:EVENT-BASE-FREE
0.000 | 0.000 | 0 | 16 | 0.000000 | LIBEVENT2:BUFFEREVENT-SETCB
0.000 | 0.000 | 0 | 16 | 0.000000 | LIBEVENT2:EVUTIL-MAKE-SOCKET-NONBLOCKING
0.000 | 0.000 | 2,052,096 | 1 | 0.000000 | LIBEVENT2:EVENT-BASE-DISPATCH
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBEVENT2:EVENT-FREE
0.000 | 0.000 | 0 | 16 | 0.000000 | LIBEVENT2:BUFFEREVENT-SOCKET-NEW
0.000 | 0.000 | 0 | 16 | 0.000000 | LIBEVENT2:BUFFEREVENT-SET-TIMEOUTS
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBEVENT2:EVENT-NEW
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBEVENT2:EVCONNLISTENER-FREE
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBEVENT2:EVCONNLISTENER-NEW-BIND
0.000 | 0.000 | 0 | 16 | 0.000000 | LIBEVENT2:EVCONNLISTENER-GET-BASE
0.000 | 0.000 | 0 | 1 | 0.000000 | LIBEVENT2:EVCONNLISTENER-SET-ERROR-CB
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC::PROCESS-EVENT-LOOP-EXIT-CALLBACKS
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC::CHECK-EVENT-UNFREED
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC::TCP-SERVER-CLOSED
0.000 | 0.000 | 0 | 1 | 0.000000 | (SETF CL-ASYNC::TCP-SERVER-CLOSED)
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC::TIMER-CB
0.000 | 0.000 | 0 | 32 | 0.000000 | CL-ASYNC::SOCKET-DIRECTION
0.000 | 0.005 | 328,160 | 58,279 | 0.000000 | CL-ASYNC::READ-SOCKET-DATA
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC::EVENT-FREE-CALLBACK
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC::CHECK-EVENT-LOOP-RUNNING
0.000 | 0.000 | 0 | 16 | 0.000000 | CL-ASYNC::TCP-ACCEPT-CB
0.000 | 0.000 | 160 | 58,279 | 0.000000 | CL-ASYNC::TCP-READ-CB
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC:ADD-EVENT
0.000 | 0.000 | 0 | 58,279 | 0.000000 | CL-ASYNC:SOCKET-DATA
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC:EVENT-FREED
0.000 | 0.000 | 0 | 1 | 0.000000 | (SETF CL-ASYNC:EVENT-FREED)
0.000 | 0.000 | 78,096 | 1 | 0.000000 | CL-ASYNC:START-EVENT-LOOP
0.000 | 0.000 | 416 | 349,674 | 0.000000 | CL-ASYNC:WRITE-SOCKET-DATA
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC:CLOSE-TCP-SERVER
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC:FREE-EVENT
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC:EVENT-C
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC:ADD-EVENT-LOOP-EXIT-CALLBACK
0.000 | 0.000 | 0 | 16 | 0.000000 | CL-ASYNC:SET-SOCKET-TIMEOUTS
0.000 | 0.000 | 0 | 16 | 0.000000 | CL-ASYNC:CLOSE-SOCKET
0.000 | 0.000 | 0 | 16 | 0.000000 | (SETF CL-ASYNC:SOCKET-CLOSED)
0.000 | 0.000 | 0 | 32 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-NUM-CONNECTIONS-IN
0.000 | 0.000 | 0 | 32 | 0.000000 | (SETF CL-ASYNC-BASE:EVENT-BASE-NUM-CONNECTIONS-IN)
0.000 | 0.000 | 0 | 1 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-DEFAULT-EVENT-HANDLER
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-ID
0.000 | 0.000 | 0 | 8 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-C
0.000 | 0.000 | 0 | 2 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-EXIT-FUNCTIONS
0.000 | 0.000 | 0 | 2 | 0.000000 | (SETF CL-ASYNC-BASE:EVENT-BASE-EXIT-FUNCTIONS)
0.000 | 0.000 | 0 | 1 | 0.000000 | (SETF CL-ASYNC-BASE:EVENT-BASE-FUNCTION-REGISTRY)
0.000 | 0.000 | 0 | 1 | 0.000000 | (SETF CL-ASYNC-BASE:EVENT-BASE-DATA-REGISTRY)
-------------------------------------------------------------
1.800 | 0.082 | 539,562,848 | 4,546,734 | | Total
I almost wonder if the problem could be worked around by buffering data in Lisp to cut down on the number of calls to uv_write: the data would be stored inside the socket object and sent either a) on the next pass of the event loop or b) once the socket's output buffer is full. This way, numerous calls to `write-to-uvstream` in the same pass of the loop would result in at most one (or a few, if a lot of data is sent) calls to uv_write. However, uv_write might just end up being linearly as slow with the amount of data being pumped into it, in which case there's not a lot we can do about it. Thoughts?
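Roughly what I mean, as a sketch (hypothetical names, not cl-async's actual internals): queue outgoing data per socket and flush it with a single write call, either when the buffer fills or once per pass of the event loop.

```lisp
(defconstant +out-buffer-size+ 65536)

(defstruct wsocket
  uvstream                                         ; underlying uv stream handle
  (out-buffer (make-array +out-buffer-size+
                          :element-type '(unsigned-byte 8)))
  (out-fill 0))

(defun queue-write (socket data write-fn)
  "Queue DATA on SOCKET; WRITE-FN is whatever ends up calling uv_write."
  (let ((len (length data)))
    ;; flush first if this chunk wouldn't fit
    (when (> (+ (wsocket-out-fill socket) len) +out-buffer-size+)
      (flush-writes socket write-fn))
    (if (> len +out-buffer-size+)
        ;; oversized chunk: send it directly rather than buffering
        (funcall write-fn (wsocket-uvstream socket) data 0 len)
        (progn
          (replace (wsocket-out-buffer socket) data
                   :start1 (wsocket-out-fill socket))
          (incf (wsocket-out-fill socket) len)))))

(defun flush-writes (socket write-fn)
  "Send everything queued so far in one call (e.g. once per loop pass)."
  (when (plusp (wsocket-out-fill socket))
    (funcall write-fn (wsocket-uvstream socket)
             (wsocket-out-buffer socket) 0 (wsocket-out-fill socket))
    (setf (wsocket-out-fill socket) 0)))
```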
Woo calls `as:write-socket-data` for each header line. That might be a problem.
I tried changing it to reduce the number of calls by first building a byte vector of the whole headers, and interestingly, the score went up (10162.96 -> 13898.52 req/sec).
Since that strategy also works with the libevent2-backed cl-async, and the libevent2 version still records a better score (16417.39 req/sec), it's not a bottleneck specific to the uv branch, though.
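The later profile shows a `WOO.RESPONSE:RESPONSE-HEADERS-BYTES` function; a simplified sketch of the idea (not Woo's actual code, and assuming babel for the string-to-octets conversion) is to render the whole header block into one octet vector and write it once:

```lisp
(defun headers->octets (status headers)
  "Render a status line plus HEADERS (an alist) as a single octet vector."
  (let ((crlf (format nil "~C~C" #\Return #\Linefeed)))
    (babel:string-to-octets
     (with-output-to-string (s)
       (format s "HTTP/1.1 ~A~A" status crlf)
       (loop for (name . value) in headers
             do (format s "~A: ~A~A" name value crlf))
       (write-string crlf s)))))

;; one write for the whole header block, instead of one write per line:
;; (as:write-socket-data socket (headers->octets "200 OK" '(("Content-Length" . "5"))))
```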
Odd that the libevent2 branch is faster, especially because it doesn't have the static-vectors optimization, which I assumed would speed up reads/writes significantly. The only thing I can think is...libevent is a lot faster than libuv. I'm having trouble finding low-level benchmarks comparing the two. I always just assumed libuv was faster, but it might not be.
I wonder: if the static-vectors changes were done on master, would a Wookie/Woo app be able to beat the performance of Node? I can try this tonight. I'm not married to using libuv, especially because the lack of built-in SSL support is going to cause me some late nights over the next week or so. However, the libevent SSL implementation is lacking and may need to be redone by hand anyway, and libuv is much more actively maintained.
Either way, if a compiled language wrapping a C library can't beat a VM language wrapping a C library, then I must have done something wrong along the way. Maybe it makes sense to stick with libuv and focus on beating node instead of beating the libevent branch.
The write buffering looks like it should help a bit from your testing, so I'll give that a try tonight as well if I get a chance.
Thanks for all the testing and benchmarking you've done!
> The only thing I can think is...libevent is a lot faster than libuv.

I never thought so, but if that's the case, the profiling result makes sense.

> I wonder: if the static-vectors changes were done on master, would a Wookie/Woo app be able to beat the performance of Node?

Definitely.
Actually, Woo outperforms Node a little after reducing the calls to `as:write-socket-data`. I'm sure Wookie can also beat it, since the major differences between Woo and Wookie are only the libraries they depend on (fast-http vs. http-parse and quri vs. puri).
This is good news. I'll write that buffering code for `write-to-uvstream` to reduce calls to `uv_write`, and hopefully that will speed things up a good amount. I noticed some really nasty data overlap bugs popping up when using Wookie on the new UV branch when serving some real app data. So I need to figure that stuff out before the libuv branch can be used for anything other than benchmarks (it's probably a one-line fix I need to make somewhere, but might take a few hours to track down).
Also, do you know when fast-http will have multipart parsing? Once that gets in, I'll happily switch Wookie over to fast-http/quri.
Oh, I can't wait to try it!
> Also, do you know when fast-http will have multipart parsing? Once that gets in, I'll happily switch Wookie over to fast-http/quri.

The multipart parser is already a part of fast-http.
Although the `:multipart-callback` option of `make-parser` has been removed, you can do the same thing by using `:body-callback` and `make-multipart-parser`. I think it's cleaner.
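A rough, untested sketch of that combination (the callback lambda-lists here are assumptions and may differ between fast-http versions):

```lisp
(let* ((http (fast-http:make-http-request))
       ;; the multipart parser is fed raw body chunks and invokes its
       ;; callback once per form field
       (multipart (fast-http:make-multipart-parser
                   "multipart/form-data; boundary=AaB03x"
                   (lambda (field-name headers field-meta body)
                     (declare (ignore headers field-meta))
                     (format t "~&got field ~A (~A bytes)~%"
                             field-name (length body)))))
       (parser (fast-http:make-parser
                http
                :body-callback (lambda (data start end)
                                 (funcall multipart (subseq data start end))))))
  ;; REQUEST-OCTETS stands in for the raw bytes read off the socket
  (funcall parser request-octets))
```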
Another major difference is the lack of the `:store-body` and `:force-stream` options, because fast-http always forces streaming.
I can't recommend using `:store-body T` naively without thinking seriously about large-POST attacks.
Since I currently see no good way to protect against such attacks, I leave storing the body to application users.
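For what it's worth, application-side storage with a size cap could look roughly like this (illustrative only; the body-callback lambda-list is an assumption):

```lisp
(defparameter *max-body-size* (* 1024 1024))   ; reject bodies over 1 MB

(defun make-capped-body-collector ()
  "Return two closures: a chunk collector (for a :body-callback) and a finisher."
  (let ((chunks '())
        (total 0))
    (values
     (lambda (data start end)
       (incf total (- end start))
       (when (> total *max-body-size*)
         (error "Request body too large"))
       (push (subseq data start end) chunks))
     (lambda ()
       (apply #'concatenate '(vector (unsigned-byte 8)) (nreverse chunks))))))
```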
Oh good, I didn't know the multipart parser was done!
> Another major difference is the lack of the `:store-body` and `:force-stream` options, because fast-http always forces streaming. I can't recommend using `:store-body T` naively without thinking seriously about large-POST attacks.
Agreed, which is why this was the default in http-parse (to not store the body unless explicitly asked for), but I also think removing it completely and forcing the app to deal with storage is a good move. I will probably have to re-wire some of the POST stuff in Wookie, but that's not really a big issue. I agree with forcing streaming as well.
BTW just implemented the socket write buffering, and also replaced some other inefficient buffering code as well. I'm currently working on the issue where multiple requests coming in at once get the responses mixed up. I'm working on creating a reproducible test case and seeing if I can find the problem.
So you can use the current changes for benchmarking, but you may run into weird request/response data overlap if you use the uv branch for anything else.
Data corruption issues are fixed. Turns out it wasn't cross-socket pollution, it was a much simpler problem where responses > 64K were getting the same data sent for each chunk (as opposed to the chunks progressing over the length of the full write).
An odd thing happened.
It seems that after I changed Woo to reduce the calls to `as:write-socket-data`, the `uv` branch's score was almost the same as `master`'s.
|              | uv       | master   |
|--------------|----------|----------|
| Requests/sec | 16294.04 | 16231.17 |
| Transfer/sec | 2.05MB   | 2.04MB   |
I never noticed that.
Then I pulled the latest cl-async and benchmarked again, and, interestingly, the score went down.
|              | uv       | master   |
|--------------|----------|----------|
| Requests/sec | 12405.08 | 16231.17 |
| Transfer/sec | 1.56MB   | 2.04MB   |
It seems the recent changes don't suit the current Woo strategy.
I also benchmarked with a slightly older Woo (which calls `as:write-socket-data` naively).
|              | uv       | master   |
|--------------|----------|----------|
| Requests/sec | 10365.12 | 15315.03 |
| Transfer/sec | 1.13MB   | 1.67MB   |

|              | uv       | master   |
|--------------|----------|----------|
| Requests/sec | 11629.93 | 15315.03 |
| Transfer/sec | 1.26MB   | 1.67MB   |
Wow, the recent changes work on this Woo.
I changed Woo a little to call `as:write-socket-data` twice, for the headers and the body separately (not once), and then the score went back to almost the same as before. Interesting.
No, it didn't... sorry. I fixed the above benchmark results a little.
I'm still trying things, but e7df0e44944d0c98a555509ccb6fab4d9f653b76 is probably better than the latest uv branch.
Interesting, I thought buffering would have helped, but maybe it's best to let libuv handle it. I have no problem reverting those changes; however, it almost seems there's an optimal amount of buffering somewhere between buffering all calls in one loop pass and calling `uv_write` directly for everything. I wonder if there's a sweet spot here. This seems like a pretty aggressive optimization though, so for now I'll revert to the non-buffering `write-socket-data` and we can revisit later if we need to.
I also wonder if replacing flexi-streams with fast-io will increase buffering speed to the point of making a significant difference. I wanted to try this last night but ran out of time. I will see if I can find time tonight.
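For context, the fast-io style of buffering looks roughly like this (generic usage of fast-io's exported API, not cl-async's internal code):

```lisp
(ql:quickload '(:fast-io :babel))

(defun build-response (header-lines body-octets)
  "Accumulate output in a fast-io buffer and return one octet vector."
  (fast-io:with-fast-output (buf)
    (dolist (line header-lines)
      (fast-io:fast-write-sequence (babel:string-to-octets line) buf))
    (fast-io:fast-write-sequence body-octets buf)))
```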
A few changes. I made write buffering switchable via `*buffer-writes*` (`nil` by default). This should allow easy benchmarking.
I replaced all buffering in sockets/streams with fast-io (make sure you pull fast-io from master for the new exported functions). I hope this will speed things up a bit more.
I'm also exporting `*buffer-size*`, which allows setting the size of the input and output buffers (which are static vectors). However, I noticed that when it's set above the recommended 65535 bytes, calls to uv_write seem to have really odd data crossover between sockets. This is tested for in the bundled `no-overlap` test. I spent a while trying to figure out what's going on, and my conclusion is that it's a libuv bug: the data being passed to uv_write seems to not match up with the data received on the socket. This has only been tested with `*buffer-size* == (* 1024 128)` using local connections in the same event loop. For now, I'd leave `*buffer-size*` as is.
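Assuming the two switches are exported from the `cl-async` package as described above, an A/B benchmark run would just toggle them before starting the loop, something like:

```lisp
;; both names are taken from the comment above; defaults are NIL and 65535
(setf cl-async:*buffer-writes* t)         ; turn write buffering on for this run
;; (setf cl-async:*buffer-size* 65535)    ; size of the static input/output buffers

(as:start-event-loop
 (lambda ()
   ;; ... start the server being benchmarked ...
   ))
```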
I benchmarked again with the latest cl-async and fast-io. Here's the report:
|              | Before   | After    |
|--------------|----------|----------|
| Requests/sec | 12293.87 | 16029.19 |
| Transfer/sec | 1.55MB   | 2.02MB   |
Hey, it's much better than before!
Here's the latest profiling report:
seconds | gc | consed | calls | sec/call | name
-------------------------------------------------------------
1.909 | 0.047 | 436,468,864 | 99,788 | 0.000019 | WOO.RESPONSE:RESPONSE-HEADERS-BYTES
0.793 | 0.000 | 64 | 99,788 | 0.000008 | LIBUV:UV-WRITE
0.563 | 0.012 | 43,499,152 | 99,788 | 0.000006 | WOO::HANDLE-REQUEST
0.403 | 0.006 | 49,854,528 | 99,788 | 0.000004 | WOO::PARSE-HOST-HEADER
0.318 | 0.000 | 0 | 99,788 | 0.000003 | CL-ASYNC::WRITE-TO-UVSTREAM
0.127 | 0.000 | 288 | 99,788 | 0.000001 | CL-ASYNC::TCP-WRITE-CB
0.081 | 0.000 | 0 | 99,788 | 0.000001 | LIBUV:ALLOC-REQ
0.065 | 0.000 | 0 | 99,788 | 0.000001 | LIBUV:FREE-UV-BUF
0.063 | 0.000 | 0 | 99,804 | 0.000001 | CL-ASYNC::TCP-ALLOC-CB
0.045 | 0.000 | 0 | 99,804 | 0.000000 | CL-ASYNC::CHECK-SOCKET-OPEN
0.034 | 0.000 | 0 | 199,592 | 0.000000 | LIBUV:ALLOC-UV-BUF
0.032 | 0.000 | 0 | 199,609 | 0.000000 | CL-ASYNC-BASE:EVENT-BASE-CATCH-APP-ERRORS
0.030 | 0.000 | 0 | 99,804 | 0.000000 | CL-ASYNC::SOCKET-DRAIN-READ-BUFFER
0.028 | 0.000 | 0 | 99,788 | 0.000000 | LIBUV:FREE-REQ
0.026 | 0.015 | 1,599,936 | 1 | 0.025992 | CL-ASYNC:TCP-SERVER
0.026 | 0.000 | 0 | 99,788 | 0.000000 | WOO::HTTP-VERSION-KEYWORD
Most of the slow, heavily-consing parts belong to Woo now. Good job!
The current Woo implementation creates a response octet vector every time and throws it at `as:write-socket-data`.
I can probably avoid the data copying by writing into cl-async's `*output-buffer*` directly.
This is great news! I'm glad we got the numbers looking this good.
Was this latest benchmark with `*buffer-writes*` set to `t` or `nil`? I'm curious to see if buffering the writes with fast-io was more or less performant than passing the writes directly through to uv_write.
Oh, I actually never pushed the `*buffer-writes*` stuff. Oops. Should be up there now.
@orthecreedence @fukamachi Did either of you run a side-by-side strace or perf? I would expect libevent and libuv to have roughly similar performance because normally the real limiting factor is the operating system and the network.
If you can give me step-by-step instructions on how to reproduce your setup, I'll be happy to take a look.
I think it would make sense to exclude the HTTP stuff (or at least the parsing aspect) and just return hard-coded responses; that way we could easily benchmark just the TCP aspect (which is what we care about) instead of TCP + HTTP + routing, etc.
I can work on whipping up a simple test script tonight that does this. I also uploaded a pre-merge-uv tag that can be used to switch between libuv and libevent.
To test, all that is required is SBCL (with quicklisp installed), libuv-v1.x/libevent-2.0.2x installed, and cl-async downloaded via git to `~/quicklisp/local-projects/cl-async`.
I'll try to get that benchmark script out sooner rather than later, w/ running instructions.
@bnoordhuis do you have any guidelines/best practices on benchmarking? You've probably done it a bit more than me.
Benchmark script here: https://gist.github.com/orthecreedence/b8a1796aed8023fe8a57
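The gist itself isn't inlined in this thread; a minimal sketch of a server in the same spirit (hard-coded response, no HTTP parsing) might look like the following. One caveat: cl-async's `tcp-server` took `event-cb` as a positional argument in older versions and as a keyword later, so the exact call may need adjusting.

```lisp
(defparameter +response+
  (babel:string-to-octets
   (format nil "HTTP/1.1 200 OK~C~CContent-Length: 5~C~CConnection: keep-alive~C~C~C~Chello"
           #\Return #\Linefeed #\Return #\Linefeed
           #\Return #\Linefeed #\Return #\Linefeed)))

(defun start-bench-server (&key (port 8008))
  ;; assumes each request arrives in a single read, which is fine for wrk
  (as:tcp-server nil port
                 (lambda (socket data)
                   (declare (ignore data))
                   (as:write-socket-data socket +response+))
                 :event-cb (lambda (ev) (declare (ignore ev)))))

(as:start-event-loop (lambda () (start-bench-server)))
```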
Setup:
$ wget http://beta.quicklisp.org/quicklisp.lisp
$ /path/to/sbcl/run-sbcl.sh --load quicklisp.lisp
* (quicklisp-quickstart:install)
* (ql:add-to-init-file)
* (quit)
$ mkdir -p ~/quicklisp/local-projects
$ cd ~/quicklisp/local-projects
$ git clone https://github.com/orthecreedence/cl-async
$ git clone https://github.com/rpav/fast-io
$ cd cl-async/
$ git checkout -b libevent pre-merge-uv # create a libevent branch
$ git checkout master # back to libuv
Running:
/path/to/sbcl/run-sbcl.sh --load /path/to/cl-async-bench.lisp
...
Listening on 8008
Benchmarking (using wrk):
wrk -c 25 -t 1 -d 10 http://127.0.0.1:8008/
My results (libuv):
Running 10s test @ http://127.0.0.1:8008/
1 threads and 25 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 603.62us 482.57us 6.89ms 89.66%
Req/Sec 39.75k 8.69k 59.44k 65.64%
373468 requests in 10.00s, 35.62MB read
Requests/sec: 37355.37
Transfer/sec: 3.56MB
libevent:
Running 10s test @ http://127.0.0.1:8008/
1 threads and 25 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 586.37us 1.07ms 21.85ms 96.11%
Req/Sec 41.60k 9.18k 65.56k 66.17%
391896 requests in 10.00s, 37.37MB read
Requests/sec: 39193.57
Transfer/sec: 3.74MB
Although the spread is much smaller in this test, libevent is still consistently faster.
Just updated the benchmark script to split the HTTP response into two writes, and the differences are now extreme (this can be turned off by passing `:split-writes nil` in the `(start-server ...)` call).
libuv:
Running 10s test @ http://127.0.0.1:8008/
1 threads and 25 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 40.25ms 1.08ms 55.30ms 94.48%
Req/Sec 623.26 22.56 716.00 87.07%
6179 requests in 10.00s, 605.37KB read
Requests/sec: 617.76
Transfer/sec: 60.52KB
libevent:
Running 10s test @ http://127.0.0.1:8008/
1 threads and 25 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 570.89us 743.13us 19.27ms 94.95%
Req/Sec 40.88k 8.95k 64.33k 66.67%
383393 requests in 10.00s, 36.56MB read
Requests/sec: 38342.94
Transfer/sec: 3.66MB
It seems that splitting the HTTP response into two calls instead of one slows things down by about 60x =[. This seems really steep, so I can't discount this being a problem in cl-async.
Hey @orthecreedence, thanks for the instructions! I did some black box testing and I can confirm your findings. Below are my own, tested with sbcl 1.2.4-1.fc21 on x86_64 linux:
Everything so far points to memory management issues, not event loop or I/O latency. cl-libuv seems to consistently use a little more memory - about 145 MB vs 135 MB max - but I don't know how significant that is.
At any rate, once memory pressure is taken out of the equation, I expect cl-libuv to take a slight lead over the libevent backend. As to how easy that is to fix, you probably know that better than I do. :-) Hope that helps!
> cl-libuv seems to consistently use a little more memory - about 145 MB vs 135 MB max - but I don't know how significant that is.
Probably insignificant. I just tried sbcl HEAD and now it's 127 MB vs 135 MB in favor of cl-libuv but everything else is still the same.
I also built a custom sbcl that mmaps memory eagerly with MAP_POPULATE, but that didn't have much impact. I didn't really expect it to, either; sbcl is too advanced for a simple trick like that. I also tried an sbcl build with `--dynamic-space-size=1.5G`, but that only made things worse.
I looked into why cl-libuv sees so much more context switches and most happen inside the epoll_wait() system call. Context switching inside epoll_wait() normally means that the process is blocking for I/O but I'm not sure what could make that happen here. Maybe my memory pressure hypothesis was wrong and this is a latency issue after all.
On my machine the benchmark script gives consistently about 625 requests/s with both `:split-writes nil` and `:split-writes t`. SBCL 1.2.6, current cl-async master.
What actually helps is doing `as:close-socket` on the incoming connection after `as:write-socket-data`:
Running 10s test @ http://127.0.0.1:8009/
1 threads and 25 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 563.81us 274.98us 8.57ms 98.12%
Req/Sec 32.30k 2.94k 35.33k 95.15%
303383 requests in 10.00s, 28.93MB read
Socket errors: connect 0, read 303378, write 0, timeout 0
Requests/sec: 30338.82
Transfer/sec: 2.89MB
So what's actually happening here is the event loop being overloaded with open sockets. I don't know why libevent does better, though, or why you had better results without splitting writes.
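In sketch form (same caveat about `tcp-server`'s argument conventions as in the earlier sketch), the change is simply to close the incoming socket right after the write:

```lisp
;; run inside as:start-event-loop
(as:tcp-server nil 8009
               (lambda (socket data)
                 (declare (ignore data))
                 (as:write-socket-data
                  socket
                  (babel:string-to-octets
                   (format nil "HTTP/1.1 200 OK~C~CContent-Length: 0~C~C~C~C"
                           #\Return #\Linefeed #\Return #\Linefeed
                           #\Return #\Linefeed)))
                 ;; closing here is what keeps the loop from filling up
                 ;; with open sockets
                 (as:close-socket socket))
               :event-cb (lambda (ev) (declare (ignore ev))))
```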
@ivan4th That's curious. With `:split-writes nil`, I get numbers comparable to libevent.
I did figure out why `:split-writes t` performs so atrociously. cl-async doesn't coalesce writes, so the extra write creates an extra TCP round-trip. By default on Linux, that introduces 40 ms of latency with loopback traffic because of the delayed ACK.
This is great feedback! So what I'm thinking is it would make sense overall to do write buffering on TCP sockets. Is this correct?
@orthecreedence If you can find a sweet spot between latency and throughput, then yes, buffering should help a great deal. Alternatively, you could set `uv_tcp_nodelay(handle, 1)` to disable Nagle's algorithm, but that may have an adverse effect when there are many small writes.
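Sketched at the cl-async/cl-libuv level, that suggestion would look something like the following (this assumes cl-libuv exposes an auto-generated `libuv:uv-tcp-nodelay` binding for the C function `uv_tcp_nodelay(handle, 1)`):

```lisp
(defun disable-nagle (socket)
  "Turn off Nagle's algorithm for a cl-async SOCKET, e.g. from a connect callback.
SOCKET-C returns the underlying libuv handle."
  (libuv:uv-tcp-nodelay (as:socket-c socket) 1))
```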
Actually, this is not an issue, just feedback. You can close this whenever you want.
I ran Woo with libuv-backed cl-async and profiled it. The steps were as follows (a rough reconstruction of the commands is sketched at the end of this comment):

1. Loading Woo (note: Woo requires the latest fast-http)
2. Profiling
3. Starting a server (it will listen at http://127.0.0.1:5000/)
4. wrk (send HTTP requests for 10 seconds with wrk)
5. See the report
Here's the report. Is the writing part the current bottleneck?
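The original code blocks for the steps above aren't preserved in this thread; a rough reconstruction (the hello-world app is illustrative, not the exact one used) would be:

```lisp
;; 1. Loading Woo (needs the latest fast-http)
(ql:quickload :woo)

;; 2. Profiling: instrument the relevant packages with SBCL's profiler
(sb-profile:profile "WOO" "WOO.RESPONSE" "CL-ASYNC" "CL-ASYNC-BASE" "LIBUV")

;; 3. Starting a server; Woo listens on http://127.0.0.1:5000/ by default
(woo:run (lambda (env)
           (declare (ignore env))
           '(200 (:content-type "text/plain") ("Hello, World"))))

;; 4. From another shell, send requests for 10 seconds with wrk, e.g.
;;    wrk -d 10 http://127.0.0.1:5000/   (exact options aren't shown here)

;; 5. After stopping the server, print the report
(sb-profile:report)
```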