Latency benchmarking cohttp shows queueing

Hi,

So I'm in the process of doing latency-benchmarks for a couple of different web server frameworks. One of the frameworks is cohttp. I have the following code for the test:

(* This file is in the public domain *)

open Core.Std
open Async.Std
open Cohttp_async

(* given filename: hello_world.ml compile with:
   $ corebuild hello_world.native -pkg cohttp.async
*)

let handler ~body:_ _sock req =
  let uri = Cohttp.Request.uri req in
  match Uri.path uri with
  | "/" -> Server.respond_with_string "CHAPTER I. Down the Rabbit-Hole  Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, <and what is the use of a book,> thought Alice <without pictures or conversations?> So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, <Oh dear! Oh dear! I shall be late!> (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge. In another moment down went Alice after it, never once considering how in the world she was to get out again. The rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that Alice had not a moment to think about stopping herself before she found herself falling down a very deep well. Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to wonder what was going to happen next. First, she tried to look down and make out what she was coming to, but it was too dark to see anything; then she looked at the sides of the well, and noticed that they were filled with cupboards......"
  | _ ->
    Server.respond_with_string ~code:`Not_found "Route not found"

let start_server port () =
  eprintf "Listening for HTTP on port %d\n" port;
  eprintf "Try 'curl http://localhost:%d/'\n%!" port;
  Cohttp_async.Server.create ~on_handler_error:`Raise
    (Tcp.on_port port) handler
  >>= fun _ -> Deferred.never ()

let () =
  Command.async_basic
    ~summary:"Start a hello world Async server"
    Command.Spec.(empty +>
      flag "-p" (optional_with_default 8080 int)
        ~doc:"int Source port to listen on"
    ) start_server

  |> Command.run

and it is built with:

#!/bin/sh

corebuild main.native -pkg cohttp.async

echo "Starting server"
ASYNC_CONFIG='((epoll_max_ready_events 10240) (max_num_open_file_descrs 32000) (max_num_threads 50) (file_descr_watcher Epoll) (max_num_jobs_per_priority_per_cycle 500))' ./main.native

There are 2 servers, one load-generating, and one running the cohttp server, over a 1gigabit line (iperf shows 953mbit throughput), with no latency whatsoever (There is a switch between, but that is all).

The servers are linux: Linux lady-of-pain 3.19.3-3-ARCH #1 SMP PREEMPT Wed Apr 8 14:10:00 CEST 2015 x86_64 GNU/Linux

fairly recent Core i7's, no virtualization since it just slows you down.

Some sysctl.conf tuning were necessary:

fs.file-max = 65535
fs.nr_open = 65535
net.core.netdev_max_backlog = 4096
net.core.rmem_max = 16777216
net.core.somaxconn = 65535
net.core.wmem_max = 16777216
net.ipv4.ip_forward = 0
net.ipv4.ip_local_port_range = 1025       65535
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_max_syn_backlog = 20480
net.ipv4.tcp_max_tw_buckets = 400000
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
vm.min_free_kbytes = 65536
vm.overcommit_memory = 1

Async is the recent point-release and I'm forcing Epoll support because the test just fails when it runs in a select(2) loop.

The load generator is wrk2 (github.com/giltene/wrk2) which avoids coordinated omission. Most load generators coordinate: if a connection stalls towards the system-under-test (SUT), then no further requests are issued on that connection until the first one completes. You get one "bad" latency number. In wrk2, the rate is kept stable: at an interval, new requests are planned to be issued on the stalling connection, so one can get more realistic latencies from stalled connections.

We run the following:

#!/bin/sh

exec wrk2 \
    -t8 -c10000 -d60S \
    --timeout 2000 \
    -R 30000 --latency \
    -H 'Connection: keep-alive' \
    http://172.16.0.1:$1/

That is, 10k connections, and a rate of 30k req/s which means 3 req/s per connection. Such a run yields the following latency histogram:

Running 1m test @ http://172.16.0.1:8080/
  8 threads and 10000 connections
  Thread calibration: mean lat.: 1359.711ms, rate sampling interval: 4898ms
  Thread calibration: mean lat.: 1310.689ms, rate sampling interval: 4870ms
  Thread calibration: mean lat.: 1312.983ms, rate sampling interval: 4866ms
  Thread calibration: mean lat.: 1378.042ms, rate sampling interval: 4874ms
  Thread calibration: mean lat.: 1371.283ms, rate sampling interval: 4898ms
  Thread calibration: mean lat.: 1286.566ms, rate sampling interval: 4861ms
  Thread calibration: mean lat.: 1276.854ms, rate sampling interval: 4861ms
  Thread calibration: mean lat.: 1302.831ms, rate sampling interval: 4870ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.88s     5.44s    0.95m    84.10%
    Req/Sec     1.59k   151.82     1.84k    71.88%
  Latency Distribution (HdrHistogram - Recorded Latency)
 50.000%    8.20s 
 75.000%   10.35s 
 90.000%   11.79s 
 99.000%   31.88s 
 99.900%    0.90m 
 99.990%    0.94m 
 99.999%    0.95m 
100.000%    0.95m 

  Detailed Percentile spectrum:
       Value   Percentile   TotalCount 1/(1-Percentile)

    2670.591     0.000000            3         1.00
    3868.671     0.100000        55526         1.11
    5189.631     0.200000       111201         1.25
    6111.231     0.300000       166684         1.43
    7385.087     0.400000       222341         1.67
    8204.287     0.500000       277984         2.00
    8716.287     0.550000       305353         2.22
    9109.503     0.600000       333100         2.50
    9510.911     0.650000       361000         2.86
    9953.279     0.700000       389927         3.33
   10346.495     0.750000       416615         4.00
   10551.295     0.775000       430281         4.44
   10764.287     0.800000       444321         5.00
   10977.279     0.825000       458356         5.71
   11206.655     0.850000       472177         6.67
   11517.951     0.875000       485909         8.00
   11649.023     0.887500       492901         8.89
   11788.287     0.900000       500109        10.00
   12017.663     0.912500       506582        11.43
   14565.375     0.925000       513515        13.33
   16875.519     0.937500       520495        16.00
   18087.935     0.943750       523937        17.78
   19480.575     0.950000       527398        20.00
   20889.599     0.956250       530869        22.86
   22167.551     0.962500       534331        26.67
   23658.495     0.968750       537824        32.00
   24870.911     0.971875       539535        35.56
   26443.775     0.975000       541272        40.00
   27672.575     0.978125       543007        45.71
   28803.071     0.981250       544740        53.33
   30031.871     0.984375       546473        64.00
   30605.311     0.985938       547358        71.11
   31113.215     0.987500       548217        80.00
   31604.735     0.989062       549099        91.43
   32063.487     0.990625       549946       106.67
   32620.543     0.992188       550832       128.00
   32980.991     0.992969       551254       142.22
   33439.743     0.993750       551684       160.00
   34078.719     0.994531       552126       182.86
   34668.543     0.995313       552554       213.33
   35422.207     0.996094       552984       256.00
   35979.263     0.996484       553196       284.44
   36863.999     0.996875       553416       320.00
   37552.127     0.997266       553636       365.71
   38338.559     0.997656       553851       426.67
   39518.207     0.998047       554065       512.00
   52789.247     0.998242       554171       568.89
   53313.535     0.998437       554300       640.00
   53477.375     0.998633       554390       731.43
   53706.751     0.998828       554505       853.33
   53936.127     0.999023       554604      1024.00
   54099.967     0.999121       554666      1137.78
   54231.039     0.999219       554719      1280.00
   54394.879     0.999316       554779      1462.86
   54525.951     0.999414       554829      1706.67
   54657.023     0.999512       554878      2048.00
   54755.327     0.999561       554912      2275.56
   54820.863     0.999609       554934      2560.00
   54919.167     0.999658       554961      2925.71
   55017.471     0.999707       554986      3413.33
   55148.543     0.999756       555013      4096.00
   55214.079     0.999780       555025      4551.11
   55312.383     0.999805       555041      5120.00
   55410.687     0.999829       555052      5851.43
   55574.527     0.999854       555066      6826.67
   56131.583     0.999878       555080      8192.00
   56229.887     0.999890       555087      9102.22
   56295.423     0.999902       555092     10240.00
   56393.727     0.999915       555102     11702.86
   56459.263     0.999927       555108     13653.33
   56557.567     0.999939       555117     16384.00
   56557.567     0.999945       555117     18204.44
   56590.335     0.999951       555121     20480.00
   56623.103     0.999957       555124     23405.71
   56688.639     0.999963       555128     27306.67
   56721.407     0.999969       555132     32768.00
   56721.407     0.999973       555132     36408.89
   56754.175     0.999976       555134     40960.00
   56786.943     0.999979       555137     46811.43
   56786.943     0.999982       555137     54613.33
   56819.711     0.999985       555138     65536.00
   56852.479     0.999986       555141     72817.78
   56852.479     0.999988       555141     81920.00
   56852.479     0.999989       555141     93622.86
   56852.479     0.999991       555141    109226.67
   56885.247     0.999992       555143    131072.00
   56885.247     0.999993       555143    145635.56
   56885.247     0.999994       555143    163840.00
   56918.015     0.999995       555144    187245.71
   56918.015     0.999995       555144    218453.33
   56918.015     0.999996       555144    262144.00
   56950.783     0.999997       555145    291271.11
   56950.783     0.999997       555145    327680.00
   56950.783     0.999997       555145    374491.43
   56950.783     0.999998       555145    436906.67
   56950.783     0.999998       555145    524288.00
   56983.551     0.999998       555146    582542.22
   56983.551     1.000000       555146          inf
#[Mean    =     8879.930, StdDeviation   =     5441.720]
#[Max     =    56950.784, Total count    =       555146]
#[Buckets =           27, SubBuckets     =         2048]
----------------------------------------------------------
  730491 requests in 1.00m, 1.44GB read
  Socket errors: connect 0, read 434, write 92, timeout 12
Requests/sec:  12175.28
Transfer/sec:     24.59MB

There are a few timeout errors, but note we can't really keep the request rate at what we want, as it is at 12k req/s. The single core on the SUT maxes out, and queuing buildup happens. I've attached the histogram plot (Note the x axis, which is compressed quite a lot so you can see the high latencies) and also included nim which is another single-threaded solution making for a fair comparison. The go solution uses all 8 cores, so it has considerably less work to do per core, i.e., that comparison isn't really fair. The wai solution is a Haskell framework, but it also utilizes all 8 cores.

histogram 11

I have yet to try an Lwt solution with cohttp. It may perform totally different, but I'm not sure from where to cut it. I thought about starting off of the file server examples by Hannes, but I'm not really sure this is the way to go. Also, tuning of the OCaml system is something I'm very interested in.

I haven't had a chance to look in detail yet as I'm travelling this week, but I've provisioned some servers to try this out when back home. This should be good to use with our ongoing multicore work as well.

It would be really useful to have a Cohttp_lwt version of this test, since then we can run it against MirageOS in Xen mode as well. How about an Lwt one that just serves the file from memory, to avoid touching the disk?

Hi, I just translated it as closely as possible to Lwt, here's the code:

NOTE: You need to have libev installed. It's probably in your package manager, or get it here. I added a line to force Lwt to use it, otherwise it'll use select() and the performance will be as horrible as Async with select().

Then run opam install conf-libev

(* This file is in the public domain *)

open Core.Std
open Lwt
open Cohttp_lwt_unix

(* given filename: hello_world.ml compile with:
   $ corebuild -package lwt,cohttp.lwt hello_world.native
*)

let handler _ req _ =
  let uri = Cohttp.Request.uri req in
  match Uri.path uri with
  | "/" -> Server.respond_string ~status:`OK ~body:"CHAPTER I. Down the Rabbit-Hole  Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, <and what is the use of a book,> thought Alice <without pictures or conversations?> So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, <Oh dear! Oh dear! I shall be late!> (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge. In another moment down went Alice after it, never once considering how in the world she was to get out again. The rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that Alice had not a moment to think about stopping herself before she found herself falling down a very deep well. Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to wonder what was going to happen next. First, she tried to look down and make out what she was coming to, but it was too dark to see anything; then she looked at the sides of the well, and noticed that they were filled with cupboards......" ()
  | _ -> Server.respond_string ~status:`Not_found ~body:"Route not found" ()

let start_server port () =
  eprintf "Listening for HTTP on port %d\n" port;
  eprintf "Try 'curl http://localhost:%d/'\n%!" port;
  Server.create
    ~ctx:(Cohttp_lwt_unix_net.init ())
    ~mode:(`TCP (`Port port))
    (Server.make ~callback:handler ())

let () = Lwt_engine.set ~transfer:true ~destroy:true (new Lwt_engine.libev)

let () =
  Command.basic
    ~summary:"Start a hello world Lwt server"
    Command.Spec.(empty +>
      flag "-p" (optional_with_default 8080 int)
        ~doc:"int Source port to listen on"
    ) (fun port () -> start_server port () |> Lwt_unix.run)
  |> Command.run

Awesome! Here is the run with cohttp-lwt:

Running 1m test @ http://172.16.0.1:8080/
  8 threads and 10000 connections
  Thread calibration: mean lat.: 551.730ms, rate sampling interval: 2234ms
  Thread calibration: mean lat.: 640.357ms, rate sampling interval: 2439ms
  Thread calibration: mean lat.: 559.485ms, rate sampling interval: 2193ms
  Thread calibration: mean lat.: 563.842ms, rate sampling interval: 2242ms
  Thread calibration: mean lat.: 627.723ms, rate sampling interval: 2418ms
  Thread calibration: mean lat.: 630.466ms, rate sampling interval: 2414ms
  Thread calibration: mean lat.: 560.149ms, rate sampling interval: 2170ms
  Thread calibration: mean lat.: 628.152ms, rate sampling interval: 2449ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.08s     6.97s    0.97m    88.44%
    Req/Sec     2.64k   230.20     3.06k    57.82%
  Latency Distribution (HdrHistogram - Recorded Latency)
 50.000%    4.83s 
 75.000%    9.32s 
 90.000%   15.01s 
 99.000%   33.06s 
 99.900%   44.63s 
 99.990%    0.94m 
 99.999%    0.97m 
100.000%    0.97m 

  Detailed Percentile spectrum:
       Value   Percentile   TotalCount 1/(1-Percentile)

     206.975     0.000000            1         1.00
    1207.295     0.100000        92046         1.11
    1984.511     0.200000       184105         1.25
    2807.807     0.300000       276059         1.43
    3737.599     0.400000       368157         1.67
    4825.087     0.500000       460323         2.00
    5492.735     0.550000       506101         2.22
    6279.167     0.600000       552285         2.50
    7172.095     0.650000       598261         2.86
    8179.711     0.700000       644090         3.33
    9322.495     0.750000       690325         4.00
    9977.855     0.775000       713302         4.44
   10690.559     0.800000       736336         5.00
   11476.991     0.825000       759153         5.71
   12410.879     0.850000       782182         6.67
   13557.759     0.875000       805251         8.00
   14221.311     0.887500       816613         8.89
   15007.743     0.900000       828172        10.00
   16392.191     0.912500       839662        11.43
   18153.471     0.925000       851176        13.33
   20316.159     0.937500       862676        16.00
   21528.575     0.943750       868409        17.78
   22773.759     0.950000       874197        20.00
   23953.407     0.956250       879890        22.86
   25149.439     0.962500       885637        26.67
   26509.311     0.968750       891433        32.00
   27279.359     0.971875       894274        35.56
   28114.943     0.975000       897168        40.00
   29032.447     0.978125       900007        45.71
   29982.719     0.981250       902900        53.33
   30982.143     0.984375       905774        64.00
   31490.047     0.985938       907207        71.11
   32047.103     0.987500       908657        80.00
   32653.311     0.989062       910094        91.43
   33374.207     0.990625       911507       106.67
   34242.559     0.992188       912961       128.00
   34734.079     0.992969       913685       142.22
   35291.135     0.993750       914412       160.00
   35815.423     0.994531       915160       182.86
   36306.943     0.995313       915828       213.33
   36896.767     0.996094       916543       256.00
   37257.215     0.996484       916938       284.44
   37617.663     0.996875       917268       320.00
   38076.415     0.997266       917630       365.71
   38699.007     0.997656       917980       426.67
   39616.511     0.998047       918336       512.00
   40140.799     0.998242       918515       568.89
   40697.855     0.998437       918692       640.00
   41287.679     0.998633       918875       731.43
   42303.487     0.998828       919051       853.33
   45416.447     0.999023       919229      1024.00
   52789.247     0.999121       919323      1137.78
   53215.231     0.999219       919408      1280.00
   54722.559     0.999316       919501      1462.86
   55017.471     0.999414       919592      1706.67
   55246.847     0.999512       919684      2048.00
   55345.151     0.999561       919750      2275.56
   55443.455     0.999609       919783      2560.00
   55508.991     0.999658       919819      2925.71
   55607.295     0.999707       919864      3413.33
   55738.367     0.999756       919908      4096.00
   55836.671     0.999780       919924      4551.11
   56131.583     0.999805       919951      5120.00
   56164.351     0.999829       919970      5851.43
   56262.655     0.999854       920001      6826.67
   56328.191     0.999878       920020      8192.00
   56459.263     0.999890       920031      9102.22
   56492.031     0.999902       920037     10240.00
   56590.335     0.999915       920048     11702.86
   56688.639     0.999927       920065     13653.33
   56819.711     0.999939       920073     16384.00
   56852.479     0.999945       920077     18204.44
   56918.015     0.999951       920084     20480.00
   56950.783     0.999957       920087     23405.71
   57049.087     0.999963       920094     27306.67
   57212.927     0.999969       920099     32768.00
   57245.695     0.999973       920101     36408.89
   58294.271     0.999976       920106     40960.00
   58327.039     0.999979       920110     46811.43
   58327.039     0.999982       920110     54613.33
   58359.807     0.999985       920114     65536.00
   58359.807     0.999986       920114     72817.78
   58425.343     0.999988       920118     81920.00
   58425.343     0.999989       920118     93622.86
   58425.343     0.999991       920118    109226.67
   58458.111     0.999992       920126    131072.00
   58458.111     1.000000       920126          inf
#[Mean    =     7078.253, StdDeviation   =     6966.924]
#[Max     =    58425.344, Total count    =       920126]
#[Buckets =           27, SubBuckets     =         2048]
----------------------------------------------------------
  1200161 requests in 1.00m, 2.34GB read
  Socket errors: connect 0, read 672, write 2410, timeout 2987
Requests/sec:  19997.27
Transfer/sec:     39.93MB

It has better request rate, but more timeout'ed connections. Here is the latency diagram for the same thing:

histogram 20

It is the same problem: We sit at 100% CPU load and can't keep up with the request rate. Comparisons to Go and Haskell/Wai isn't "fair" in the sense that they use multiple cores, and the test machine has 8 of them. Go hovers around 200% CPU load for instance, so it spreads the work over multiple cores. But the nim solution is run on a single core with a Mark&Sweep GC, so it is definitely possible to get good single-core performance for this test.

The graph scream queueing all over it. And the queue just grows and grows over the course of the test. Even the median is fairly bad at 4 seconds.

Here is the Nim-run for comparison:

Running 1m test @ http://172.16.0.1:8080/
  8 threads and 10000 connections
  Thread calibration: mean lat.: 11.478ms, rate sampling interval: 17ms
  Thread calibration: mean lat.: 11.394ms, rate sampling interval: 17ms
  Thread calibration: mean lat.: 14.299ms, rate sampling interval: 32ms
  Thread calibration: mean lat.: 11.294ms, rate sampling interval: 17ms
  Thread calibration: mean lat.: 11.318ms, rate sampling interval: 17ms
  Thread calibration: mean lat.: 11.391ms, rate sampling interval: 17ms
  Thread calibration: mean lat.: 11.426ms, rate sampling interval: 17ms
  Thread calibration: mean lat.: 11.297ms, rate sampling interval: 16ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    19.01ms   76.83ms   1.85s    94.81%
    Req/Sec     3.86k     1.08k   11.32k    88.60%
  Latency Distribution (HdrHistogram - Recorded Latency)
 50.000%    1.73ms
 75.000%    2.48ms
 90.000%    8.18ms
 99.000%  430.59ms
 99.900%  632.32ms
 99.990%  974.85ms
 99.999%    1.14s 
100.000%    1.85s 

  Detailed Percentile spectrum:
       Value   Percentile   TotalCount 1/(1-Percentile)

       0.199     0.000000            1         1.00
       0.955     0.100000       131257         1.11
       1.173     0.200000       262446         1.25
       1.355     0.300000       394085         1.43
       1.529     0.400000       525270         1.67
       1.730     0.500000       656565         2.00
       1.846     0.550000       721901         2.22
       1.973     0.600000       787672         2.50
       2.115     0.650000       853762         2.86
       2.275     0.700000       919264         3.33
       2.479     0.750000       984476         4.00
       2.609     0.775000      1017002         4.44
       2.779     0.800000      1050031         5.00
       2.979     0.825000      1082567         5.71
       3.229     0.850000      1115503         6.67
       4.487     0.875000      1148181         8.00
       6.303     0.887500      1164581         8.89
       8.179     0.900000      1180986        10.00
      10.183     0.912500      1197366        11.43
      12.871     0.925000      1213796        13.33
      20.607     0.937500      1230166        16.00
      54.591     0.943750      1238365        17.78
     112.319     0.950000      1246570        20.00
     165.759     0.956250      1254788        22.86
     219.007     0.962500      1262966        26.67
     262.399     0.968750      1271208        32.00
     280.831     0.971875      1275288        35.56
     302.591     0.975000      1279394        40.00
     324.863     0.978125      1283472        45.71
     355.583     0.981250      1287593        53.33
     385.535     0.984375      1291693        64.00
     399.615     0.985938      1293756        71.11
     411.903     0.987500      1295778        80.00
     423.167     0.989062      1297827        91.43
     435.967     0.990625      1299898       106.67
     453.631     0.992188      1301931       128.00
     465.919     0.992969      1302954       142.22
     480.767     0.993750      1303977       160.00
     497.151     0.994531      1305012       182.86
     513.535     0.995313      1306024       213.33
     530.943     0.996094      1307053       256.00
     538.623     0.996484      1307575       284.44
     545.791     0.996875      1308092       320.00
     553.471     0.997266      1308612       365.71
     560.127     0.997656      1309125       426.67
     564.735     0.998047      1309641       512.00
     569.343     0.998242      1309884       568.89
     575.487     0.998437      1310132       640.00
     584.703     0.998633      1310389       731.43
     601.599     0.998828      1310642       853.33
     637.439     0.999023      1310894      1024.00
     665.087     0.999121      1311020      1137.78
     693.759     0.999219      1311147      1280.00
     726.015     0.999316      1311277      1462.86
     755.199     0.999414      1311406      1706.67
     786.943     0.999512      1311533      2048.00
     804.351     0.999561      1311596      2275.56
     820.223     0.999609      1311660      2560.00
     842.239     0.999658      1311726      2925.71
     868.351     0.999707      1311788      3413.33
     901.631     0.999756      1311852      4096.00
     914.943     0.999780      1311884      4551.11
     922.111     0.999805      1311916      5120.00
     932.863     0.999829      1311950      5851.43
     944.127     0.999854      1311980      6826.67
     957.951     0.999878      1312013      8192.00
     965.631     0.999890      1312029      9102.22
     976.383     0.999902      1312045     10240.00
     985.087     0.999915      1312060     11702.86
     995.839     0.999927      1312078     13653.33
    1008.127     0.999939      1312093     16384.00
    1017.343     0.999945      1312100     18204.44
    1030.143     0.999951      1312108     20480.00
    1046.015     0.999957      1312116     23405.71
    1058.815     0.999963      1312124     27306.67
    1065.983     0.999969      1312132     32768.00
    1073.151     0.999973      1312136     36408.89
    1082.367     0.999976      1312140     40960.00
    1087.487     0.999979      1312145     46811.43
    1100.799     0.999982      1312149     54613.33
    1121.279     0.999985      1312152     65536.00
    1129.471     0.999986      1312154     72817.78
    1133.567     0.999988      1312158     81920.00
    1133.567     0.999989      1312158     93622.86
    1146.879     0.999991      1312160    109226.67
    1184.767     0.999992      1312162    131072.00
    1186.815     0.999993      1312163    145635.56
    1192.959     0.999994      1312164    163840.00
    1214.463     0.999995      1312165    187245.71
    1228.799     0.999995      1312166    218453.33
    1251.327     0.999996      1312167    262144.00
    1389.567     0.999997      1312168    291271.11
    1389.567     0.999997      1312168    327680.00
    1393.663     0.999997      1312169    374491.43
    1393.663     0.999998      1312169    436906.67
    1476.607     0.999998      1312170    524288.00
    1476.607     0.999998      1312170    582542.22
    1476.607     0.999998      1312170    655360.00
    1517.567     0.999999      1312171    748982.86
    1517.567     0.999999      1312171    873813.33
    1517.567     0.999999      1312171   1048576.00
    1517.567     0.999999      1312171   1165084.44
    1517.567     0.999999      1312171   1310720.00
    1850.367     0.999999      1312172   1497965.71
    1850.367     1.000000      1312172          inf
#[Mean    =       19.005, StdDeviation   =       76.829]
#[Max     =     1849.344, Total count    =      1312172]
#[Buckets =           27, SubBuckets     =         2048]
----------------------------------------------------------
  1710642 requests in 1.00m, 3.38GB read
Requests/sec:  28511.79
Transfer/sec:     57.62MB

@jlouis Thank you very much for doing this. This is hugely appreciated.

Having thought somewhat on these numbers, there are a couple of things which are peculiar.

We have a test which runs for one minute. Some of the latencies are close to that limit, being around 55+ seconds. This means there are connections in there which are stalled for the entirety of the test case. Had the system been a queue, in strict FIFO order, this wouldn't happen. But the same latency chart shows up if you have stalling behavior or processing in LIFO stack order.

This suggests there are some connections which are not even being visited in time by the system before more work is piled on top. A typical simple functional solution would be to take the new work and then run

let old_work = List.append new_work old_work in ...

which would exhibit this problem. Though there are things happened in the upper percentiles which suggest things are somewhat off:

histogram 21

"Puma" is a Ruby framework, which also exhibit queuing and not being able to keep up. But it's latency curve is far "smoother" than what we see in CoHTTP/OCaml. But note that the slowest response time for Puma is around 15 seconds, not close to the 60 second mark.

I'm just wondering if there are any updates or plans?

I'm planning to do a Cohttp bug sweep later this week as part of the next Mirage release. Apologies for the delay...it's been a busy term time here in Cambridge :)

This has to be a pathology somewhere. The differences to other frameworks is simply too high for it to not matter. However, I'd suggest starting out by establishing if you can reproduce the above test case. It's been a while. If you do, note Will Glozer made wrk 4.0.1 in the meantime and Gil has been pulling changes to wrk2 from that. So it is worth looking into if they tests are any different before embarking upon fixing the errors here.

This issue still seems to exist. Has any progress been made / is there anything I can try to do to help?

@seliopou has a parser combinator generator that addresses much of the GC latency. As soon as that's released, we should be able to port the parser to using it.

Just a heads up that while testing Angstrom's HTTP request parsing I got a 75% throughput increase (~20MB/s to ~35MB/s) for Lwt by increasing the input channel buffer size to 64K. The default buffer size is 4K, and I don't think it grows without you asking. Async's Reader.t starts off with a buffer size of 64K, so it was already at ~35MB/s without any modification. Growing the buffers beyond 64K doesn't seem to have any effect on throughput, but the CPU isn't pegged either, so the bottleneck's either in the network (disk for me), the kernel, or the Async/Lwt schedulers.

This is on a FreeBSD VM without any kernel tweaking so YMMV. Also, Async is using select() and Lwt are using poll(), so it's best not to put much stock in the figures outside of the relative throughput increase.

You may want to add the following line in the Lwt test server before it starts accepting requests:

Lwt_io.set_default_buffer_size 0x10000;

Growing the buffers beyond 64K doesn't seem to have any effect on throughput,

If these libraries are using Unix.{send,recv,read,write} at the lowest level then this may be due to the value of UNIX_BUFFER_SIZE. Basically IIRC call to these functions will never read/write more than that amount of data in a single read/write.

@jlouis Can you point us to the implementations of the other web servers you used to do this benchmark? In particular, I'd like to the the nim implementation, as that's the most apples-to-apples comparison.

Of course, see https://gist.github.com/jlouis/e3db53339bf4c404d6197a3b541c3c93

There are two modes for the GC: mark and sweep and reference counting, the latter being the default about a year ago when I did the benchmark. I also have the wai and go solutions hiding somewhere, though knowing the Haskell world, wai might have diverged by now.

@jlouis one more request for you. Could you please provide the raw data you used to plot the histograms for all the non-ocaml implementations? I'd like to do some comparisons to some of my performance improvements.

Yes, I'll dig them up somewhere. I think I still have them, but we should redo the project. I expect things have improved all over the place.

On Fri, Dec 9, 2016, 17:59 Spiros Eliopoulos notifications@github.com wrote:

@jlouis https://github.com/jlouis one more request for you. Could you please provide the raw data you used to plot the histograms for all the non-ocaml implementations? I'd like to do some comparisons to some of my performance improvements.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mirage/ocaml-cohttp/issues/328#issuecomment-266064795, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAWH9JkBEo7wvrG72Ofrmp4KLMaMdcyks5rGYjVgaJpZM4EEVf0 .

Managed to find them, they should be attached here. The naming could be better as some of them mention a given language, where others mention a web server in that language. You will have to search a bit around to figure out what they are in every case.

out.cohttp-async.txt out.cohttp-lwt.txt out.compojure.txt out.cowboy.txt out.cppsp.txt out.falcon.txt out.go.txt out.go2.txt out.haywire.txt out.nginx.txt out.nim_msgc.txt out.nim_rc.txt out.node.txt out.openresty.txt out.puma.txt out.rust.txt out.undertow.txt out.varnish.txt out.wai.txt

Perhaps people who have subscribed to this thread are interested in this PR: https://github.com/mirage/ocaml-cohttp/pull/819

mirage / ocaml-cohttp

Latency benchmarking cohttp shows queueing #328