jlouis opened this issue 9 years ago
I haven't had a chance to look in detail yet as I'm travelling this week, but I've provisioned some servers to try this out when back home. This should be good to use with our ongoing multicore work as well.
It would be really useful to have a Cohttp_lwt version of this test, since then we can run it against MirageOS in Xen mode as well. How about an Lwt one that just serves the file from memory, to avoid touching the disk?
Hi, I just translated it to Lwt as closely as possible; here's the code:
NOTE: You need to have libev installed. It's probably in your package manager, or get it here. I added a line to force Lwt to use it, otherwise it'll use select() and the performance will be as horrible as Async with select().
Then run opam install conf-libev
(* This file is in the public domain *)
open Core.Std
open Lwt
open Cohttp_lwt_unix
(* given filename: hello_world.ml compile with:
$ corebuild -package lwt,cohttp.lwt hello_world.native
*)
let handler _ req _ =
let uri = Cohttp.Request.uri req in
match Uri.path uri with
| "/" -> Server.respond_string ~status:`OK ~body:"CHAPTER I. Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, <and what is the use of a book,> thought Alice <without pictures or conversations?> So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, <Oh dear! Oh dear! I shall be late!> (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge. In another moment down went Alice after it, never once considering how in the world she was to get out again. The rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that Alice had not a moment to think about stopping herself before she found herself falling down a very deep well. Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to wonder what was going to happen next. First, she tried to look down and make out what she was coming to, but it was too dark to see anything; then she looked at the sides of the well, and noticed that they were filled with cupboards......" ()
| _ -> Server.respond_string ~status:`Not_found ~body:"Route not found" ()
let start_server port () =
  eprintf "Listening for HTTP on port %d\n" port;
  eprintf "Try 'curl http://localhost:%d/'\n%!" port;
  Server.create
    ~ctx:(Cohttp_lwt_unix_net.init ())
    ~mode:(`TCP (`Port port))
    (Server.make ~callback:handler ())

let () = Lwt_engine.set ~transfer:true ~destroy:true (new Lwt_engine.libev)

let () =
  Command.basic
    ~summary:"Start a hello world Lwt server"
    Command.Spec.(empty
                  +> flag "-p" (optional_with_default 8080 int)
                       ~doc:"int Source port to listen on")
    (fun port () -> start_server port () |> Lwt_unix.run)
  |> Command.run
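If you want to confirm at runtime that your Lwt build actually has libev support, a quick check is the sketch below (it relies on Lwt_sys.have, which I believe Lwt exposes for exactly this kind of feature probe):

(* Sketch: report whether this Lwt build was compiled with libev support.
   If it wasn't, instantiating Lwt_engine.libev should fail and Lwt will
   stay on the select()-based engine. *)
let () =
  if Lwt_sys.have `libev then
    print_endline "Lwt has libev support"
  else
    print_endline "Lwt was built without libev; it will fall back to select()"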
Awesome! Here is the run with cohttp-lwt:
Running 1m test @ http://172.16.0.1:8080/
8 threads and 10000 connections
Thread calibration: mean lat.: 551.730ms, rate sampling interval: 2234ms
Thread calibration: mean lat.: 640.357ms, rate sampling interval: 2439ms
Thread calibration: mean lat.: 559.485ms, rate sampling interval: 2193ms
Thread calibration: mean lat.: 563.842ms, rate sampling interval: 2242ms
Thread calibration: mean lat.: 627.723ms, rate sampling interval: 2418ms
Thread calibration: mean lat.: 630.466ms, rate sampling interval: 2414ms
Thread calibration: mean lat.: 560.149ms, rate sampling interval: 2170ms
Thread calibration: mean lat.: 628.152ms, rate sampling interval: 2449ms
Thread Stats Avg Stdev Max +/- Stdev
Latency 7.08s 6.97s 0.97m 88.44%
Req/Sec 2.64k 230.20 3.06k 57.82%
Latency Distribution (HdrHistogram - Recorded Latency)
50.000% 4.83s
75.000% 9.32s
90.000% 15.01s
99.000% 33.06s
99.900% 44.63s
99.990% 0.94m
99.999% 0.97m
100.000% 0.97m
Detailed Percentile spectrum:
Value Percentile TotalCount 1/(1-Percentile)
206.975 0.000000 1 1.00
1207.295 0.100000 92046 1.11
1984.511 0.200000 184105 1.25
2807.807 0.300000 276059 1.43
3737.599 0.400000 368157 1.67
4825.087 0.500000 460323 2.00
5492.735 0.550000 506101 2.22
6279.167 0.600000 552285 2.50
7172.095 0.650000 598261 2.86
8179.711 0.700000 644090 3.33
9322.495 0.750000 690325 4.00
9977.855 0.775000 713302 4.44
10690.559 0.800000 736336 5.00
11476.991 0.825000 759153 5.71
12410.879 0.850000 782182 6.67
13557.759 0.875000 805251 8.00
14221.311 0.887500 816613 8.89
15007.743 0.900000 828172 10.00
16392.191 0.912500 839662 11.43
18153.471 0.925000 851176 13.33
20316.159 0.937500 862676 16.00
21528.575 0.943750 868409 17.78
22773.759 0.950000 874197 20.00
23953.407 0.956250 879890 22.86
25149.439 0.962500 885637 26.67
26509.311 0.968750 891433 32.00
27279.359 0.971875 894274 35.56
28114.943 0.975000 897168 40.00
29032.447 0.978125 900007 45.71
29982.719 0.981250 902900 53.33
30982.143 0.984375 905774 64.00
31490.047 0.985938 907207 71.11
32047.103 0.987500 908657 80.00
32653.311 0.989062 910094 91.43
33374.207 0.990625 911507 106.67
34242.559 0.992188 912961 128.00
34734.079 0.992969 913685 142.22
35291.135 0.993750 914412 160.00
35815.423 0.994531 915160 182.86
36306.943 0.995313 915828 213.33
36896.767 0.996094 916543 256.00
37257.215 0.996484 916938 284.44
37617.663 0.996875 917268 320.00
38076.415 0.997266 917630 365.71
38699.007 0.997656 917980 426.67
39616.511 0.998047 918336 512.00
40140.799 0.998242 918515 568.89
40697.855 0.998437 918692 640.00
41287.679 0.998633 918875 731.43
42303.487 0.998828 919051 853.33
45416.447 0.999023 919229 1024.00
52789.247 0.999121 919323 1137.78
53215.231 0.999219 919408 1280.00
54722.559 0.999316 919501 1462.86
55017.471 0.999414 919592 1706.67
55246.847 0.999512 919684 2048.00
55345.151 0.999561 919750 2275.56
55443.455 0.999609 919783 2560.00
55508.991 0.999658 919819 2925.71
55607.295 0.999707 919864 3413.33
55738.367 0.999756 919908 4096.00
55836.671 0.999780 919924 4551.11
56131.583 0.999805 919951 5120.00
56164.351 0.999829 919970 5851.43
56262.655 0.999854 920001 6826.67
56328.191 0.999878 920020 8192.00
56459.263 0.999890 920031 9102.22
56492.031 0.999902 920037 10240.00
56590.335 0.999915 920048 11702.86
56688.639 0.999927 920065 13653.33
56819.711 0.999939 920073 16384.00
56852.479 0.999945 920077 18204.44
56918.015 0.999951 920084 20480.00
56950.783 0.999957 920087 23405.71
57049.087 0.999963 920094 27306.67
57212.927 0.999969 920099 32768.00
57245.695 0.999973 920101 36408.89
58294.271 0.999976 920106 40960.00
58327.039 0.999979 920110 46811.43
58327.039 0.999982 920110 54613.33
58359.807 0.999985 920114 65536.00
58359.807 0.999986 920114 72817.78
58425.343 0.999988 920118 81920.00
58425.343 0.999989 920118 93622.86
58425.343 0.999991 920118 109226.67
58458.111 0.999992 920126 131072.00
58458.111 1.000000 920126 inf
#[Mean = 7078.253, StdDeviation = 6966.924]
#[Max = 58425.344, Total count = 920126]
#[Buckets = 27, SubBuckets = 2048]
----------------------------------------------------------
1200161 requests in 1.00m, 2.34GB read
Socket errors: connect 0, read 672, write 2410, timeout 2987
Requests/sec: 19997.27
Transfer/sec: 39.93MB
It has a better request rate, but more timed-out connections. Here is the latency diagram for the same run:
It is the same problem: we sit at 100% CPU load and can't keep up with the request rate. Comparisons to Go and Haskell/Wai aren't "fair" in the sense that they use multiple cores, and the test machine has 8 of them. Go hovers around 200% CPU load, for instance, so it spreads the work over multiple cores. But the nim solution runs on a single core with a mark-and-sweep GC, so it is definitely possible to get good single-core performance for this test.
The graph has queueing written all over it, and the queue just grows and grows over the course of the test. Even the median is fairly bad at roughly 4.8 seconds.
Here is the Nim run for comparison:
Running 1m test @ http://172.16.0.1:8080/
8 threads and 10000 connections
Thread calibration: mean lat.: 11.478ms, rate sampling interval: 17ms
Thread calibration: mean lat.: 11.394ms, rate sampling interval: 17ms
Thread calibration: mean lat.: 14.299ms, rate sampling interval: 32ms
Thread calibration: mean lat.: 11.294ms, rate sampling interval: 17ms
Thread calibration: mean lat.: 11.318ms, rate sampling interval: 17ms
Thread calibration: mean lat.: 11.391ms, rate sampling interval: 17ms
Thread calibration: mean lat.: 11.426ms, rate sampling interval: 17ms
Thread calibration: mean lat.: 11.297ms, rate sampling interval: 16ms
Thread Stats Avg Stdev Max +/- Stdev
Latency 19.01ms 76.83ms 1.85s 94.81%
Req/Sec 3.86k 1.08k 11.32k 88.60%
Latency Distribution (HdrHistogram - Recorded Latency)
50.000% 1.73ms
75.000% 2.48ms
90.000% 8.18ms
99.000% 430.59ms
99.900% 632.32ms
99.990% 974.85ms
99.999% 1.14s
100.000% 1.85s
Detailed Percentile spectrum:
Value Percentile TotalCount 1/(1-Percentile)
0.199 0.000000 1 1.00
0.955 0.100000 131257 1.11
1.173 0.200000 262446 1.25
1.355 0.300000 394085 1.43
1.529 0.400000 525270 1.67
1.730 0.500000 656565 2.00
1.846 0.550000 721901 2.22
1.973 0.600000 787672 2.50
2.115 0.650000 853762 2.86
2.275 0.700000 919264 3.33
2.479 0.750000 984476 4.00
2.609 0.775000 1017002 4.44
2.779 0.800000 1050031 5.00
2.979 0.825000 1082567 5.71
3.229 0.850000 1115503 6.67
4.487 0.875000 1148181 8.00
6.303 0.887500 1164581 8.89
8.179 0.900000 1180986 10.00
10.183 0.912500 1197366 11.43
12.871 0.925000 1213796 13.33
20.607 0.937500 1230166 16.00
54.591 0.943750 1238365 17.78
112.319 0.950000 1246570 20.00
165.759 0.956250 1254788 22.86
219.007 0.962500 1262966 26.67
262.399 0.968750 1271208 32.00
280.831 0.971875 1275288 35.56
302.591 0.975000 1279394 40.00
324.863 0.978125 1283472 45.71
355.583 0.981250 1287593 53.33
385.535 0.984375 1291693 64.00
399.615 0.985938 1293756 71.11
411.903 0.987500 1295778 80.00
423.167 0.989062 1297827 91.43
435.967 0.990625 1299898 106.67
453.631 0.992188 1301931 128.00
465.919 0.992969 1302954 142.22
480.767 0.993750 1303977 160.00
497.151 0.994531 1305012 182.86
513.535 0.995313 1306024 213.33
530.943 0.996094 1307053 256.00
538.623 0.996484 1307575 284.44
545.791 0.996875 1308092 320.00
553.471 0.997266 1308612 365.71
560.127 0.997656 1309125 426.67
564.735 0.998047 1309641 512.00
569.343 0.998242 1309884 568.89
575.487 0.998437 1310132 640.00
584.703 0.998633 1310389 731.43
601.599 0.998828 1310642 853.33
637.439 0.999023 1310894 1024.00
665.087 0.999121 1311020 1137.78
693.759 0.999219 1311147 1280.00
726.015 0.999316 1311277 1462.86
755.199 0.999414 1311406 1706.67
786.943 0.999512 1311533 2048.00
804.351 0.999561 1311596 2275.56
820.223 0.999609 1311660 2560.00
842.239 0.999658 1311726 2925.71
868.351 0.999707 1311788 3413.33
901.631 0.999756 1311852 4096.00
914.943 0.999780 1311884 4551.11
922.111 0.999805 1311916 5120.00
932.863 0.999829 1311950 5851.43
944.127 0.999854 1311980 6826.67
957.951 0.999878 1312013 8192.00
965.631 0.999890 1312029 9102.22
976.383 0.999902 1312045 10240.00
985.087 0.999915 1312060 11702.86
995.839 0.999927 1312078 13653.33
1008.127 0.999939 1312093 16384.00
1017.343 0.999945 1312100 18204.44
1030.143 0.999951 1312108 20480.00
1046.015 0.999957 1312116 23405.71
1058.815 0.999963 1312124 27306.67
1065.983 0.999969 1312132 32768.00
1073.151 0.999973 1312136 36408.89
1082.367 0.999976 1312140 40960.00
1087.487 0.999979 1312145 46811.43
1100.799 0.999982 1312149 54613.33
1121.279 0.999985 1312152 65536.00
1129.471 0.999986 1312154 72817.78
1133.567 0.999988 1312158 81920.00
1133.567 0.999989 1312158 93622.86
1146.879 0.999991 1312160 109226.67
1184.767 0.999992 1312162 131072.00
1186.815 0.999993 1312163 145635.56
1192.959 0.999994 1312164 163840.00
1214.463 0.999995 1312165 187245.71
1228.799 0.999995 1312166 218453.33
1251.327 0.999996 1312167 262144.00
1389.567 0.999997 1312168 291271.11
1389.567 0.999997 1312168 327680.00
1393.663 0.999997 1312169 374491.43
1393.663 0.999998 1312169 436906.67
1476.607 0.999998 1312170 524288.00
1476.607 0.999998 1312170 582542.22
1476.607 0.999998 1312170 655360.00
1517.567 0.999999 1312171 748982.86
1517.567 0.999999 1312171 873813.33
1517.567 0.999999 1312171 1048576.00
1517.567 0.999999 1312171 1165084.44
1517.567 0.999999 1312171 1310720.00
1850.367 0.999999 1312172 1497965.71
1850.367 1.000000 1312172 inf
#[Mean = 19.005, StdDeviation = 76.829]
#[Max = 1849.344, Total count = 1312172]
#[Buckets = 27, SubBuckets = 2048]
----------------------------------------------------------
1710642 requests in 1.00m, 3.38GB read
Requests/sec: 28511.79
Transfer/sec: 57.62MB
@jlouis Thank you very much for doing this. This is hugely appreciated.
Having thought a bit about these numbers, there are a couple of things that are peculiar.
We have a test which runs for one minute, and some of the latencies are close to that limit, at around 55+ seconds. This means there are connections that stall for essentially the entire test. Had the system been a queue in strict FIFO order, this wouldn't happen; but you get exactly this kind of latency chart with stalling behavior or with processing in LIFO (stack) order.
This suggests there are connections that are not even visited by the system before more work is piled on top of them. A typical simple functional solution would be to take the new work and then run
let old_work = List.append new_work old_work in ...
which would exhibit exactly this problem. There are also things happening in the upper percentiles which suggest something is off:
"Puma" is a Ruby framework, which also exhibit queuing and not being able to keep up. But it's latency curve is far "smoother" than what we see in CoHTTP/OCaml. But note that the slowest response time for Puma is around 15 seconds, not close to the 60 second mark.
I'm just wondering if there are any updates or plans?
I'm planning to do a Cohttp bug sweep later this week as part of the next Mirage release. Apologies for the delay...it's been a busy term time here in Cambridge :)
This has to be a pathology somewhere; the difference from the other frameworks is simply too large for it not to matter. However, I'd suggest starting by establishing whether you can reproduce the above test case, since it's been a while. If you do, note that Will Glozer released wrk 4.0.1 in the meantime and Gil has been pulling its changes into wrk2, so it is worth checking whether the tests behave any differently before embarking on fixing the errors here.
This issue still seems to exist. Has any progress been made / is there anything I can try to do to help?
@seliopou has a parser-combinator library (Angstrom) that addresses much of the GC latency. As soon as that's released, we should be able to port the parser over to it.
Just a heads up that while testing Angstrom's HTTP request parsing I got a 75% throughput increase (~20MB/s to ~35MB/s) for Lwt by increasing the input channel buffer size to 64K. The default buffer size is 4K, and I don't think it grows without you asking. Async's Reader.t starts off with a buffer size of 64K, so it was already at ~35MB/s without any modification. Growing the buffers beyond 64K doesn't seem to have any effect on throughput, but the CPU isn't pegged either, so the bottleneck is either in the network (disk for me), the kernel, or the Async/Lwt schedulers.
This is on a FreeBSD VM without any kernel tweaking, so YMMV. Also, Async is using select() and Lwt is using poll(), so it's best not to put much stock in the figures beyond the relative throughput increase.
You may want to add the following line in the Lwt test server before it starts accepting requests:
Lwt_io.set_default_buffer_size 0x10000;
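For concreteness, a sketch of where that call could go in the Lwt server from earlier in this thread (same code as above; the buffer-size line is the only addition, and 0x10000 is 64K):

let start_server port () =
  (* Assumed placement: bump Lwt_io's default channel buffer size from the
     4K default to 64K before the server starts accepting connections. *)
  Lwt_io.set_default_buffer_size 0x10000;
  eprintf "Listening for HTTP on port %d\n" port;
  eprintf "Try 'curl http://localhost:%d/'\n%!" port;
  Server.create
    ~ctx:(Cohttp_lwt_unix_net.init ())
    ~mode:(`TCP (`Port port))
    (Server.make ~callback:handler ())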
"Growing the buffers beyond 64K doesn't seem to have any effect on throughput"
If these libraries are using Unix.{send,recv,read,write} at the lowest level, then this may be due to the value of UNIX_BUFFER_SIZE. IIRC, a single call to these functions will never read or write more than that amount of data.
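A quick way to see that cap from OCaml (a sketch; it assumes the runtime's UNIX_BUFFER_SIZE is 64K, which is what I remember it being):

(* Even when asked for 256K, a single Unix.read should return at most
   UNIX_BUFFER_SIZE bytes, because the runtime copies the data through a
   bounce buffer of that size. *)
let () =
  let fd = Unix.openfile "/dev/zero" [ Unix.O_RDONLY ] 0 in
  let buf = Bytes.create (256 * 1024) in
  let n = Unix.read fd buf 0 (Bytes.length buf) in
  Printf.printf "requested %d bytes, got %d in one read\n" (Bytes.length buf) n;
  Unix.close fd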
@jlouis Can you point us to the implementations of the other web servers you used for this benchmark? In particular, I'd like to see the Nim implementation, as that's the most apples-to-apples comparison.
Of course, see https://gist.github.com/jlouis/e3db53339bf4c404d6197a3b541c3c93
There are two modes for the Nim GC, mark-and-sweep and reference counting, the latter being the default about a year ago when I did the benchmark. I also have the wai and go solutions hiding somewhere, though knowing the Haskell world, wai might have diverged by now.
@jlouis one more request for you. Could you please provide the raw data you used to plot the histograms for all the non-ocaml implementations? I'd like to do some comparisons to some of my performance improvements.
Yes, I'll dig them up somewhere. I think I still have them, but we should redo the project. I expect things have improved all over the place.
Managed to find them; they should be attached here. The naming could be better, as some of them mention a given language while others mention a web server in that language, so you will have to search around a bit to figure out what each one is.
out.cohttp-async.txt out.cohttp-lwt.txt out.compojure.txt out.cowboy.txt out.cppsp.txt out.falcon.txt out.go.txt out.go2.txt out.haywire.txt out.nginx.txt out.nim_msgc.txt out.nim_rc.txt out.node.txt out.openresty.txt out.puma.txt out.rust.txt out.undertow.txt out.varnish.txt out.wai.txt
Perhaps people who have subscribed to this thread are interested in this PR: https://github.com/mirage/ocaml-cohttp/pull/819
Hi,
So I'm in the process of doing latency benchmarks for a couple of different web server frameworks. One of the frameworks is cohttp. I have the following code for the test:
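(The code block itself did not survive into this copy of the thread. Since the Lwt version earlier in the thread was described as a close translation of it, the original cohttp-async server presumably looked roughly like the sketch below, reconstructed from the cohttp Async hello-world example of that era rather than taken from the author's file; the response body was the long Alice excerpt shown above, shortened here.)

(* Reconstruction, not the original file: a cohttp-async hello-world server
   in the style of the time (Core.Std / Async.Std, built with corebuild). *)
open Core.Std
open Async.Std
open Cohttp_async

let handler ~body:_ _sock req =
  let uri = Cohttp.Request.uri req in
  match Uri.path uri with
  | "/" -> Server.respond_string ~status:`OK "CHAPTER I. Down the Rabbit-Hole ..."
  | _ -> Server.respond_string ~status:`Not_found "Route not found"

let start_server port () =
  eprintf "Listening for HTTP on port %d\n" port;
  eprintf "Try 'curl http://localhost:%d/'\n%!" port;
  Cohttp_async.Server.create ~on_handler_error:`Raise
    (Tcp.on_port port) handler
  >>= fun _ -> Deferred.never ()

let () =
  Command.async_basic
    ~summary:"Start a hello world Async server"
    Command.Spec.(empty
                  +> flag "-p" (optional_with_default 8080 int)
                       ~doc:"int Source port to listen on")
    start_server
  |> Command.run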
and it is built with:
There are two servers, one load-generating and one running the cohttp server, connected over a 1-gigabit link (iperf shows 953 Mbit/s of throughput) with no latency to speak of (there is a switch in between, but that is all).
The servers run Linux: Linux lady-of-pain 3.19.3-3-ARCH #1 SMP PREEMPT Wed Apr 8 14:10:00 CEST 2015 x86_64 GNU/Linux
They are fairly recent Core i7s, with no virtualization, since that just slows you down.
Some sysctl.conf tuning was necessary:
Async is the recent point release, and I'm forcing epoll support because the test just fails when it runs in a select(2) loop.
The load generator is wrk2 (github.com/giltene/wrk2), which avoids coordinated omission. Most load generators coordinate: if a connection stalls towards the system under test (SUT), then no further requests are issued on that connection until the first one completes, so you get only one "bad" latency number. In wrk2 the rate is kept stable: at each interval, new requests are scheduled on the stalling connection anyway, so you get more realistic latencies out of stalled connections.
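As a sketch of the difference (illustrative OCaml, not wrk2's actual implementation): the corrected latency is measured from the time each request was scheduled to be sent, not from the time a stalled connection finally got around to sending it.

(* Illustrative sketch of coordinated-omission correction. [send_and_wait]
   and [record] stand in for the load generator's I/O and its histogram. *)
let run_connection ~rate ~send_and_wait ~record ~until =
  let interval = 1.0 /. rate in
  let rec loop scheduled =
    if scheduled > until then ()
    else begin
      let finished = send_and_wait () in
      (* Latency is measured from the scheduled send time, so a stalled
         connection accumulates many bad samples instead of just one. *)
      record (finished -. scheduled);
      loop (scheduled +. interval)
    end
  in
  loop (Unix.gettimeofday ())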
We run the following:
That is, 10k connections at a rate of 30k req/s, which means 3 req/s per connection. Such a run yields the following latency histogram:
There are a few timeout errors, but note that we can't really keep the request rate at what we want: it sits at 12k req/s. The single core on the SUT maxes out, and queueing builds up. I've attached the histogram plot (note the x axis, which is compressed quite a lot so you can see the high latencies) and also included nim, which is another single-threaded solution, making for a fair comparison. The go solution uses all 8 cores, so it has considerably less work to do per core, i.e., that comparison isn't really fair. The wai solution is a Haskell framework, but it also utilizes all 8 cores.
I have yet to try an Lwt solution with cohttp. It may perform totally differently, but I'm not sure where to cut it from. I thought about starting off of the file server examples by Hannes, but I'm not really sure that is the way to go. Also, tuning of the OCaml system is something I'm very interested in.
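On the OCaml-tuning point, one knob that seems worth experimenting with for a server like this is the GC; a sketch follows (the numbers are illustrative guesses, not recommendations, and the same settings can also be passed via OCAMLRUNPARAM):

(* Illustrative GC tuning: a larger minor heap means fewer minor collections
   per burst of requests, and a higher space_overhead makes the major GC
   lazier, trading memory for latency. *)
let () =
  let ctrl = Gc.get () in
  Gc.set { ctrl with
           Gc.minor_heap_size = 4 * 1024 * 1024;  (* in words *)
           Gc.space_overhead = 200 }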