spion / hashtable-latencies

Latency of a web service that stores a large hashtable, in multiple languages

Unable to reproduce ocaml-reason results #9

Closed kostya-sh closed 8 years ago

kostya-sh commented 8 years ago

I used the following commands to build the ocaml-reason test program:

opam switch 4.03.0+flambda
eval `opam config env`
opam install cohttp lwt
./build.sh

However, the resulting program uses 4 GB of memory during warmup. Is that expected? How much RAM is required to run the test program?

I cannot run the full test on my laptop but here is the output from wrk2 --latency -c 99 -t 3 -d 30 -R10000 'http://localhost:8080': https://gist.github.com/kostya-sh/f5a925b139a55517ccd2b72eb1544469.

  Latency Distribution (HdrHistogram - Recorded Latency)
 50.000%    1.73ms
 75.000%    2.43ms
 90.000%    3.17ms
 99.000%   11.00ms
 99.900%   24.13ms
 99.990%   28.80ms
 99.999%   31.22ms
100.000%   34.37ms

OS: Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08) x86_64 GNU/Linux
CPU: 4 cores

spion commented 8 years ago

The 4 GB memory usage is consistent with what I observed here, yes. I haven't looked for a way to control that yet.
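If the usage is GC-driven, one knob that might help (an assumption on my part, not something tried in this thread) is lowering the GC's space_overhead, which trades collection CPU time for a smaller heap:

(* Hypothetical tuning, not verified here: space_overhead is the
   percentage of extra heap the GC tolerates relative to live data;
   lower values mean more frequent collection and a tighter heap. *)
let () = Gc.set { (Gc.get ()) with Gc.space_overhead = 80 }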

The difference could be due to a difference in single-threaded CPU performance. Which CPU is that?

Can you try running wrk without a limit on requests per second, to see how many RPS it can serve on your machine? If that number is too close to 10K, you may be observing throughput-caused latency increases.

edit: Also, many of the apps have worse initial latency (while growing the hash table) than steady-state latency (where items are both added and removed), so that could be a factor.
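To illustrate that warmup effect (hypothetical code, not from the repo; the key/value types here are assumptions): OCaml's Hashtbl doubles and rehashes as it fills, so passing the expected final entry count as the creation hint avoids the intermediate rehashes.

(* Hypothetical sketch: a table created with a small hint rehashes
   repeatedly while filling, which shows up as warmup latency spikes;
   a size hint near the final entry count avoids those pauses. *)
let table : (string, bytes) Hashtbl.t = Hashtbl.create 250_000
let () = Hashtbl.replace table "example-key" (Bytes.create 1024)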

kostya-sh commented 8 years ago

4 GB is roughly 16 times the size of the application's live data set (about 250 MB: 250,000 entries of ~1 KB each). For comparison, the Go version uses 580 MB and the Haskell version uses 810 MB. Maybe there is a bug in the OCaml version of the app? Or some memory-hungry data structure is being used?

Unfortunately my laptop has only 4 GB (time to upgrade? ;), so I cannot run the original test. However, when I reduce the maximum map size to 125,000, I get the following results (wrk2 --latency -c 99 -t 3 -d 300 -R9000 'http://localhost:8080'):

  Latency Distribution (HdrHistogram - Recorded Latency)
 50.000%    1.81ms
 75.000%    2.51ms
 90.000%    3.25ms
 99.000%    6.18ms
 99.900%   13.85ms
 99.990%   23.89ms
 99.999%   31.53ms
100.000%   37.47ms

Reducing the number of requests per second to 4000 gives:

 50.000%    1.51ms
 75.000%    2.07ms
 90.000%    2.59ms
 99.000%    4.22ms
 99.900%    7.73ms
 99.990%   21.07ms
 99.999%   32.08ms
100.000%   34.85ms

Finally, here is a comparison of the Go (250k), Haskell (250k), and OCaml (125k) implementations that I get on my machine (Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz): [histogram]

I am aware that this is not a very valid comparison, but I have probably already spent more time analyzing this benchmark than I should have ;)

spion commented 8 years ago

I found the issue with the memory usage. It turns out OCaml stores every element of a generic array as a machine word, which is 64 bits on most modern machines. This means the program had been creating 8 KB buffers instead of 1 KB ones the whole time.

Switching from an Array of 1K chars to Buffer, which packs one char per byte, resulted in a significant drop in memory usage. You should be able to reproduce the benchmark now.
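A minimal sketch of the size difference (my illustration, not code from the repo; assumes a 64-bit machine and OCaml >= 4.04 for Obj.reachable_words):

(* Illustration only: compare the heap footprint of 1024 chars stored
   as a char array (one machine word per element) vs. a Buffer (one
   byte per char). Obj.reachable_words counts words, headers included. *)
let () =
  let arr = Array.make 1024 'x' in
  let buf = Buffer.create 1024 in
  Buffer.add_string buf (String.make 1024 'x');
  Printf.printf "char array: ~%d bytes\n"
    (8 * Obj.reachable_words (Obj.repr arr));  (* ~8 KB *)
  Printf.printf "Buffer:     ~%d bytes\n"
    (8 * Obj.reachable_words (Obj.repr buf))   (* ~1 KB *)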