inhabitedtype / faraday

Serialization library built for speed and memory efficiency

Refactor dequeue to return option #51

Open · Lupus opened 5 years ago

Lupus commented 5 years ago

When compiling to JS (via js_of_ocaml), local exceptions used for control flow become very expensive (on V8 at least). This PR tries to alleviate some of the places where functions with local exceptions appeared at the top of the Node.js profile.

See #835 in the js_of_ocaml repo.
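
For context, here is a minimal sketch (illustrative names only, not faraday's actual code) of the two dequeue styles this PR trades between:

(* Exception for control flow: cheap in native OCaml, but js_of_ocaml
   compiles it into a JS throw/catch, which V8 handles on a slow path. *)
let rec drain_exn q =
  match Queue.pop q with
  | x -> ignore x; drain_exn q
  | exception Queue.Empty -> ()

(* Option-returning: the empty case becomes an ordinary branch, at the
   cost of allocating a [Some _] block on every successful dequeue. *)
let rec drain_opt q =
  match Queue.take_opt q with
  | Some x -> ignore x; drain_opt q
  | None -> ()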

Lupus commented 5 years ago

Hm, isn't that option nearly guaranteed not to escape the minor heap? In that case the allocation will only cost about as much as an addition to a machine word in memory (which will likely be in CPU cache, if not in a register). A minor-heap GC is just a move of that machine word back, after promoting all live objects to the major heap, which is the most expensive part. But maybe I'm just plain wrong on this, as I'm not an expert in the OCaml GC.

Explicit checks for queue lengths would look a bit ugly... Do you have some way of profiling this? I'm not experienced with the proper tools for profiling OCaml code either.
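
For a rough stdlib-only check of the allocation cost (a sketch, not a proper profile): Gc.minor_words counts the words allocated on the minor heap since program start, and a Some x holding an immediate value is a 2-word block (header plus payload), so this loop should report about 2 words per iteration when built without flambda:

(* Count minor-heap words allocated by repeatedly building [Some i]. *)
let () =
  let n = 1_000_000 in
  let r = ref None in
  let before = Gc.minor_words () in
  for i = 1 to n do
    r := Some i   (* one fresh option block per iteration *)
  done;
  ignore !r;
  let after = Gc.minor_words () in
  Printf.printf "words per iteration: %.2f\n"
    ((after -. before) /. float_of_int n)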

Lupus commented 5 years ago

I've run some benchmarks with httpaf (while troubleshooting httpaf/139), and it looks like this PR does not introduce a slowdown but actually improves performance a bit. It also fixes the stack overflow described in the linked httpaf issue. Here are some numbers:

[image: table of benchmark numbers for vanilla vs. patched builds under the development and release dune profiles]

Each run is a curl invocation performing a POST of a large file to the lwt_echo_post example from the httpaf repository:

curl -H"Expect:" -XPOST -d @very_big_file -o/dev/null  http://0.0.0.0:8080/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7240M    0 3620M  100 3620M   513M   513M  0:00:07  0:00:07 --:--:-- 1037M

Development and release correspond to the dune profiles used to build the examples.

seliopou commented 5 years ago

I may be misunderstanding your column labels and the units that are being reported. But assuming that "patched" is this PR and the numbers are latencies, it looks like the patched version is slightly slower.

In any case, the point of avoiding allocation is not to avoid the cost of allocation, but to avoid the cost of garbage collection. The only way to collect less is to generate less garbage, which was my explicit goal when writing the Deque module. I'd like to preserve that property.
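
Roughly, the property I want to keep on the hot path is zero allocation. A sketch (not faraday's actual types) of the shape of the two call sites:

(* With an explicit emptiness check, draining allocates nothing. *)
let drain ~is_empty ~dequeue_exn ~handle t =
  while not (is_empty t) do
    handle (dequeue_exn t)   (* no [Some _] wrapper is ever built *)
  done

An option-returning dequeue instead builds one short-lived Some block per element, and every one of those blocks brings the next minor collection closer.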

Lupus commented 5 years ago

Sorry for the poorly presented benchmark results; the columns are megabytes per second, as reported in the curl output (the "Average Speed" Dload/Upload columns). "patched" is this PR; "vanilla" is e82d6a0.

I've added printing of GC stats at server exit and compared the numbers; the option-based version indeed increases the amount of work for the GC. I've implemented the approach you suggested as follows:

diff --git a/lib/faraday.ml b/lib/faraday.ml
index f25db14..4429e7a 100644
--- a/lib/faraday.ml
+++ b/lib/faraday.ml
@@ -39,8 +39,6 @@ type 'a iovec =
   ; off    : int
   ; len    : int }

-exception Dequeue_empty
-
 module Deque(T:sig type t val sentinel : t end) : sig
   type elem = T.t

@@ -95,13 +93,10 @@ end = struct
     t.back <- t.back + 1

   let dequeue_exn t =
-    if is_empty t then
-      raise Dequeue_empty
-    else
-      let result = Array.unsafe_get t.elements t.front in
-      Array.unsafe_set t.elements t.front sentinel;
-      t.front <- t.front + 1;
-      result
+    let result = Array.unsafe_get t.elements t.front in
+    Array.unsafe_set t.elements t.front sentinel;
+    t.front <- t.front + 1;
+    result

   let enqueue_front e t =
     (* This is in general not true for Deque data structures, but the usage
@@ -367,21 +362,22 @@ let yield t =
   t.yield <- true

 let rec shift_buffers t written =
-  try
+  if not (Buffers.is_empty t.scheduled) then begin
     let { len; _ } as iovec = Buffers.dequeue_exn t.scheduled in
     if len <= written then begin
       shift_buffers t (written - len)
     end else
       Buffers.enqueue_front (IOVec.shift iovec written) t.scheduled
-  with Dequeue_empty ->
+  end else begin
     assert (written = 0);
     if t.scheduled_pos = t.write_pos then begin
       t.scheduled_pos <- 0;
       t.write_pos <- 0
     end
+  end

 let rec shift_flushes t =
-  try
+  if not (Flushes.is_empty t.flushed) then begin
     let (threshold, f) as flush = Flushes.dequeue_exn t.flushed in
     (* Edited notes from @dinosaure:
      *
@@ -401,8 +397,7 @@ let rec shift_flushes t =
     if t.bytes_written - min_int >= threshold - min_int
     then begin f (); shift_flushes t end
     else Flushes.enqueue_front flush t.flushed
-  with Dequeue_empty ->
-    ()
+  end else ()

 let shift t written =
   shift_buffers t written;
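
(For reference, the GC stats in the outputs below come from dumping the runtime counters when the server process exits; a minimal way to do that, assuming a plain at_exit hook suffices, is:)

(* Print GC statistics to stdout when the process exits. *)
let () = at_exit (fun () -> Gc.print_stat stdout)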

Then I've run 3 tests with wrk to measure latencies. Below are the outputs of wrk, each followed by the output of Gc.print_stat from the server process.

Vanilla version

$ wrk -t1 -c 200 -d 10s  http://localhost:8080/                                                       
Running 10s test @ http://localhost:8080/                                                                                                                                    
  1 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev                                                                                                                          
    Latency     3.58ms  452.77us   8.50ms   81.94%                                                                                                                           
    Req/Sec    55.92k     2.93k   59.07k    81.00%
  555941 requests in 10.01s, 1.08GB read
Requests/sec:  55546.71
Transfer/sec:    110.93MB

minor_collections: 2297
major_collections: 384
compactions:       0

minor_words:    551697546
promoted_words:  67680984
major_words:     67682138

top_heap_words: 491520
heap_words:     491520
live_words:     239168
free_words:     252333
largest_free:   208523
fragments:          19

live_blocks: 66162
free_blocks: 767
heap_chunks: 1

Patched (option)

$ wrk -t1 -c 200 -d 10s  http://localhost:8080/
Running 10s test @ http://localhost:8080/
  1 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.53ms  445.94us   6.88ms   81.91%
    Req/Sec    56.71k     3.48k   60.36k    84.00%
  564045 requests in 10.01s, 1.10GB read
Requests/sec:  56371.60
Transfer/sec:    112.57MB

minor_collections: 2339
major_collections: 390
compactions:       0

minor_words:    562042048
promoted_words:  68913676
major_words:     68914830

top_heap_words: 491520
heap_words:     491520
live_words:     232971
free_words:     258425
largest_free:   124108
fragments:         124

live_blocks: 64418
free_blocks: 39812
heap_chunks: 1

Patched (is_empty call)

$ wrk -t1 -c 200 -d 10s  http://localhost:8080/
Running 10s test @ http://localhost:8080/
  1 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.68ms  481.83us   9.13ms   75.55%
    Req/Sec    54.59k     3.33k   58.81k    77.00%
  542740 requests in 10.01s, 1.06GB read
Requests/sec:  54234.01
Transfer/sec:    108.31MB

minor_collections: 2240
major_collections: 373
compactions:       0

minor_words:    538418413
promoted_words:  66124974
major_words:     66126128

top_heap_words: 491520
heap_words:     491520
live_words:     206070
free_words:     285385
largest_free:    30362
fragments:          65

live_blocks: 56328
free_blocks: 53155
heap_chunks: 1

I'm not sure why, but the direct is_empty check is the slowest, although it generates the least amount of work for the GC.
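
If someone wants to poke at this further, a standalone micro-benchmark along these lines (a sketch using the stdlib Queue rather than faraday's Deque, timed with Sys.time) should isolate the three dequeue styles from the rest of the server:

let time name f =
  let t0 = Sys.time () in
  f ();
  Printf.printf "%-10s %.3fs\n" name (Sys.time () -. t0)

let () =
  let n = 10_000_000 in
  let make () =
    let q = Queue.create () in
    for i = 1 to n do Queue.push i q done;
    q
  in
  (* Style 1: drain until an exception signals emptiness. *)
  time "exn" (fun () ->
    let q = make () in
    try while true do ignore (Queue.pop q) done
    with Queue.Empty -> ());
  (* Style 2: option-returning dequeue, one [Some _] per element. *)
  time "option" (fun () ->
    let q = make () in
    let rec go () =
      match Queue.take_opt q with Some _ -> go () | None -> ()
    in
    go ());
  (* Style 3: explicit emptiness check before each dequeue. *)
  time "is_empty" (fun () ->
    let q = make () in
    while not (Queue.is_empty q) do ignore (Queue.pop q) done)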