ocaml-flambda / flambda-backend

The Flambda backend project for OCaml

Imperative loops and locally allocated stuff #3293

Open patrick-nicodemus opened 2 days ago

patrick-nicodemus commented 2 days ago

I have downloaded the "bleeding edge" fork of Flambda2 here - https://github.com/janestreet/opam-repository/tree/with-extensions

Here is some code to add up the numbers from 1 to n using floating point arithmetic.

let sum n =
  let acc = ref 0. in
  let k = ref n in
  while !k != 0 do
    acc := !acc +. float_of_int !k;
    k := !k - 1
  done;
  !acc

The normal OCaml compiler generates the following cmm code which contains a local mutable accumulator variable:

(function{bin/main.ml:11,8-138} camlDune__exe__Main.sum_270 (n/272: val)
 (let_mut acc/542: float 0.
   (let_mut k/274: int n/272
     (catch
       (catch rec (exit 5) with(5)
         (if (!= k/274 1)
           (seq (assign acc/542 (+f acc/542 (floatofint (>>s k/274 1))))
             (assign k/274 (+ k/274 -2)))
           (exit 4))
         (exit 5))
     with(4) []) (alloc 1277 acc/542))))

The "bleeding edge" compiler generates the following cmm code, which allocates every loop:

       (catch rec (exit 8 L:"camlDune__exe__Main__float14" 2000001 0.) with(8
         acc/579: val k/580: int unboxed_float/581: float)
         (if (!= k/580 1)
           (let prim/584 (+f unboxed_float/581 (int->float (>>s k/580 1)))
             (exit 8 (alloc{bin/main.ml:15,11-34} 1277 prim/584) (+ k/580 -2)
               prim/584))
           (exit 7 acc/579)))

Is there a way to make this allocation local?

patrick-nicodemus commented 2 days ago

I also have a related question about how to get good performance. I don't understand the performance characteristics I observe in the following code:

let pi_approx : int -> float = fun n ->
  let rec pi_sum k acc =
    if k = 0 then acc
    else pi_sum (k - 1) (acc +. 1. /. (let a = float_of_int k in a *. a))
  in
  Float.sqrt @@ (6. *. pi_sum n 0.)
;;

let pi_approx_while n =
  let acc = ref 0. in
  let k = ref n in
  while !k != 0 do
    acc := !acc +. 1./. (let a = float_of_int !k in a *. a);
    k := !k - 1
  done;
  Float.sqrt @@ 6. *. !acc

let () =
  let t0 = Unix.gettimeofday () in
  let b = (pi_approx 1_000_000) in
  let t1 = Unix.gettimeofday () in
  Printf.printf "%f %f\n" (t1 -. t0) b;

  let t0 = Unix.gettimeofday () in 
  let s = pi_approx_while 1_000_000 in
  let t1 = Unix.gettimeofday () in
  Printf.printf "%f %f\n" (t1 -. t0) s
;;

On my machine, the standard ocamlopt compiler has a runtime of 3.7ms for the first function and 2.3ms for the second, while the "bleeding edge" compiler takes 7ms and 5.4ms respectively, which seems like a significant difference to me. I looked at the cmm code again and didn't see any allocations in the tight loop generated by the "bleeding edge" compiler. How can I understand where the difference in time is coming from? (I also annotated the pi_approx loop with exclave_ and local_ annotations, and it is actually slower when I use these features.)

lthls commented 2 days ago

Is there a way to make this allocation local?

It's a known issue: if the result of a loop is returned or used directly, unboxing fails (partly because unboxing would introduce an unnecessary boxing when the loop is not taken at all, but mostly because we haven't gotten around to doing it properly yet). There is a workaround, which is to write !acc +. 0. at the end.
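Applied to the sum function from the report above, the workaround looks roughly like this (sum_workaround is just an illustrative name; the trailing +. 0. is the only change):

let sum_workaround n =
  let acc = ref 0. in
  let k = ref n in
  while !k != 0 do
    acc := !acc +. float_of_int !k;
    k := !k - 1
  done;
  (* The extra +. 0. means the loop result is no longer returned directly,
     which sidesteps the unboxing limitation described above. *)
  !acc +. 0.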

Regarding your second post, here are some stats from perf (with only pi_approx, although something similar happens with the second function):

$ perf stat -d ./compute_pi_5.2
0.004236 3.141592

 Performance counter stats for './compute_pi_5.2':

              5.48 msec task-clock                       #    0.923 CPUs utilized          
                 1      context-switches                 #  182.627 /sec                   
                 1      cpu-migrations                   #  182.627 /sec                   
               784      page-faults                      #  143.180 K/sec                  
        15,276,784      cycles                           #    2.790 GHz                      (58.14%)
        22,995,216      instructions                     #    1.51  insn per cycle         
         4,071,826      branches                         #  743.627 M/sec                  
            17,035      branch-misses                    #    0.42% of all branches        
         4,330,415      L1-dcache-loads                  #  790.853 M/sec                  
           368,035      L1-dcache-load-misses            #    8.50% of all L1-dcache accesses  (41.86%)
$ perf stat -d ./compute_pi_flambda2
0.011873 3.141592

 Performance counter stats for './compute_pi_flambda2':

             12.99 msec task-clock                       #    0.967 CPUs utilized          
                 0      context-switches                 #    0.000 /sec                   
                 0      cpu-migrations                   #    0.000 /sec                   
               180      page-faults                      #   13.852 K/sec                  
        31,458,941      cycles                           #    2.421 GHz                      (24.72%)
        16,366,282      instructions                     #    0.52  insn per cycle           (55.43%)
         2,487,257      branches                         #  191.411 M/sec                    (86.19%)
            13,462      branch-misses                    #    0.54% of all branches        
         1,647,826      L1-dcache-loads                  #  126.811 M/sec                  
             9,544      L1-dcache-load-misses            #    0.58% of all L1-dcache accesses  (75.28%)

The important point is that the Flambda2 version executes significantly fewer instructions and performs far fewer memory loads (in particular, cache misses), but is still much slower despite that. This looks like a loop alignment issue: if I add [@inline never] to pi_sum, the generated code becomes equivalent to the one from 5.2, and I get the same performance (including the extra allocations). But if, in addition, I add +. 0. to the result of pi_sum to remove the allocations, I get something even worse: the number of allocations and cycles goes down, but the frequency drops significantly (to less than 1GHz) and the final result is a much slower program.
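For concreteness, a sketch of the combination described above, assuming the +. 0. is applied where pi_sum's result is consumed (the name is illustrative, everything else matches the original code):

let pi_approx_noinline : int -> float = fun n ->
  (* [@inline never] keeps pi_sum out of line, as in the experiment above. *)
  let[@inline never] rec pi_sum k acc =
    if k = 0 then acc
    else pi_sum (k - 1) (acc +. 1. /. (let a = float_of_int k in a *. a))
  in
  (* The +. 0. stops the loop result from being used directly,
     the same workaround as for the while loop earlier. *)
  Float.sqrt @@ (6. *. (pi_sum n 0. +. 0.))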

Bottom line: processors are weird, and micro-benchmarks are very likely to be impacted by the CPU's heuristics in unpredictable ways.

patrick-nicodemus commented 2 days ago

@lthls Thank you very much for your help. For pi_approx I was able to get good performance by using float# instead of float and replacing the operations with their Stdlib_upstream_compatible.Float_u counterparts, and I am happy with this. I am also able to reproduce your observation with [@inline never], and it makes a lot of sense. I will take away the lesson that I should be humble about interpreting the results of these micro-benchmarks.
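A rough sketch of what that rewrite looks like, assuming Float_u mirrors Stdlib.Float (of_int, of_float, to_float, add, mul, div); treat it as illustrative rather than exact:

open Stdlib_upstream_compatible

let pi_approx_unboxed : int -> float = fun n ->
  (* acc has layout float64, so the accumulator is never boxed in the loop. *)
  let rec pi_sum k (acc : float#) =
    if k = 0 then acc
    else
      let a = Float_u.of_int k in
      let term = Float_u.div (Float_u.of_float 1.) (Float_u.mul a a) in
      pi_sum (k - 1) (Float_u.add acc term)
  in
  Float.sqrt (6. *. Float_u.to_float (pi_sum n (Float_u.of_float 0.)))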

lthls commented 2 days ago

You're welcome. And thank you for trying out our code and reporting your findings; it's always hard to estimate the impact of our optimising decisions on real code, and feedback like yours is always valuable (even if it only points out things we already know about).

chambart commented 21 hours ago

I looked a bit more into your pi_approx example, and this is a very 'fun' one!

The difference between the boxed and unboxed versions is that in the unboxed version the processor is able to follow all the dependencies between computations, because they happen in registers, while in the boxed one it can't.

Here is the loop in the unboxed version.

.L105:
    cmpq    $1, %rax
    jne .L109
...
.L109:
    movq    %rax, %rbx
    sarq    $1, %rbx
;;      pxor    %xmm0, %xmm0 (added instruction)
    cvtsi2sdq   %rbx, %xmm0
    mulsd   %xmm0, %xmm0
    movsd   .L115(%rip), %xmm2
    divsd   %xmm0, %xmm2
    movapd  %xmm1, %xmm0
    addsd   %xmm2, %xmm0
    addq    $-2, %rax
    movapd  %xmm0, %xmm1
    jmp .L105

The processor computes the dependencies and assumes that there is a dependency between the result of the conversion from int to float cvtsi2sdq and the division divsd.

Hence, it needs to wait for the division to finish before starting the conversion, and vice versa. The problem is that the division and the conversion are both instructions with very high latencies. But of course this dependency is completely useless: cvtsi2sdq does not really use its %xmm0 argument! (It only overwrites the low 64 bits of %xmm0 and preserves the upper bits, so the hardware treats the old value of the register as an input even though the preserved bits are never read here.) It seems that Intel didn't bother optimising this case away. It is possible to help the processor by adding an instruction that makes it obvious that the dependency doesn't exist: a pxor that clears the register.

Just this simple change turns a program running at ~0.5 instructions per cycle into one that runs at ~3.5 IPC.

We should probably consider systematically emitting the clearing instruction before cvtsi2sdq.

Thank you for the report !