ocaml-flambda / flambda-backend

The Flambda backend project for OCaml
92 stars 65 forks source link

Push integer boxing allocation down through conditional branches #2251

Open alanechang opened 5 months ago

alanechang commented 5 months ago

Would be great for integer boxing to be pushed down through conditional branches more so the common path of a function doesn't allocate.

When compiling this toy example:

module Int32_u = Stdlib__Int32_u

let ( = ) = Int32.equal
let ( <= ) x y = (Int32.compare x y) <= 0

let[@inline always] check x =
  if x <= 0l
  then (Sys.opaque_identity(x); assert false);
  x

let[@inline always] boxed_func i =
  if i = 0l
  then check i
  else check i

let unboxed_func x =
  Int32_u.of_int32 (boxed_func (Int32_u.to_int32 x))

-O3 produces the following cmm:

(function{debug_test.ml:169,17-73} camlDebug_test__unboxed_func_3_7_code
     (x/492: int)
 (let
   Pbox_int/493
     (alloc{debug_test.ml:170,31-51} 2303 G:"caml_int32_ops"
       (>>s (<< x/492 32) 32))
   (if (== (+ (<< (- (> x/492 0) (< x/492 0)) 1) 1) 1)
     (if (<= x/492 0)
       (let Popaque/500 (opaque Pbox_int/493)
         (raise_notrace{debug_test.ml:170,19-52;debug_test.ml:166,7-14;debug_test.ml:161,32-44}
           G:"camlDebug_test__Pmakeblock36"))
       x/492)
     (if (<= x/492 0)
       (let Popaque/498 (opaque Pbox_int/493)
         (raise_notrace{debug_test.ml:170,19-52;debug_test.ml:167,7-14;debug_test.ml:161,32-44}
           G:"camlDebug_test__Pmakeblock36"))
       x/492))))

This code allocates on every call due to Pbox_int/493. However, the boxed variable is really only used in the assert false cases. Would be great for Pbox_int/493 to be inlined at the two reference sites.

-Oclassic output for reference:

(function{debug_test.ml:169,17-73} camlDebug_test__unboxed_func_3_3
     (x/546: int)
 (catch
   (if
     (==
       (+
         (<<
           (let int_cmp/555 (>>s (<< x/546 32) 32)
             (- (> int_cmp/555 0) (< int_cmp/555 0)))
           1)
         1)
       1)
     (if
       (<=
         (<<
           (let int_cmp/583 (>>s (<< x/546 32) 32)
             (- (> int_cmp/583 0) (< int_cmp/583 0)))
           1)
         1)
       (let
         Popaque/591
           (opaque
             (alloc{debug_test.ml:170,31-51} 2303 G:"caml_int32_ops"
               (>>s (<< x/546 32) 32)))
         (raise_notrace{debug_test.ml:170,19-52;debug_test.ml:166,7-14;debug_test.ml:161,32-44}
           G:"camlDebug_test__s5"))
       (exit 2
         (alloc{debug_test.ml:170,31-51} 2303 G:"caml_int32_ops"
           (>>s (<< x/546 32) 32))))
     (if
       (<=
         (<<
           (let int_cmp/568 (>>s (<< x/546 32) 32)
             (- (> int_cmp/568 0) (< int_cmp/568 0)))
           1)
         1)
       (let
         Popaque/576
           (opaque
             (alloc{debug_test.ml:170,31-51} 2303 G:"caml_int32_ops"
               (>>s (<< x/546 32) 32)))
         (raise_notrace{debug_test.ml:170,19-52;debug_test.ml:167,7-14;debug_test.ml:161,32-44}
           G:"camlDebug_test__s5"))
       (exit 2
         (alloc{debug_test.ml:170,31-51} 2303 G:"caml_int32_ops"
           (>>s (<< x/546 32) 32)))))
 with(2 apply_result/547: val) (load signed int32 (+a apply_result/547 8))))

@Gbury Is this something you can help with? Thanks!

Gbury commented 5 months ago

Short answer

There is no quick solution, we would need PDCE (partial dead code elimination) for that.

Now for the long answer

The main problem here is that the boxed version is used in multiple places, therefore neither to_cmm nor flambda2 will currently do anything to that allocation. to_cmm has a mechanism to duplicate and substitute primitives (such as the integer boxing primitives used in this case) that could in this very specific example be used to achieve the desired result (i.e. pushing the allocation down in each branch), and actually, when compiling in classic mode, the example is indeed simplified as wanted. However, that mechanism blindly substitutes any integer boxing at its use site, essentially replacing one allocation (the original boxing) by n allocations where n is the number of times the original allocated value is used. In general, doing this actually degrades code and makes it worse. This is why it is only activated in classic mode and for boxed numbers, where it is meant to roughly imitate what cmmgen does. It would be a bad idea to do that all the time, so we would need a criterion for when to do that optimization.

That optimisation would need to be able to move and duplicate allocations (and/or other primitives ? @lthls suggested this could be seen as an extension of CSE), but only do so when the evaluation of the primitive at runtime would not be duplicated (this is what PDCE does iiuc), so mostly when it can be pushed down in some branches but not all of them (because in that case it's not interesting). There is likely some design space around the exact heuristic to decide when to do so: whenever not all branches use the primitive result ? with a ratio/difference of these branches ? or maybe when all uses are in "cold" branches ?

In any case, optimizing such code requires a well-crafted heuristic, which itself requires a view of a while function's body. It makes sense to try and do this during the flow analysis we already do, provided that we can do it without slowing things down too much.

Lastly, let's note that such an optimization is better done during simplify/the new global pass, as is the case with most (if not all) optimisations, rather than in to_cmm which should stay as a "simple" translation: after all it is the goal and design that flambda2 terms are better for such optimisations.