ocaml-flambda / flambda-backend

The Flambda backend project for OCaml
94 stars 68 forks source link

Code generation for unboxed computations #2441

Open xclerc opened 3 months ago

xclerc commented 3 months ago

(I am not positive this issue is not a duplicate of #1783)

The following code:

let[@inline never] get_one () = #1.

let f () =
  let acc = #0. in
  let one = get_one () in
  let acc = Stdlib__Float_u.add acc one in
  let one = get_one () in
  let acc = Stdlib__Float_u.add acc one in
  let one = get_one () in
  let acc = Stdlib__Float_u.add acc one in
  let one = get_one () in
  let acc = Stdlib__Float_u.add acc one in
  let one = get_one () in
  let acc = Stdlib__Float_u.add acc one in
  let one = get_one () in
  let acc = Stdlib__Float_u.add acc one in
  let one = get_one () in
  let acc = Stdlib__Float_u.add acc one in
  let one = get_one () in
  let acc = Stdlib__Float_u.add acc one in
  acc

results in the following assembly:

        movl    $1, %eax
        call    camlT0__get_one_0_2_code@PLT
.L113:
        movsd   %xmm0, (%rsp)
        movl    $1, %eax
        call    camlT0__get_one_0_2_code@PLT
.L114:
        movsd   %xmm0, 8(%rsp)
        movl    $1, %eax
        call    camlT0__get_one_0_2_code@PLT
.L115:
        movsd   %xmm0, 16(%rsp)
        movl    $1, %eax
        call    camlT0__get_one_0_2_code@PLT
.L116:
        movsd   %xmm0, 24(%rsp)
        movl    $1, %eax
        call    camlT0__get_one_0_2_code@PLT
.L117:
        movsd   %xmm0, 32(%rsp)
        movl    $1, %eax
        call    camlT0__get_one_0_2_code@PLT
.L118:
        movsd   %xmm0, 40(%rsp)
        movl    $1, %eax
        call    camlT0__get_one_0_2_code@PLT
.L119:
        movsd   %xmm0, 48(%rsp)
        movl    $1, %eax
        call    camlT0__get_one_0_2_code@PLT
.L120:
        movapd  %xmm0, %xmm1
        xorpd   %xmm0, %xmm0
        movsd   (%rsp), %xmm2
        addsd   %xmm2, %xmm0
        movsd   8(%rsp), %xmm2
        addsd   %xmm2, %xmm0
        movsd   16(%rsp), %xmm2
        addsd   %xmm2, %xmm0
        movsd   24(%rsp), %xmm2
        addsd   %xmm2, %xmm0
        movsd   32(%rsp), %xmm2
        addsd   %xmm2, %xmm0
        movsd   40(%rsp), %xmm2
        addsd   %xmm2, %xmm0
        movsd   48(%rsp), %xmm2
        addsd   %xmm2, %xmm0
        addsd   %xmm1, %xmm0
        addq    $56, %rsp

It consumes many stack slots because the calls are pushed up, making the variables live from the calls to the end of the computation as shown by the CMM expression:

(function{t0.ml:3,6-587} camlT0__f_1_3_code (param/385: int)
 (let
   (one/386 (app{t0.ml:5,12-22} G:"camlT0__get_one_0_2_code" 1 float)
    one/388 (app{t0.ml:7,12-22} G:"camlT0__get_one_0_2_code" 1 float)
    one/390 (app{t0.ml:9,12-22} G:"camlT0__get_one_0_2_code" 1 float)
    one/392 (app{t0.ml:11,12-22} G:"camlT0__get_one_0_2_code" 1 float)
    one/394 (app{t0.ml:13,12-22} G:"camlT0__get_one_0_2_code" 1 float)
    one/396 (app{t0.ml:15,12-22} G:"camlT0__get_one_0_2_code" 1 float)
    one/398 (app{t0.ml:17,12-22} G:"camlT0__get_one_0_2_code" 1 float)
    one/400 (app{t0.ml:19,12-22} G:"camlT0__get_one_0_2_code" 1 float))
   (+f
     (+f
       (+f
         (+f (+f (+f (+f (+f 0. one/386) one/388) one/390) one/392) one/394)
         one/396)
       one/398)
     one/400)))

The pattern is similar to the one discussed in #1783, even though it does not end up with an allocation.

Callee-saved registers, or a tweaked allocation strategy for small leaf functions may help. Layout polymorphism could also help, by making acc a reference (it can be done "manually" by defining a record with a mutable field with the specific layout). As noted in #1783, it might be possible to tweak to_cmm to avoid pushing the calls up.

Gbury commented 2 months ago

Technically, to_cmm pushes the additions down rather than pushing the calls up, but that's a detail, ^^

That being said, from my point of view, the problem is how to decide when to push things down vs when not to, and from what I can tell right now, that answer seems highly dependant on the context: i.e. the number of live values, whether pushing float operations down actually makes some cmm-level optimizations possible, etc... All of these criterion are complex and not very local, and therefore, I don't think to_cmm is the right place to decide what to do: we could envision a scheme where the decision is made by flambda2 and then carried out by to_cmm (if it's easier to do there rather than in simplify).