Less tagging in asmcomp-optimized switches

ocamlopt optimizes switches that return only constants (from jump tables to lookup tables). Furthermore it optimizes switches where the constants are all tagged integers that happen to determine an affine transformation of the switch index; this is #8547, implemented by @smuenzel-js.

One aspect of this optimization that is slightly frustrating is that it often generates more tagging noise than we would like. This is particularly visible on identity transformations:

type t = A0 | A1 | A2 | A3

let test = function
  | A0 -> 0
  | A1 -> 1
  | A2 -> 2
  | A3 -> 3

One would reasonably expect test to be just the identity at runtime, but in fact it is not, as shown in the -dcmm output:

(function camlMicro.test_401 (param/403: val)
  (+ (<< (>>s param/403 1) 1) 1))

The present PR tweaks the affine-detection logic to instead generate the identity function for real.

(function camlMicro.test_403 (param/405: val)
  param/405)

This PR was written with the help of @clef-men, who was reading the code of kcas by @polytypic which contains a comment about this infelicity.

Obligatory micro-benchmark

(* micro.ml *)
let n =
  match int_of_string Sys.argv.(1) with
  | n -> n
  | exception _ -> failwith "An integer (iteration count) was expected as first command-line argument."

type t = A0 | A1 | A2 | A3

let test = function
  | A0 -> 0
  | A1 -> 1
  | A2 -> 2
  | A3 -> 3

let () =
  for i = 1 to n do
    ignore (Sys.opaque_identity (test (Sys.opaque_identity A0)));
    ignore (Sys.opaque_identity (test (Sys.opaque_identity A1)));
    ignore (Sys.opaque_identity (test (Sys.opaque_identity A2)));
    ignore (Sys.opaque_identity (test (Sys.opaque_identity A3)));
  done

> hyperfine -L v trunk,branch "./micro.{v}.opt 100_000_000"
Benchmark 1: ./micro.trunk.opt 100_000_000
  Time (mean ± σ):     251.5 ms ±   1.7 ms    [User: 249.3 ms, System: 1.9 ms]
  Range (min … max):   249.2 ms … 254.2 ms    11 runs

Benchmark 2: ./micro.branch.opt 100_000_000
  Time (mean ± σ):      86.3 ms ±   1.5 ms    [User: 83.9 ms, System: 2.0 ms]
  Range (min … max):    83.4 ms …  88.9 ms    33 runs

Summary
  ./micro.branch.opt 100_000_000 ran
    2.91 ± 0.06 times faster than ./micro.trunk.opt 100_000_000

How this works

The reason why there are more tagging operations than expected, in trunk, is that the affine function is computed on the tagged representations of the OCaml integers, rather than their untagged representations. In the example of the identity above, the affine function that is computed is not (0 -> 0, 1 -> 1, 2 -> 2, 3 -> 3) (slope 1, offset 0) but instead (0 -> 1n, 1 -> 3n, 2 -> 5n, 3 -> 7n) (slope 2n, offset 1n). To apply the affine function to an argument, we first untag it, and then apply the affine transformation: arg becomes 2 * (arg >> 1) + 1. The cmm_helpers smart contructors cannot rewrite this into the identity arg, because this rewriting is not correct in general (only for arg that have their least bit set).

In the PR, we instead compute the affine transformation on OCaml (untagged) integers. Instead of computing slope * untag(arg) + offset, we compute tag(slope) *caml arg +caml tag(offset), which generates nicer code and in particular simplifies into just arg when we have (slope=1, offset=0).

There is a subtlety in the PR implementation, which is that the make_switch function responsible for this optimization is called in two different contexts in trunk, sometimes (from cmmgen) on tagged integers and sometimes (from Switch) on untagged integers. If we just change make_switch to not require untagging in cmmgen, but introduce tagging in Switch to compensate, we risk adding a small performance regression in the Switch case. Instead we make the function support both calling contexts, and ensure that the minimal amount of tagging/untagging is performed -- in particular, never more than before.

How to review

The PR is designed to be read commit-by-commit -- first a refactoring that just moves code around, then a test, then the behavior change that updates the test reference.

(cc @smuenzel-js , @lthls )

I understand and approve the aim of this PR, but there are a number of things that I don't like (some of them were already problems before this PR).

The test is useful for checking that the optimisation applies, but I think that once you've checked (and showed) it works you should remove it from the PR. It's way too fragile (even with -no-locations -no-unique-ids you get unique IDs), and the fact that you've restricted it to a single very specific setup (which is already a red flag) means that people that introduce harmless changes cannot promote it unless they reproduce the same config. Plus, if you do want to keep the test, I think it's a mistake not to test the optimization with flambda.
You're removing the possibility of making affine functions out of switches with non-tagged constants. I don't think it matters very much currently, as the unboxing algorithm is not good enough to generate useful switches for boxed types, but in the future I would like to see code like this compiled to an affine function:
```
type t = A | B | C | D
let f x =
let y : Int64.t =
match x with
| A -> 0L
| B -> 2L
| C -> 4L
| D -> 6L
in
Int64.add y 1L
```
It's annoying that we have to wait until Cmm to detect affine switches on OCaml integers. If we did it in Flambda (and Clambda), not only would the code be simpler, but as a bonus Flambda and Closure would be more accurate in estimating the code size of these constant switches, meaning that an identity function with lots of cases would be trivially inlined all the time even without annotations. I do understand that we would still need this logic in Cmm for switches that are not integer-only, but at least we would only have to deal with untagged arguments here.

I still want this PR to move forward though. Please tell me how you want to take my above remarks into account and I'll review with that in mind.

The testsuite-stability problem is on my mind, and I was vaguely thinking of providing more options to control it better. (I thought of writing a post-processing step to remove the noisy parts away, of improving the implenetation of -dno-unique-ids, and/or of adding an option to show the cmm only for a specific option. I wasn't motivated enough to do any of it yet.) I agree to follow your suggestion of removing the test after approval, although I would note that in practice these tests are useful to me over time, and can protect from code-quality regressions. (I have committed (in both senses of the term) tests that store -dlambda output, and when they evolve because code-generation changes I usually find the results relevant and interesting.) This is a problem that will show up with other PRs, for example the exact same how-to-test-the-cmm problem occurs with atomic record fields.

You're removing the possibility of making affine functions out of switches with non-tagged constants.

No, I think that this possibility was not present in trunk already. If you look at extract_uconstant, it returns None for constants that have least bit 0, so neither tablification nor affinification can apply.

It's annoying that we have to wait until Cmm to detect affine switches on OCaml integers.

We could decide to move the optimization to the Lambda level, and/or we could package it as a functor like the Switch module to use it in asmcomp/ and lambda/ without too much code duplication, or maybe have it in both closure and flambda if that makes sense. (But I won't claim that thinking about Switch in both contexts is my favorite part of the code either.) I would propose to consider this outside the scope of this PR.

One thing that I forgot to mention: these optimizations (tables and affine transformations) are carefully tested by testsuite/tests/basic/switch_opts.ml, which exercises both code paths (on tagged and untagged integers). I know because intermediate versions of the PR had correctness bugs and they were reliably caught by this test. I believe that this test is a good argument for correctness of the new version; the extra -dcmm test that I added myself is there to also check that the change actually improves the generated code (slightly).

Thanks! I removed (with regret) the -dcmm tests, as we discussed.

ocaml / ocaml