Optimize forwarding in eq and ord plugins

sim642 commented 3 years ago

After switching a large project from manual equal and compare implementations to ones derived by ppx_deriving (https://github.com/goblint/analyzer/pull/227), we noticed significant performance decreases in in those functions (https://github.com/goblint/analyzer/issues/265).

Benchmarking the generated equal functions directly (https://github.com/goblint/analyzer/blob/a21f33511183074c693c15269990210b283ae045/bench/deriving/benchEq.ml, executable independently from our project) showed up to 3 times slowdown. For example, when the old manual implementation

let equal (x1, y1) (x2, y2) = Int.equal x1 x2 && String.equal y1 y2

was replaced with one derived for Int.t * String.t. The modules Int and String are just chosen as example, we often use functors where the modules come abstractly from arguments, so using primitive types isn't an option for deriving.

Inefficient derived code

Currently, the following implementation is derived:

let equal =
  let __1 () = String.equal
  and __0 () = Int.equal in
  ((let open! ((Ppx_deriving_runtime)[@ocaml.warning "-A"]) in
    fun (lhs0, lhs1) ->
      fun (rhs0, rhs1) ->
        ((fun x -> (__0 ()) x) lhs0 rhs0) &&
          ((fun x -> (__1 ()) x) lhs1 rhs1))
  [@ocaml.warning "-A"])

The benchmarks confirm that this literal code performs the same as the derived one.

Optimizations

This PR suggests two optimizations for forwarding the equal calls.

Avoid unnecessary eta-expansion

PR #55 introduced eta-expansion to forwarded calls. This was necessary since derived implementations for mutual recursion in the generated let rec must be statically constructive (https://ocaml.org/manual/letrecvalues.html), so expanding a forwarding call into a fun works around the restriction.

The problem in this case is that the eta-expansion is added to all forwarding calls, including those deeper in the resulting equal implementation, where the restriction doesn't apply any more because the outermost expression is a fun already. AFAIK, eta-expansion is only needed at the top level if it's not a fun already (most cases of expr_of_typ are anyway). Therefore this PR implements exactly that. The linked benchmarks show a 50%…300% speedup from this change.

Avoid unnecessary closures in quoting

Issue #57 and commit https://github.com/ocaml-ppx/ppx_deriving/commit/c3bee7b96fb048c1a2431629109b532637c30df7 introduced quoting to forwarded calls. The quoting mechanism introduces functions with () argument. There is no explanation why, but I'm guessing it's to be lazy: in case the quoted expression involves some computation, then don't perform it immediately at the beginning (it might not be necessary due to short-circuiting, etc) but perform it each time it's actually used, just like if the quoted expression where independently at each use.

The problem in this case is that forward calls being quoted are just identifier expressions, which AFAIK cannot be performing any computation on their own. Therefore this PR implements a special case of quoting just identifiers to be without the () argument. The linked benchmarks show a 200%…400% speedup from this change on the pair examples. Just forwarding to a single function doesn't seem to gain any speedup, nor slowdown.

Efficient derived code

As a result of these changes, the new derived code for the example above is:

let equal =
  let __1 = String.equal
  and __0 = Int.equal in
  ((let open! ((Ppx_deriving_runtime)[@ocaml.warning "-A"]) in
      fun (lhs0, lhs1) ->
        fun (rhs0, rhs1) -> (__0 lhs0 rhs0) && (__1 lhs1 rhs1))
    [@ocaml.warning "-A"])

And overall the derived implementations seem to be at least 3 times faster, bringing them mostly on par with the manual implementations.

gasche commented 3 years ago

Thanks for the very clear PR that looks high-quality (I haven't looked at the code yet).

A quick comment:

The quoting mechanism introduces functions with () argument. There is no explanation why, but I'm guessing it's to be lazy: in case the quoted expression involves some computation, then don't perform it immediately at the beginning (it might not be necessary due to short-circuiting, etc) but perform it each time it's actually used, just like if the quoted expression where independently at each use.

In general, suppose we are transforming an expression C[e] (a subexpression e inside a wider context C[...]), where e is some source expression that we want to transform into a desugared/transformed piece of code e´, where e' should be evaluated in the current scope (for example, it refers to some identifier x that we know scopes over C[e]. Transforming this into C[e'] would be incorrect (un-hygienic), because the context C[...] written by the user could contain binders that shadow this x with a different definition; for example C[...] could be fun x -> ... or let open Foo in ..., where Foo defines an x. So instead one may want to translate this term into let __private_identifier = e' in C[__private_identifier], but this is incorrect because it changes the order of evaluation (or the number of types e' is evaluated, etc.). On the other hand, the translation let __private_identifier () = e' in C[__private_identifier ()] is always correct, scope-wise and evaluation-wise. Of course, as you point out it introduces constant-factor inefficiencies in general, and is not necessary when e´ is known to be a constant or an identifier (its evaluation order does not matter, and evaluating it when it is unnecessary does not degrade performances).

(Other translations are possible, such as let c x = C[x] in c e', or let x = e' and c x = C[x] in c x, which do not require generating "private" identifiers; but we digress, and those introduce a function call that is harder to simplify away with local reasoning.)

gasche commented 3 years ago

I have now read the code and I believe it is correct, thanks again!

One minor remark: you left some comments (* eta-expanded if outermost *) in two places in eq and ord each. I'm not sure those leftover comments (in one case a trace of a previous workaround, in the other a bug opportunity that you noticed) really help code understanding. Putting myself in the shoes of a newcomer, it's not clear what those comments mean, and they seem redundant with the comment that exists at the place where eta-expansion does actually take place. I would remove those extraneous comments, and enrich the comment on the actual eta-expansion site with a reference to the present PR that contains a detailed description for more context.

sim642 commented 3 years ago

Thanks for the feedback! I updated both the eta-expansion and the quoter comments to provide some explanation for the special optimized case.

in the other a bug opportunity that you noticed

Indeed, when trying to understand what this eta-expansion and quoting is about, I realized that eta-expansion on the outermost level was missing if a non-fun expression was explicitly provided via [@equal …] or [@compare …]. It took me a while to engineer the added tests because it only matters in the extremely obscure case when such expression refers to another derived one in the same mutually recursive definition (otherwise it's statically constructive by the simplest case) and that requires explicitly referencing the other to-be derived function.

sim642 commented 3 years ago

A couple of additional remarks:

Pexp_ident probably isn't the only case where the quote optimization can be done. For example, I think Pexp_constant and maybe some other combinations could also allow it. But I didn't want to attempt to implement a function to determine the purity of an expression by syntax and Pexp_ident covers the most likely use case of quoting.
This repository doesn't contain any benchmark code but if there's interest, I could also include that to continue living in this repository, since as it turns out small optimizations to derived code can make a big impact.

gasche commented 3 years ago

This repository doesn't contain any benchmark code but if there's interest, I could also include that to continue living in this repository, since as it turns out small optimizations to derived code can make a big impact.

Yes, but how would you robustly check for performance regression, in a way that does not depend on the system on which the tests are run? One way to do this would be to count bytecode instructions, but I don't know of an easy, practical way to track this. (You can run a bytecode binary with the instrumented runtime, ocamlrund -t foo.bc | wc -l, but it's a bit cumbersome to setup.)

gasche commented 3 years ago

Pexp_ident probably isn't the only case where the quote optimization can be done. For example, I think Pexp_constant and maybe some other combinations could also allow it. But I didn't want to attempt to implement a function to determine the purity of an expression by syntax and Pexp_ident covers the most likely use case of quoting.

I wondered the same and I agree. Your optimization makes sense because it corresponds to a well-understood use case (we are not optimized, rather de-pessimizing). I would wait until a clear use-case comes for constants (or functions, etc.) before going further.

sim642 commented 3 years ago

Yes, but how would you robustly check for performance regression, in a way that does not depend on the system on which the tests are run? One way to do this would be to count bytecode instructions, but I don't know of an easy, practical way to track this. (You can run a bytecode binary with the instrumented runtime, ocamlrund -t foo.bc | wc -l, but it's a bit cumbersome to setup.)

Automating it is nontrivial indeed. I suppose one way would be to have an explicit copy of the last known good (i.e. with expected performance) derived code in the benchmark and have another one be derived via the current deriver implementation. Then check if the results significantly differ (beyond some margin of error) to automatically catch pessimization or optimization. And when intentional optimizations are made, just update the explicit copy from there on.

Anyway, this goes beyond what currently would be needed, just a thought about how it might be done.

gasche commented 3 years ago

Yes, having to update the "baseline" on a semi-regular basis is okay, but we want to avoid flaky tests that bring the CI down irregularly. If you want to try to work something out and submit a PR, that would be warmly welcome, but I am not sure how to get a robust test -- besides the ocamlrund trick I mentioned.

(Maybe a simpler approach would be to just store the -dparsetree output of the transformation, so that the test fails and has to be promoted whenever the generated code changes. When people change this test output we can think about the performance difference, possibly by running some benchmarks.)

ocaml-ppx / ppx_deriving