ocaml / dune

A composable build system for OCaml.
https://dune.build/
MIT License

Fix exponential blow up in Memo #4097

Closed: rgrinberg closed this issue 3 years ago

rgrinberg commented 3 years ago

Memoized fiber functions exhibit an exponential blow up in the number of exceptions they produce if they raise. The test case in https://github.com/ocaml/dune/pull/4036 demonstrates this. There were a couple of suggestions on how to fix this, but I'm having trouble finding them in the bug tracker. In any case, we must fix this before releasing 2.8.

gasche commented 3 years ago

I had a look at this out of curiosity (I was coming to report an issue but the name of this one is very catchy!).

The blowup example is a recursive function that calls itself twice at each call. Without memoization, it would take exponential time. It is also an asynchronous/fiber function: the two calls are made in two forked fibers, and the leaf case raises. The current execution semantics of fibers specifies that all fibers will run (even in the presence of errors in some fibers), and all exceptions will be collected. As a consequence, the non-memoized version would take exponential time and return an exponential number of exceptions.

The fact that the memoized version returns an exponential list of exceptions is arguably not a bug of memoization: it is faithful to the original computation. The part that is frustrating is that the memoized computation graph is linear (thanks to sharing), so we would expect a linear runtime, but we get an exponential runtime just building all these exception results.
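To make that concrete, here is a minimal self-contained model of the situation in plain OCaml. Nothing below is dune code: a Hashtbl stands in for Memo, ordinary recursion stands in for the two forked fibers, and a bare exn list stands in for Exn_with_bt.t list. The cache keeps the number of computed nodes linear, yet concatenating the error lists of the two sub-calls still produces 2^n exceptions:

```ocaml
(* Self-contained model: a Hashtbl plays the role of Memo, plain recursion
   plays the role of the two forked fibers, and errors are collected into a
   list as in the collect-all-errors semantics described above. *)
exception Leaf_failed

type result_t = (unit, exn list) result

let computed = ref 0

let cache : (int, result_t) Hashtbl.t = Hashtbl.create 16

let rec f (n : int) : result_t =
  match Hashtbl.find_opt cache n with
  | Some r -> r
  | None ->
    incr computed;
    let r =
      if n = 0 then Error [ Leaf_failed ]
      else
        (* Both "forks" run and their error lists are concatenated. *)
        match (f (n - 1), f (n - 1)) with
        | Ok (), Ok () -> Ok ()
        | Error e1, Error e2 -> Error (e1 @ e2)
        | Error e, Ok () | Ok (), Error e -> Error e
    in
    Hashtbl.add cache n r;
    r

let () =
  match f 20 with
  | Ok () -> ()
  | Error errors ->
    Printf.printf "computed %d nodes, collected %d exceptions\n" !computed
      (List.length errors)
(* prints: computed 21 nodes, collected 1048576 exceptions *)
```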

Even if it is not a bug, how could it be fixed? I see several broad approaches:

  1. You could decide to keep reporting all exceptions but share them: represent the exceptions not as a list, but with a tree (a rough sketch of such a tree appears after this list). If two identical nodes of the original computation graph return the same subtree of memoized exceptions, then the resulting exception tree will have an exponential number of leaves, but a linear size in memory thanks to sharing. One problem with this sharing-based solution is that it can be very fragile to further computations on the structure (which may lose sharing). If you naively "map" on such a tree, you duplicate the subtrees again and you are back to an exponential size. And the current implementation does a lot of mapping, to add backtraces etc., and there is a lot of back-and-forth between the memoization runtime and the fiber runtime. There are ways to keep sharing (have an "abstract map" as another constructor of the datatype instead of mapping eagerly, etc.), but I wouldn't trust this complex system to be able to preserve this sharing, and mistakes that lose sharing tend to be silent and annoying to track down. Not recommended.

  2. You could decide to weaken the semantics of fibers to not guarantee that all sub-fiber exceptions are reported, so that you keep a subset of the exceptions (typically: the first one, either in source order or in actually-executed order). (Or provide different forking operators with this new semantics, etc.) This seems rather easy to do and could be applicable to more places than just Memo. But I guess there are very good reasons why you want to list all errors in some cases, and those cases may also want to use memoization.

  3. You could decide to return a lazy sequence of exceptions, so that the consumer decides how many exceptions to print/display (e.g. Exception.t Seq.t). This is sort of a "dumb" version of (1), where sharing is implicit in the lazy thunks inside, say, Seq.append calls. It makes it much easier to preserve sharing (sharing is never destroyed on the parts of the sequence that are not forced), but it may interact badly with the imperative soup in the fiber scheduler etc. (Is it correct to lazily map the function that adds backtrace information, if this backtrace is maintained in mutable references with backtracking? Also a problem for 1.)

  4. You could decide that when f is a memoized function, you are allowed to share errors coming from identical sub-calls, breaking the correspondence with the non-memoized version. Instead of ('a, Exn_with_bt.t list) result, the type 'a Value.t of memoized-function results would turn into ('a, Exn_with_bt.t list Id.Map.t) result, mapping computation-node ids (of the current call, or a subcall) to the exceptions raised directly during this node's computation (not by a subnode). Looks like the simplest approach (a sketch of this representation also appears after this list).
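For the record, here is a rough sketch of the data shape behind approach (1), with made-up names (this is not dune's actual Exception or Memo API). The point is that the tree stays small only as long as the physical sharing survives, and a perfectly innocent-looking map destroys it:

```ocaml
(* Hypothetical shared exception tree for approach (1); not dune's actual API. *)
type exn_tree =
  | Empty
  | Leaf of exn
  | Append of exn_tree * exn_tree

(* A tree with 2^n logical leaves but O(n) memory: each level reuses the
   same child twice, relying on physical sharing. *)
let rec blowup n =
  if n = 0 then Leaf Exit
  else
    let t = blowup (n - 1) in
    Append (t, t)

(* A naive map rebuilds both children and thus silently loses the sharing:
   mapping over [blowup n] takes exponential time and space again. *)
let rec map f = function
  | Empty -> Empty
  | Leaf e -> Leaf (f e)
  | Append (a, b) -> Append (map f a, map f b)
```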
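And a similarly hypothetical sketch of approach (4); the names only echo the ones mentioned above and this is not an actual patch. Because errors are keyed by the id of the node that raised them, merging the maps of two dependencies that share a sub-computation keeps that node's errors once instead of once per dependency path:

```ocaml
(* Hypothetical shape of approach (4); not the actual dune patch. *)
module Id = Int                 (* stand-in for a computation-node id *)
module Id_map = Map.Make (Id)

(* Instead of ('a, exn list) result, errors are keyed by the node that
   raised them directly. *)
type 'a value = ('a, exn list Id_map.t) result

(* A node's own errors form a single binding under its id. *)
let own_errors (id : Id.t) (exns : exn list) : exn list Id_map.t =
  Id_map.singleton id exns

(* Merging the error maps of two dependencies: a shared sub-computation
   appears under the same id in both maps, so its errors are kept once
   rather than once per dependency path. *)
let combine_errors a b = Id_map.union (fun _id errs _same -> Some errs) a b
```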

(Note: I'm not trying to say that I want to work on this. It's just that I thought about this problem out of curiosity and, before I bail out and move to something else, I thought I could dump my notes in case they help.)

rgrinberg commented 3 years ago

I found the original discussion about fixing this bug. Andrey's suggestion is similar to Gabriel's 4th point.

rgrinberg commented 3 years ago

#4106 tries to fix this.

aalekseyev commented 3 years ago

@gasche, thanks for those notes, they are very comprehensive. Indeed, by choosing (4) we're sacrificing the equivalence between memoized and non-memoized computations with regard to the number of exceptions.

The reasons I think that's OK are:

  1. Pragmatically, that's what the users want the build system to do: if there's e.g. a compilation error, it should be reported once instead of once per dependency path that leads to this computation being required. (The reality is a bit more complicated because we have a separate deduplication mechanism at play.)
  2. One can think of exceptions as side-effects, and memoization will necessarily deduplicate side-effects (a small illustration follows below). We can then recover predictable semantics by requiring that all side-effects be idempotent, so that deduplication is not observable. We can think of building as an idempotent effect (building a file again changes nothing, as long as build rules are deterministic), and we can equally consider computations up to exception deduplication, which makes them idempotent as well. I guess what I'm trying to say is that we already accept this deduplication for building effects, so continuing the same pattern for exceptions seems reasonable.
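To illustrate point 2 with something ordinary (a plain Hashtbl memoizer rather than anything from dune, and a print standing in for the effect): the side-effect happens once per distinct argument instead of once per call, and this is only unobservable because the effect is idempotent.

```ocaml
(* Generic Hashtbl-based memoizer; a print stands in for the side-effect. *)
let memo (f : int -> int) : int -> int =
  let cache = Hashtbl.create 16 in
  fun n ->
    match Hashtbl.find_opt cache n with
    | Some r -> r
    | None ->
      let r = f n in
      Hashtbl.add cache n r;
      r

let build =
  memo (fun n ->
      Printf.printf "building %d\n" n; (* the (idempotent) side-effect *)
      n * n)

let () =
  ignore (build 3);
  ignore (build 3) (* "building 3" is printed only once: the effect was deduplicated *)
```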