rizo / streams-bench

Benchmarks for different streaming models: iterators, generators, sequences, transducers, etc.
https://docs.google.com/spreadsheets/d/1OdlEwwunb4ibhHgkwR0I4cRIgOIOHqXRZTtxtoTd6JE
MIT License

Compare with staged stream fusion (strymonas)? #4

Open gasche opened 2 years ago

gasche commented 2 years ago

Strymonas is a project to implement efficient and predictable stream fusion using BER MetaOCaml -- and the Lightweight Modular Staging approach in Scala. I would expect such an approach to be an interesting baseline against which to compare the ability of other approaches to perform stream fusion. Would you possibly consider measuring the performance of Strymonas on your benchmark example(s)?

(This requires a different opam switch, as BER MetaOCaml is non-standard, and you cannot hope to reimplement that part yourself. So it would not be included in the automatic benchmark results directly, but at least you could keep the sources to run those tests somewhere in the repository, and mention the ballpark results in the README.)

rizo commented 2 years ago

Thanks for suggesting this, I've been meaning to add strymonas to the list for some time now.

I did actually measure the performance of strymonas using ppx_stage (I simply copied the generated code for certain operations). From what I remember, Iter and Streaming.Stream were within 10% of strymonas, which, of course, was the fastest of them all.

Unlike Streaming.Stream, the current version of strymonas does not manage streaming resources, but it is a very good baseline, as you suggested.

I'll try to find some time to add strymonas to the benchmarks and maybe even automate the execution with ppx_stage and dune.

rizo commented 2 years ago

Hi @gasche,

I have added some initial ppx_stage-powered Strymonas benchmarks, and the results are... interesting!

Benchmark setup

First, allow me to present the details of the benchmarking setup:

  1. The code for the ppx_stage generator can be found here: https://github.com/rizo/streams-bench/tree/master/ppx_stage_strymonas.
  2. I created a fully standalone module Standalone.ml to make it easier to reproduce the benchmarks and make modifications.
  3. Finally, there are 4 benchmarks in Standalone.ml:
    • Iter - our well-known push iterator, used as a reference point;
    • Strymonas - I simply inlined the expanded code here for now;
    • Streaming_safe - the current implementation of Streaming.Stream from my streaming library;
    • Streaming_core - a simplified version of the above that does not provide resource safety and uses mutability. My intention with this model was to see how close I could get to Strymonas.

The Strymonas benchmark code is:

let (--) i j =
  Strymonas.unfold (fun x ->
    [%code if [%e x] = [%e j] then None else Some ([%e x], [%e x] + 1)]) i

let strymonas_test () =
  let open Strymonas in
  [%code 0] -- [%code 1024]
  |> map (fun x -> [%code [%e x] + 1])
  |> filter (fun x -> [%code [%e x] mod 3 = 0])
  |> take [%code 512]
  |> flat_map (fun x -> x -- [%code [%e x] + 30])
  |> fold (fun x y -> [%code [%e x] + [%e y]]) [%code 0]
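
For intuition, this staged pipeline should fuse into a single flat loop. Here is a hand-written sketch of the expected shape (not the actual code generated by ppx_stage):

let fused_sketch () =
  let sum = ref 0 in
  let taken = ref 0 in
  let x = ref 0 in
  (* unfold: 0 -- 1024; stop early once [take] has seen 512 elements *)
  while !x <> 1024 && !taken < 512 do
    let y = !x + 1 in                (* map *)
    if y mod 3 = 0 then begin        (* filter *)
      incr taken;                    (* take *)
      for z = y to y + 29 do         (* flat_map: y -- (y + 30) *)
        sum := !sum + z              (* fold *)
      done
    end;
    incr x
  done;
  !sum

Staging guarantees this shape by construction, instead of relying on the optimizer to discover it.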

The Streaming_core model is implemented as:

type +'a t = { fold : 'r. 'r -> ('r -> 'a -> 'r) -> bool ref -> 'r }
(** A push-based fold stream with early termination indicated by the boolean reference. *)
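
To give an idea of how this model works, here is a minimal sketch of a few of its operations (my reconstruction for illustration; the actual implementations live in Standalone.ml):

let unfold seed next =
  { fold = (fun init step stop ->
      let acc = ref init and s = ref seed in
      let continue = ref true in
      (* the producer polls the [stop] flag set by consumers such as [take] *)
      while !continue && not !stop do
        match next !s with
        | None -> continue := false
        | Some (x, s') -> s := s'; acc := step !acc x
      done;
      !acc) }

let map f t =
  { fold = (fun init step stop ->
      t.fold init (fun acc x -> step acc (f x)) stop) }

let take n t =
  { fold = (fun init step stop ->
      let i = ref 0 in
      t.fold init
        (fun acc x ->
          incr i;
          (* request early termination once [n] elements were pushed *)
          if !i >= n then stop := true;
          step acc x)
        stop) }

let fold step init t = t.fold init step (ref false)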

Benchmark results

Now, these are the results of my first benchmark (on an M1 MacBook, OCaml 4.14.0+flambda):

ocamlfind opt \
        -O3 \
        -package benchmark \
        -o ./standalone -linkpkg src/Standalone.ml
Throughputs for "Streaming_core", "Strymonas", "Streaming_safe", "Iter" each running 2 times for at least 3 CPU seconds:
Streaming_core:  3.14 WALL ( 3.14 usr +  0.01 sys =  3.14 CPU) @ 27448.49/s (n=86300)
                 3.15 WALL ( 3.14 usr +  0.01 sys =  3.15 CPU) @ 27413.19/s (n=86300)
     Strymonas:  3.15 WALL ( 3.15 usr +  0.01 sys =  3.15 CPU) @ 42864.15/s (n=135151)
                 3.15 WALL ( 3.15 usr +  0.01 sys =  3.15 CPU) @ 42864.92/s (n=135151)
Streaming_safe:  3.19 WALL ( 3.19 usr +  0.00 sys =  3.19 CPU) @ 17039.39/s (n=54377)
                 3.19 WALL ( 3.18 usr +  0.01 sys =  3.19 CPU) @ 17061.65/s (n=54377)
          Iter:  3.15 WALL ( 3.14 usr +  0.01 sys =  3.15 CPU) @ 19451.36/s (n=61268)
                 3.15 WALL ( 3.14 usr +  0.01 sys =  3.15 CPU) @ 19457.11/s (n=61268)
                  Rate     Streaming_safe        Iter Streaming_core   Strymonas
Streaming_safe 17051+-11/s             --        -12%           -38%        -60%
          Iter 19454+- 3/s            14%          --           -29%        -55%
Streaming_core 27431+-17/s            61%         41%             --        -36%
     Strymonas 42865+- 0/s           151%        120%            56%          --

As expected, Strymonas is really fast; Streaming_core is 36% slower (and, of course, all of these models are much faster than the other models from the main benchmark that are not included here).

I suspected that Streaming_core wasn't being inlined as aggressively as Strymonas, so I decided to test a few things. One of them worked: adding an explicit [@inline] attribute at the call site of flat_map, as in ... |> flat_map (fun [@inline] ...).
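
With the annotation, the Streaming_core pipeline reads roughly like this (the unstaged analogue of the Strymonas test above; a sketch, not a verbatim copy of Standalone.ml):

let streaming_core_test () =
  let open Streaming_core in
  0 -- 1024
  |> map (fun x -> x + 1)
  |> filter (fun x -> x mod 3 = 0)
  |> take 512
  |> flat_map (fun [@inline] x -> x -- (x + 30))
  |> fold (fun x y -> x + y) 0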

With this change the results are:

ocamlfind opt \
        -O3 \
        -package benchmark \
        -o ./standalone -linkpkg src/Standalone.ml
Throughputs for "Streaming_core", "Strymonas", "Streaming_safe", "Iter" each running 2 times for at least 3 CPU seconds:
Streaming_core:  3.14 WALL ( 3.13 usr +  0.01 sys =  3.14 CPU) @ 42759.39/s (n=134330)
                 3.12 WALL ( 3.11 usr +  0.01 sys =  3.12 CPU) @ 43049.95/s (n=134330)
     Strymonas:  3.15 WALL ( 3.15 usr +  0.01 sys =  3.15 CPU) @ 42829.43/s (n=134920)
                 3.15 WALL ( 3.14 usr +  0.01 sys =  3.15 CPU) @ 42826.70/s (n=134920)
Streaming_safe:  3.17 WALL ( 3.16 usr +  0.01 sys =  3.17 CPU) @ 16589.23/s (n=52535)
                 3.17 WALL ( 3.16 usr +  0.01 sys =  3.17 CPU) @ 16579.33/s (n=52535)
          Iter:  3.15 WALL ( 3.14 usr +  0.01 sys =  3.15 CPU) @ 19475.23/s (n=61345)
                 3.15 WALL ( 3.15 usr +  0.01 sys =  3.15 CPU) @ 19469.43/s (n=61345)
                  Rate      Streaming_safe        Iter  Strymonas Streaming_core
Streaming_safe 16584+-  5/s             --        -15%       -61%           -61%
          Iter 19472+-  3/s            17%          --       -55%           -55%
     Strymonas 42828+-  1/s           158%        120%         --          [-0%]
Streaming_core 42905+-138/s           159%        120%       [0%]             --

Streaming_core is now on par with Strymonas! I am genuinely surprised, but I can't really see anything wrong with the benchmark.

Benchmark analysis

I do understand that adding an [@inline] attribute at the call site is somewhat cheating, but in all fairness that's exactly what Strymonas does! And I assume that, in a realistic scenario, this approach wouldn't be able to inline calls to external functions unless they are explicitly staged.

I also tried applying the attribute at the implementation site of flat_map, as in (f [@inlined]) x, but flambda warned me with:

File "src/Standalone.ml", line 31, characters 20-36:
31 |         let inner = (f [@inlined]) x in
                         ^^^^^^^^^^^^^^^^
Warning 55 [inlining-impossible]: Cannot inline: [@inlined] attribute was not used on this function application (the optimizer did not know what function was being applied)

This makes sense, because it cannot do this in the general case, but the code actually runs equally fast (!), so I assume it did inline the call after all?
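
For context, the relevant fragment of flat_map looks roughly like this (a sketch reconstructed around the warning above, not the exact repository code):

let flat_map f t =
  { fold = (fun init step stop ->
      t.fold init
        (fun acc x ->
          (* annotate the application itself; flambda warns because it does
             not statically know which function [f] is at this point *)
          let inner = (f [@inlined]) x in
          inner.fold acc step stop)
        stop) }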

In any case, if you have a chance to review this, I'd love to know your thoughts on these results.

rizo commented 2 years ago

CC-ing @c-cube and @Drup who might find these results interesting.

gasche commented 2 years ago

What we learn from this benchmark, I think, is that if you define all your streaming functions [@inline], then Flambda does a good job of inlining everything, provided you also mark some user-defined functions [@inline] at the call site.
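
Concretely, I mean definitions along these lines (a sketch):

let[@inline] map f t =
  { fold = (fun init step stop ->
      t.fold init (fun acc x -> step acc (f x)) stop) }

plus, as in your experiment, a (fun [@inline] ...) annotation on some user-provided closures at the call site.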

The fact that Strymonas is not faster than library versions that inline everything is not a surprise: it contains no magic low-level optimizations; what it does is reliably inline everything. Library-level inlining (without staging) tends to be fairly fragile.

This raises the following questions:

  • Does this success of "managing to inline everything" (at reasonable cost in compilation time etc.) for the Stream_core version scale to larger pipelines?
  • Would it be possible for the optimizer to let us tag higher-order function arguments with [@inline], instead of forcing this decision on the caller?

I think that (in a world where BER MetaOCaml is upstreamed) it would be reasonable to ask users to put staging annotations in their code, and too fragile to ask them to put inlining annotations in their code (as you did in your experiment here).

The [@inline] approach is a first step down the route of GHC's rewrite pragmas. We know it can go surprisingly far, but we also know that it never ends up very satisfying, because of the fragility (and compile times, etc.).

rizo commented 2 years ago

I have updated the benchmarks to include Strymonas and the new Streaming_core (aka Push_fold_ref) model I implemented as part of this test. Detailed results are at https://docs.google.com/spreadsheets/d/1OdlEwwunb4ibhHgkwR0I4cRIgOIOHqXRZTtxtoTd6JE

(benchmark results chart: result-1660151746)

I will run the benchmarks on an AMD CPU soon too.

Could you try reproducing the results, if you have time?

rizo commented 2 years ago

Regarding your questions, @gasche:

does this success of "managing to inline everything" (at reasonable cost in compilation time etc.) for the Stream_core version scale to larger pipelines?

Do you have any suggestions on concrete tests I could implement? I have been using streaming for some event-processing problems, but unfortunately the source code is not open.

would it be possible for the optimizer to let us tag higher-order function arguments with @inline, instead of forcing this decision on the caller?

Actually, as I tried to explain in my initial comment, adding the [@inlined] attribute to the higher-order function application at the implementation site did work, even though flambda produced a warning. I'm not sure if this should be considered a bug, or if the warning simply alerts about the higher likelihood of "fragility" of this inlining.

rizo commented 2 years ago

I find it incredible that no other model apart from Push_fold_ref (originally Streaming_core) benefits from [@inline] at the call site.

rizo commented 2 years ago

I ran the benchmark on an AMD Ryzen 7 2700X (3.7 GHz) CPU. The results are overall comparable; in some instances Strymonas is slightly faster than Push_fold_ref. The absolute throughput is slightly lower compared to the M1.

(benchmark results chart)

gasche commented 2 years ago

Do you have any suggestions on concrete tests I could implement?

Yes, but my answer may well be beyond the amount of effort one should go to for fun (and for free): write a stream pipeline generator that can generate pipelines of arbitrary sizes, and plot the compilation/startup time and throughput of Strymonas and your other best solutions as the size increases.

It's not obvious whether one should only consider linear pipelines (f1 |> f2 |> f3 |> f4 |> ...) or trees (let t1 x = x |> f1 |> f2 in let t2 x = x |> f3 |> f4 in let t3 x = x |> t1 |> t2 in ...), and how to choose the base operations -- a mix of size-preserving transformations and of irregular transformations like filter or flat_map. It's also not obvious how to generate well-typed programs using flat_map in this scenario. (I would start with the simplest approach, which is to generate linear pipelines and avoid flat_map: basically one unfold, a sequence of maps with some filters thrown in, then one fold.)
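
A minimal sketch of such a generator, emitting OCaml source for a linear pipeline in this simplest setting (all names illustrative):

let generate_pipeline n =
  let buf = Buffer.create 256 in
  Buffer.add_string buf
    "let pipeline () =\n  unfold 0 (fun x -> if x = 1024 then None else Some (x, x + 1))\n";
  for i = 1 to n do
    (* alternate size-preserving maps with some filters thrown in *)
    if i mod 3 = 0 then
      Buffer.add_string buf "  |> filter (fun x -> x mod 3 = 0)\n"
    else
      Printf.bprintf buf "  |> map (fun x -> x + %d)\n" i
  done;
  Buffer.add_string buf "  |> fold (fun acc x -> acc + x) 0\n";
  Buffer.contents buf

One would then compile the generated file for increasing values of n, and plot compilation time and throughput.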

Drup commented 2 years ago

Thanks @rizo that's very interesting.

@gasche There's an alternative solution, which is to annotate functions for partial evaluation (in the style of AnyDSL). It's probably a middle point between proper staging's control and [@inline]'s ease of use. It would work well here, since you could annotate the first arrow of flat_map itself to be partially evaluated.