gasche opened this issue 2 years ago
Thanks for suggesting this, I've been meaning to add strymonas to the list for some time now.
I did actually measure the performance of strymonas using ppx_stage (I simply copied the generated code for certain operations). From what I remember, Iter and Streaming.Stream were within 10% of strymonas, which, of course, was the fastest of them all.
Unlike Streaming.Stream, the current version of strymonas does not manage streaming resources, but it is a very good baseline, as you suggested.
I'll try to find some time to add strymonas to the benchmarks and maybe even automate the execution with ppx_stage and dune.
Hi @gasche,
I have added some initial ppx_stage-powered Strymonas results and the results are... interesting!
First, allow me to present the details of the benchmarking setup:
- The ppx_stage generator can be found here: https://github.com/rizo/streams-bench/tree/master/ppx_stage_strymonas.
- More specifically, all of the benchmarked code is inlined into a single Standalone.ml to make it easier to reproduce the benchmarks and make modifications.
- The benchmark executable is compiled with ocamlopt as in https://github.com/rizo/streams-bench/blob/master/Makefile#L2.

The models defined in Standalone.ml are:

- Iter - our well-known push iterator, used as a reference point;
- Strymonas - I simply inlined the expanded code here for now;
- Streaming_safe - the current implementation of Streaming.Stream from my streaming library;
- Streaming_core - a simplified version of the above that does not provide resource safety and uses mutability. My intention with this model was to see how close I could get to Strymonas.

The Strymonas benchmark code is:
let (--) i j =
Strymonas.unfold (fun x ->
[%code if [%e x] = [%e j] then None else Some ([%e x], [%e x] + 1)]) i
let strymonas_test () =
let open Strymonas in
[%code 0] -- [%code 1024]
|> map (fun x -> [%code [%e x] + 1])
|> filter (fun x -> [%code [%e x] mod 3 = 0])
|> take [%code 512]
|> flat_map (fun x -> x -- [%code [%e x] + 30])
|> fold (fun x y -> [%code [%e x] + [%e y]]) [%code 0]
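For reference, here is a hand-written, hypothetical equivalent of the flat loop that stream fusion should produce for this pipeline. This is not the actual ppx_stage output (which lives in Standalone.ml); it only illustrates the shape of the fused code:

```ocaml
(* Hypothetical hand-fused equivalent of the pipeline above, for
   illustration only: one outer loop over 0..1023, with the map/filter/take
   logic inlined and the flat_map expanded into an inner loop. *)
let fused_by_hand () =
  let sum = ref 0 in
  let taken = ref 0 in
  (try
     for x = 0 to 1023 do
       let y = x + 1 in                  (* map (fun x -> x + 1) *)
       if y mod 3 = 0 then begin         (* filter (fun x -> x mod 3 = 0) *)
         incr taken;
         if !taken > 512 then raise Exit; (* take 512 *)
         (* flat_map (fun x -> x -- (x + 30)) becomes an inner loop *)
         for z = y to y + 29 do
           sum := !sum + z               (* fold (+) 0 *)
         done
       end
     done
   with Exit -> ());
  !sum
```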
The Streaming_core model is implemented as:
type +'a t = { fold : 'r. 'r -> ('r -> 'a -> 'r) -> bool ref -> 'r }
(** A push-based fold stream with early termination indicated by the boolean reference. *)
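To make the model concrete, here is a minimal sketch of how the combinators used in the benchmark can be written against this type. This is my assumption of the general shape; the actual code in Standalone.ml may differ in details:

```ocaml
(* A minimal sketch, not the exact Standalone.ml code: every combinator
   threads the shared [stop] flag so that [take] can terminate the whole
   pipeline early, including producers upstream of it. *)
let unfold next seed =
  { fold = (fun init step stop ->
      let acc = ref init and state = ref seed in
      let continue = ref true in
      while !continue && not !stop do
        match next !state with
        | None -> continue := false
        | Some (x, state') -> state := state'; acc := step !acc x
      done;
      !acc) }

let map f t =
  { fold = (fun init step stop -> t.fold init (fun r x -> step r (f x)) stop) }

let filter p t =
  { fold = (fun init step stop ->
      t.fold init (fun r x -> if p x then step r x else r) stop) }

let take n t =
  { fold = (fun init step stop ->
      let seen = ref 0 in
      t.fold init
        (fun r x ->
          incr seen;
          if !seen >= n then stop := true;   (* signal upstream to stop *)
          if !seen <= n then step r x else r)
        stop) }

let flat_map f t =
  { fold = (fun init step stop ->
      (* the inner stream shares the same stop flag as the outer one *)
      t.fold init (fun r x -> (f x).fold r step stop) stop) }

let fold step init t = t.fold init step (ref false)
```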
Now, these are the results of my first benchmark (on an M1 MacBook, 4.14.0+flambda):
ocamlfind opt \
-O3 \
-package benchmark \
-o ./standalone -linkpkg src/Standalone.ml
Throughputs for "Streaming_core", "Strymonas", "Streaming_safe", "Iter" each running 2 times for at least 3 CPU seconds:
Streaming_core: 3.14 WALL ( 3.14 usr + 0.01 sys = 3.14 CPU) @ 27448.49/s (n=86300)
3.15 WALL ( 3.14 usr + 0.01 sys = 3.15 CPU) @ 27413.19/s (n=86300)
Strymonas: 3.15 WALL ( 3.15 usr + 0.01 sys = 3.15 CPU) @ 42864.15/s (n=135151)
3.15 WALL ( 3.15 usr + 0.01 sys = 3.15 CPU) @ 42864.92/s (n=135151)
Streaming_safe: 3.19 WALL ( 3.19 usr + 0.00 sys = 3.19 CPU) @ 17039.39/s (n=54377)
3.19 WALL ( 3.18 usr + 0.01 sys = 3.19 CPU) @ 17061.65/s (n=54377)
Iter: 3.15 WALL ( 3.14 usr + 0.01 sys = 3.15 CPU) @ 19451.36/s (n=61268)
3.15 WALL ( 3.14 usr + 0.01 sys = 3.15 CPU) @ 19457.11/s (n=61268)
Rate Streaming_safe Iter Streaming_core Strymonas
Streaming_safe 17051+-11/s -- -12% -38% -60%
Iter 19454+- 3/s 14% -- -29% -55%
Streaming_core 27431+-17/s 61% 41% -- -36%
Strymonas 42865+- 0/s 151% 120% 56% --
As expected, Strymonas is really fast; Streaming_core is 36% slower (and of course, all of these models are much faster than any of the others not included here).
I suspected that Streaming_core wasn't being as aggressively inlined as Strymonas, so I decided to test a few things. One of them worked: adding an explicit [@inline] attribute at the call site of flat_map, as in ... |> flat_map (fun [@inline] ...).
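Concretely, the annotated Streaming_core pipeline looks like this (assuming it mirrors the Strymonas pipeline above; the exact code is in Standalone.ml):

```ocaml
(* Assumes Streaming_core provides unfold/map/filter/take/flat_map/fold
   as in the sketch above; (--) mirrors the Strymonas helper. *)
let ( -- ) i j =
  Streaming_core.unfold (fun x -> if x = j then None else Some (x, x + 1)) i

let streaming_core_test () =
  let open Streaming_core in
  0 -- 1024
  |> map (fun x -> x + 1)
  |> filter (fun x -> x mod 3 = 0)
  |> take 512
  |> flat_map (fun [@inline] x -> x -- (x + 30))  (* the only annotation *)
  |> fold (fun x y -> x + y) 0
```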
With this change the results are:
ocamlfind opt \
-O3 \
-package benchmark \
-o ./standalone -linkpkg src/Standalone.ml
Throughputs for "Streaming_core", "Strymonas", "Streaming_safe", "Iter" each running 2 times for at least 3 CPU seconds:
Streaming_core: 3.14 WALL ( 3.13 usr + 0.01 sys = 3.14 CPU) @ 42759.39/s (n=134330)
3.12 WALL ( 3.11 usr + 0.01 sys = 3.12 CPU) @ 43049.95/s (n=134330)
Strymonas: 3.15 WALL ( 3.15 usr + 0.01 sys = 3.15 CPU) @ 42829.43/s (n=134920)
3.15 WALL ( 3.14 usr + 0.01 sys = 3.15 CPU) @ 42826.70/s (n=134920)
Streaming_safe: 3.17 WALL ( 3.16 usr + 0.01 sys = 3.17 CPU) @ 16589.23/s (n=52535)
3.17 WALL ( 3.16 usr + 0.01 sys = 3.17 CPU) @ 16579.33/s (n=52535)
Iter: 3.15 WALL ( 3.14 usr + 0.01 sys = 3.15 CPU) @ 19475.23/s (n=61345)
3.15 WALL ( 3.15 usr + 0.01 sys = 3.15 CPU) @ 19469.43/s (n=61345)
Rate Streaming_safe Iter Strymonas Streaming_core
Streaming_safe 16584+- 5/s -- -15% -61% -61%
Iter 19472+- 3/s 17% -- -55% -55%
Strymonas 42828+- 1/s 158% 120% -- [-0%]
Streaming_core 42905+-138/s 159% 120% [0%] --
Streaming_core is now on par with Strymonas! I am genuinely surprised, but I can't really see anything wrong with the benchmark.
I do understand that adding an [@inline] attribute at the call site is somewhat cheating, but in all fairness that's exactly what Strymonas does! And I assume that it wouldn't be able to inline calls to external functions in a realistic scenario unless they are explicitly staged.
I did also try applying [@inline] at the implementation site of flat_map, as in (f [@inlined]) x, but flambda warned me with:
File "src/Standalone.ml", line 31, characters 20-36:
31 | let inner = (f [@inlined]) x in
^^^^^^^^^^^^^^^^
Warning 55 [inlining-impossible]: Cannot inline: [@inlined] attribute was not used on this function application (the optimizer did not know what function was being applied)
This makes sense, because flambda cannot know which function is being applied in the general case. Yet the benchmark runs equally fast (!), so I assume the call did get inlined after all?
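For context, the annotation in question sits inside flat_map, roughly like this (a sketch based on the warning's excerpt; see Standalone.ml for the real code):

```ocaml
(* flat_map with the application-site [@inlined] annotation from the
   warning above; [f] is an opaque argument here, which is exactly why
   flambda says it does not know what function is being applied. *)
let flat_map f t =
  { fold = (fun init step stop ->
      t.fold init
        (fun r x ->
          let inner = (f [@inlined]) x in
          inner.fold r step stop)
        stop) }
```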
In any case, if you have a chance to review this, I'd love to know your thoughts on these results.
CC-ing @c-cube and @Drup who might find these results interesting.
What we learn from this benchmark, I think, is that if you define all your streaming functions [@inline], then Flambda does a good job of inlining everything, except that you also need to mark some user-defined functions inline as well.
The fact that Strymonas is not faster than library versions that inline everything is not a surprise: it contains no magic low-level optimization; what it does is reliably inline everything. Library-level inlining (without staging) tends to be fairly fragile.
This raises the following questions:

- does this success of "managing to inline everything" (at reasonable cost in compilation time etc.) for the Stream_core version scale to larger pipelines?
- would it be possible for the optimizer to let us tag higher-order function arguments with @inline, instead of forcing this decision on the caller?

I think that (in a world where BER MetaOCaml is upstreamed) it would be reasonable to ask users to put staging annotations in their code, and too fragile to ask them to put inlining annotations in their code (as you did in your experiment here): @inline is fragile; forget an important one and performance degrades silently, sprinkle too many of them and you pay in build time, code size, etc.

The [@inline] approach is a first step down the route of GHC's rewrite pragmas. We know it can go surprisingly far, but we also know that it never ends up very satisfying because of the fragility (and compile times, etc.).
I have updated the benchmarks to include Strymonas and the new Streaming_core (aka Push_fold_ref) model I implemented as part of this test. Detailed results are at https://docs.google.com/spreadsheets/d/1OdlEwwunb4ibhHgkwR0I4cRIgOIOHqXRZTtxtoTd6JE
I will run the benchmarks on an AMD CPU soon too. Could you try reproducing the results, if you have time?
Regarding your questions, @gasche:
> does this success of "managing to inline everything" (at reasonable cost in compilation time etc.) for the Stream_core version scale to larger pipelines?
Do you have any suggestions on concrete tests I could implement? I have been using streaming for some event-processing problems, but unfortunately the source code is not open.
> would it be possible for the optimizer to let us tag higher-order function arguments with @inline, instead of forcing this decision on the caller?
Actually, as I tried to explain in my initial comment, adding the @inline attribute to the higher-order function did work, even though flambda produced a warning. I'm not sure if this should be considered a bug, or if the warning simply alerts about the higher likelihood of "fragility" of this inlining.
I find it incredible that no other model apart from Push_fold_ref (originally Streaming_core) benefits from [@inline] at the caller site.
I ran the benchmark on an AMD Ryzen 7 2700X, 3.7 GHz CPU. The results are overall comparable; in some instances Strymonas is slightly faster than Push_fold_ref. The absolute throughput is slightly lower compared to the M1.
> Do you have any suggestions on concrete tests I could implement?
Yes, but my answer may well be beyond the amount of effort one should spend for fun (and for free): write a stream pipeline generator that can generate pipelines of arbitrary sizes, and plot the compilation/startup time and throughput of Strymonas and your other best solutions as the size increases.
It's not obvious whether one should only consider linear pipelines (f1 |> f2 |> f3 |> f4 |> ...) or trees (let t1 x = x |> f1 |> f2 in let t2 x = x |> f3 |> f4 in let t3 x = x |> t1 |> t2 in ...), and how to choose the base operations -- a mix of size-preserving transformations and of irregular transformations like filter or flat_map. It's also not obvious how to generate well-typed programs using flat_map in this scenario. (I would start with the simplest approach, which is to generate linear pipelines and avoid flat_map: basically one unfold, a sequence of maps with some filters thrown in, then one fold.)
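To make the suggestion concrete, here is a minimal sketch of such a generator under the "simplest approach" above. All names are hypothetical; it just prints OCaml source for a linear pipeline of a given length, to be compiled and timed separately:

```ocaml
(* Hypothetical generator for linear pipelines of arbitrary size:
   one unfold, a sequence of maps with some filters thrown in, one fold.
   The emitted code targets whichever stream library is being measured. *)
let generate_pipeline n =
  let buf = Buffer.create 256 in
  Buffer.add_string buf "let test () =\n";
  Buffer.add_string buf
    "  unfold (fun x -> if x = 1_000_000 then None else Some (x, x + 1)) 0\n";
  for i = 1 to n do
    if i mod 3 = 0 then
      (* every third stage is an irregular transformation *)
      Buffer.add_string buf
        (Printf.sprintf "  |> filter (fun x -> x mod %d <> 0)\n" (i + 1))
    else
      Buffer.add_string buf
        (Printf.sprintf "  |> map (fun x -> x + %d)\n" i)
  done;
  Buffer.add_string buf "  |> fold (fun x y -> x + y) 0\n";
  Buffer.contents buf

(* Example: print a 10-stage pipeline to stdout. *)
let () = print_string (generate_pipeline 10)
```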
Thanks @rizo that's very interesting.
@gasche There's an alternative solution, which is to annotate functions for partial evaluation (in the style of AnyDSL). It's probably a middle point between proper staging's control and [@inline]'s ease of use. It would work well here, since you could annotate the first arrow of fold_map itself to be partially evaluated.
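Purely as an illustration of this idea (the syntax below is invented; OCaml and flambda have no such annotation), the signature itself would declare which functional argument must be specialized away:

```ocaml
(* Invented syntax, not valid OCaml: the annotation on the arrow would say
   "specialize this function at every call site where the argument is
   statically known", in the spirit of AnyDSL's partial-evaluation filters,
   instead of asking each caller to write [@inline]. *)
val flat_map : (('a -> 'b t) [@specialize]) -> 'a t -> 'b t
```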
Strymonas is a project to implement efficient and predictable stream fusion using BER MetaOCaml (and, in Scala, the Lightweight Modular Staging approach). I would expect such an approach to be an interesting baseline against which to compare the ability of other approaches to perform stream fusion. Would you possibly consider measuring the performance of Strymonas on your benchmark example(s)?
(This requires a different opam switch, as BER MetaOCaml is non-standard, and you cannot hope to reimplement that part yourself. So that would not be included in the automatic benchmark results directly, but at least you could have the sources to run those tests somewhere in the repository, and mention the ballpark results in the README.)