hydro-project / hydroflow

Hydro's low-level dataflow runtime
https://hydro.run/docs/hydroflow/
Apache License 2.0
478 stars 35 forks source link

Performance testing for mapping into/out of joins #185

Open jhellerstein opened 2 years ago

jhellerstein commented 2 years ago

Is there an expense to the mapping once the compiler has done its magic, relative to tuples that "happen to be" set up right and don't need maps? Would a closure for "key access" on each input help the compiler more than mapping? And/or should we have some fast-path that makes the "relational joins on relational data" go fast?

MingweiSamuel commented 1 year ago

Ex:

Cost of putting things into tuples Cost of selecting Nth value as the join key (rather than only the first) New join_by operator?

MingweiSamuel commented 1 year ago

Can maybe tell just by the current flame graphs if this is causing any performance overhead? @zzlk

zzlk commented 1 year ago

I took a quick initial swing at this and from what I can tell, to my surprise, the behavior of the mapped versions do indeed perform differently. I was not expecting this. I was expecting the mapped ones, as long as the data is the same, to compile down to exactly the same code, so no difference should be detectable. I did try to create some scenarios in https://godbolt.org/ but it is hard to coax the compiler into doing what you want, and then interpreting the resulting assembly is harder still. This would normally be the approach I would take with C/C++ but for rust it didn't work out. So instead I went with the benchmarking approach.

The benchmark I made was to feed data into a join, and time how long it takes to do the whole join. The data that is generated is either fitting the join or needs a map, these two options are compared against each other. A similar thing is done for the output side, giving 4 total combinations [(nomap, nomap), (map, nomap), (nomap, map), (map, map)]. I also added one more test case where the output map outputs a larger tuple than what is inputted, it duplicates one of the tuple elements. This is a control, it should be slower, since it is copying memory/outputting more data.

I ran the benchmarks with

sudo nice -n-20 taskset 0x4 sudo -u stan cargo bench -p benches -- mapjoinperf --output-format bencher --measurement-time 10

making sure to disable the intel turbo boost feature on my cpu otherwise it's hard to get sensible results:

echo "1" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

and I got these results:

2023-03-15_23:30:18

So, likely need to spend a bit more time on this, make the key-access operator and see how it performs in comparison to map in/map out, as well as try to understand the above benchmark, why adding a map() can sometimes make it faster.

jhellerstein commented 1 year ago

Hypothesis is that LTO makes this irrelevant; need to confirm.