seznam / euphoria

Euphoria is an open source Java API for creating unified big-data processing flows. It provides an engine independent programming model which can express both batch and stream transformations.
Apache License 2.0

STORY: Performance tuning #11

Open vanekjar opened 7 years ago

vanekjar commented 7 years ago

Goal

According to our observations and measurements, the Euphoria API layer appears to add significant performance overhead compared to jobs written natively in Apache Flink or Apache Spark.

Some overhead will always remain, because the Euphoria API adds another layer of abstraction with its own additional data structures. The goal of this issue is to reduce that overhead as much as possible.

Approximate measurements show the overhead may be as high as tens of percent. The following charts give more details on the performance comparison:

[Figure: batch-chart]

[Figure: stream-chart]

je-ik commented 7 years ago

Just a suggestion: it would seem reasonable to split the performance comparison into two types of jobs, CPU-bound and IO-bound.

vanekjar commented 7 years ago

Over the past months we have worked on the performance tuning of all Euphoria executors and have achieved a noticeable improvement compared to the initial state.

I would like to sum up the current state of the performance optimizations. Performance was measured using the benchmark apps from the Benchmark section of the wiki.

[Figure: batch3]

[Figure: stream3]

je-ik commented 7 years ago

This looks good! :+1: Is the first figure the state before the optimizations started? How does the Spark executor perform after the optimizations compared to raw Spark? Is there any room for further optimization?

vanekjar commented 7 years ago

The newly uploaded figures describe the current state. The initial state can be seen above. It seems we couldn't improve the Spark executor very much. Details about the proposed solutions can be found in #12. The major improvements happened in both Flink executors.

Also, the current overhead is quite acceptable for our purposes: real-world applications are far more complex, so the fixed overhead of the additional data structures shuffled with each data element becomes insignificant relative to the actual work done per element.
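
To make the last point concrete, here is a minimal sketch of the arithmetic. The class name, method name, and all numbers are made up for illustration; they are not measurements from the benchmarks.

```java
// Illustrative only: the costs below are hypothetical, not measured values.
final class OverheadEstimate {

    // Fixed per-element cost of the extra layer, relative to the useful work per element.
    static double relativeOverhead(double fixedCostMicros, double workMicros) {
        return fixedCostMicros / workMicros;
    }

    public static void main(String[] args) {
        double fixedCost = 2.0; // assumed fixed overhead per element, in microseconds

        // Trivial benchmark-style job: very little work per element.
        System.out.println(relativeOverhead(fixedCost, 5.0));

        // More complex job: the same fixed cost is spread over far more work.
        System.out.println(relativeOverhead(fixedCost, 200.0));
    }
}
```

With these assumed numbers the same fixed cost is 40% of a cheap per-element operation but only 1% of an expensive one, which is why simple benchmarks tend to show the overhead at its worst.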

je-ik commented 7 years ago

My bad, I somehow ignored the descriptions of the figures. :) Does the overhead of the Flink batch executor come from the (de)serialization-during-sorting issue? Anyway, these improvements are really promising. I'm a bit confused about the performance of batch Flink compared to Spark, but that is a different story.

vanekjar commented 7 years ago

I have actually resolved the slow sorting issue with the recent PR #112. It helped a lot. But the major problem with Flink remains: in Euphoria we don't know the type information of the shuffled data.

Flink operates heavily on the binary form of data, unlike the Spark RDD API. This seems to be the main difference. If you look at typical Flink code, it is all about juggling binary fields directly: no lambdas, no deserialization. This gives native Flink a significant performance advantage over Euphoria, where the type information is lost during translation.

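// Typical native Flink code: the element type is fully known to Flink.
DataSet<Tuple2<String, Integer>> input;

// Group by tuple field 0 and sum tuple field 1, addressing fields by position.
input.groupBy(0)
     .sum(1);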
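
To illustrate why known field types matter, here is a plain-Java sketch (not Flink's or Euphoria's actual implementation) that compares two serialized records by their key in two ways: directly on the bytes, and by first deserializing the key into an object. The record layout, class name, and method names are assumptions made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative only: a hypothetical record layout, not Flink's serialization format.
final class BinaryKeyDemo {

    // Layout: [int key length][key bytes, UTF-8][int value]
    static byte[] serialize(String key, int value) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + k.length + 4)
                .putInt(k.length).put(k).putInt(value).array();
    }

    // Layout known: compare the key bytes in place, no objects are created.
    static int compareKeysBinary(byte[] a, byte[] b) {
        int la = ByteBuffer.wrap(a).getInt();
        int lb = ByteBuffer.wrap(b).getInt();
        int n = Math.min(la, lb);
        for (int i = 0; i < n; i++) {
            int x = a[4 + i] & 0xFF; // treat bytes as unsigned
            int y = b[4 + i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return la - lb;
    }

    // Layout opaque to the caller: each comparison deserializes both keys first.
    static int compareKeysDeserialized(byte[] a, byte[] b) {
        return deserializeKey(a).compareTo(deserializeKey(b));
    }

    static String deserializeKey(byte[] record) {
        int len = ByteBuffer.wrap(record).getInt();
        return new String(record, 4, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] r1 = serialize("apple", 1);
        byte[] r2 = serialize("banana", 2);
        // For these ASCII keys both paths agree on the ordering.
        System.out.println(Integer.signum(compareKeysBinary(r1, r2)));
        System.out.println(Integer.signum(compareKeysDeserialized(r1, r2)));
    }
}
```

Both paths order these ASCII keys the same way, but the second one allocates two strings per comparison. During a sort or a grouping over many elements that extra work adds up, which is roughly the cost paid when the engine cannot see the field types.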

xitep commented 7 years ago

while we have achieved a lot, i fear the pure overhead on flink is considerably greater than we have thought so far.

i just happened to run the streaming benchmarks on text-based input, i.e. parsing the input data itself is extremely cheap and the data can be provided very quickly to the consuming downstream operators. i repeatedly got a runtime ratio of 3 minutes to 32 minutes when comparing the native flink version with the euphoria flink version.

xitep commented 7 years ago

regarding my last comment: as usual, the problem was sitting on my chair ;) the reported "3 minutes" runtime was observed while flink ran with a heap storage backend. after fixing this typo and using the rocksdb backend, i'm now back to the figures last reported by @vanekjar. sorry for the panic.

je-ik commented 7 years ago

Good to hear! :+1: