euphoria-flink: data type transparency

Execution engines can do much better at optimizations if they transparently know what types they are working with. Efforts in the Spark as well as the Flink community proof optimization potentials in this regard. As a layer above such execution engines Euphoria must provide type specific information to executors in order to stay relevant (in terms of performance). In this ticket we'll focus on Flink.

Motivation

The primary background for this ticket was triggered by an attempt to shave off some of the overhead mentioned in #13 and #14.

Experiment

Using opaque data types with general purpose serialization has considerable, negative effects on optimizations that Flink tries to apply by default. I was able to see the effect in an experiment, where the "core" operation of the flow is the following (basically just a windowed word-count):

      ReduceByKey
            .of(input)
            .keyBy(Pair::getSecond)
            .valueBy(e -> 1L)
            .combineBy(Sums.ofLongs())
            .windowBy(Time.of(shortInterval), Pair::getFirst)
            .output();

Changing Euphoria's flink batch executor such that it uses Flink's native and Flink's pojo based serializers for the types involved during the reduce operation, squeezed out about half of the original execution time of the program. The amount of data shuffled was approximately the same. Unfortunately, such an approach required me to explicitly provide the return type of the .keyBy function to Euphoria's FlinkExecutor's internals. I was not able to derive the return type of a lambda in an automatic manner without the user having to explicitly state it.

What to do next

[ ] Find a non-verbose way for clients to provide required type information to the executor translators, i.e. return types of UDFs (we might not really succeed here and might end up with some verbose construct; maybe we can make the verbose constructs and the implied optimizations merely an opt-in?)
[ ] Leverage the types in Euphoria's Flink batch and stream executors.
- Leverage type information for objects used internally in both flink executors to avoid general purpose serialization by kryo when possible.
- Using TupleX instead of Pair and Tripple is a good start.
- For internal types, utilizing POJOs instead of opaque data types is another easy gain.
- Note: it will not help much if we switch over to Tuple but fail to provide type information of the actually used fields.
- Attention: proper delegation of the window types from one operator to another due to attached windowing may require substantial changes to the current executor code.

seznam / euphoria