martint opened this issue 2 years ago
Optimized storage of GROUP BY intermediates for fixed-size types to improve memory locality and avoid multiple indirections.
There is already https://github.com/trinodb/trino/pull/10706, but IMO we should focus on variable-length types, since multi-channel aggregations often use varchars.
cc @lukasz-stec
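To illustrate the idea behind flat storage of fixed-size intermediates: instead of allocating a heap object per group (and chasing a pointer on every update), the accumulator for a fixed-size type can live in a primitive array indexed by group id. The class below is a minimal, hypothetical sketch of that layout for a `SUM(bigint)` state; the names are illustrative and not Trino's actual classes.

```java
import java.util.Arrays;

// Hypothetical sketch: fixed-size aggregation state stored in a flat long[]
// indexed by group id, avoiding one heap object (and one pointer chase) per group.
final class FlatLongSumState {
    private long[] sums = new long[1024]; // one slot per group id

    private void ensureCapacity(int groupId) {
        if (groupId >= sums.length) {
            sums = Arrays.copyOf(sums, Math.max(sums.length * 2, groupId + 1));
        }
    }

    void add(int groupId, long value) {
        ensureCapacity(groupId);
        sums[groupId] += value; // single array access: good locality, no indirection
    }

    long get(int groupId) {
        return sums[groupId];
    }
}
```

Consecutive group ids then touch adjacent memory, which is what improves locality during the aggregation loop. Variable-length types (like varchar) don't fit a fixed slot, which is why they need a different design.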
I think more contributors could get involved in this project if there were some specific open issues.
Also ping us on the project-hummingbird slack channel https://trinodb.slack.com/archives/C04APR44U20
"Megamorphism and virtual dispatch in core loops due to call sites seeing multitude of block types"
I have seen degradation in HashBuilderOperator performance due to this when DictionaryBlocks were pushed through the partitioned exchange (more details: https://github.com/trinodb/trino/issues/15216)
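A minimal sketch of why this hurts: when a hot loop's call site observes many `Block` implementations (plain arrays, dictionary-encoded, run-length-encoded, ...), the JIT treats it as megamorphic, stops inlining the accessor, and pays a virtual dispatch per element. The classes below are simplified stand-ins, not Trino's actual block hierarchy.

```java
// Simplified stand-ins for Trino's block types, to show the dispatch problem.
interface Block {
    long getLong(int position);
}

final class LongArrayBlock implements Block {
    private final long[] values;
    LongArrayBlock(long[] values) { this.values = values; }
    @Override
    public long getLong(int position) { return values[position]; }
}

final class DictionaryBlock implements Block {
    private final Block dictionary;
    private final int[] ids;
    DictionaryBlock(Block dictionary, int[] ids) {
        this.dictionary = dictionary;
        this.ids = ids;
    }
    // one extra indirection per element, on top of the virtual call
    @Override
    public long getLong(int position) { return dictionary.getLong(ids[position]); }
}

final class Sums {
    // If this call site sees LongArrayBlock, DictionaryBlock, RLE blocks, etc.,
    // it goes megamorphic: getLong() is no longer inlined and every element
    // pays a virtual dispatch instead of a direct array load.
    static long sum(Block block, int positionCount) {
        long sum = 0;
        for (int i = 0; i < positionCount; i++) {
            sum += block.getLong(i);
        }
        return sum;
    }
}
```

Common mitigations are type-specialized loop copies (one loop per concrete block type) or unwrapping encoded blocks before the hot loop, so each call site stays monomorphic.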
Suboptimal code generation for complex expressions and required null checks
Is there an issue that clarifies this topic further, so we can follow it? Are we still working on code generation based on the airlift/bytecode library, and on improving it?
Introduce abstractions and batch calling conventions to facilitate the implementation of functions and operators that can leverage SIMD instructions via Java's new Vector API, and, in the future, possibly GPUs via OpenCL or CUDA
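To make "batch calling conventions" concrete: instead of one virtual call per value, the operator is invoked once per batch of primitives, which removes per-element call overhead and hands the JIT (or, later, the Vector API) a straight-line loop it can vectorize. This is a hedged sketch under assumed interface names, not Trino's actual abstraction.

```java
// Hypothetical batch calling convention: one call processes a whole
// batch of values, rather than one virtual call per element.
interface BatchUnaryOperator {
    void apply(long[] input, long[] output, int count);
}

final class Negate implements BatchUnaryOperator {
    @Override
    public void apply(long[] input, long[] output, int count) {
        // tight loop over primitives with no per-element virtual calls:
        // a good candidate for JIT auto-vectorization or an explicit
        // Vector API implementation behind the same interface
        for (int i = 0; i < count; i++) {
            output[i] = -input[i];
        }
    }
}
```

The same interface could later be backed by a `jdk.incubator.vector` implementation (or, speculatively, a GPU kernel) without changing callers, since the batch boundary is where such backends plug in.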
Besides the SIMD instructions, can we consider introducing an operator and expression evaluation framework based on a native JIT engine (such as code generation through LLVM)?
Basically, we have two options.

Option 1: improve performance at the pipeline level. When a physical pipeline of operators is handed over to a Trino worker, we can first rewrite the Trino physical plan into a Substrait-based plan, then compile the Substrait plan into IR code through the LLVM API, execute the generated IR code against the Trino page input to produce results in the Arrow data format, and finally convert the Arrow results back into the Trino page format.

Option 2: improve performance at the expression and operator level. When operator::getOutput() is invoked, forward the request through JNI to a native operator, where the native operator is optimized via generated LLVM IR.
expression evaluation framework based on a native JIT engine (such as code generation through LLVM)
Trino already does that via the JVM's JIT compiler. It produces JVM bytecode that the JVM then turns into native CPU instructions.
Recent PR: https://github.com/trinodb/trino/pull/21465
Trino has had a columnar/vectorized evaluation engine since its inception in 2012. After the initial implementation and optimization, and once we were satisfied with the performance for the majority of the use cases, we focused our efforts in other areas. Although we've made further incremental performance improvements in the past few years, there is still room for further optimization.
We're starting Project Hummingbird with the goal of bringing Trino's columnar/vectorized evaluation engine to the next level. This includes improvements in areas such as filter, projection, aggregation and join evaluation, as well as any other potential improvements in areas we identify along the way. So far, we have the following list:
Tasks