martint opened this issue 2 years ago
Optimized storage of GROUP BY intermediates for fixed-size types to improve memory locality and avoid multiple indirections.
There is already https://github.com/trinodb/trino/pull/10706, but IMO we should focus on variable-length types, since multi-channel aggregations often use varchars.
cc @lukasz-stec
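To illustrate the idea behind flat storage of fixed-size intermediates: instead of allocating a heap object per group (and chasing a pointer on every update), the accumulator for a fixed-size type can live in a primitive array indexed by group id. The class below is a minimal, hypothetical sketch of that layout for a `SUM(bigint)` state; the names are illustrative and not Trino's actual classes.

```java
import java.util.Arrays;

// Hypothetical sketch: fixed-size aggregation state stored in a flat long[]
// indexed by group id, avoiding one heap object (and one pointer chase) per group.
final class FlatLongSumState {
    private long[] sums = new long[1024]; // one slot per group id

    private void ensureCapacity(int groupId) {
        if (groupId >= sums.length) {
            sums = Arrays.copyOf(sums, Math.max(sums.length * 2, groupId + 1));
        }
    }

    void add(int groupId, long value) {
        ensureCapacity(groupId);
        sums[groupId] += value; // single array access: good locality, no indirection
    }

    long get(int groupId) {
        return sums[groupId];
    }
}
```

Consecutive group ids then touch adjacent memory, which is what improves locality during the aggregation loop. Variable-length types (like varchar) don't fit a fixed slot, which is why they need a different design.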
I think more contributors could get involved in this project if there were some specific open issues.
Also ping us on the project-hummingbird slack channel https://trinodb.slack.com/archives/C04APR44U20
"Megamorphism and virtual dispatch in core loops due to call sites seeing multitude of block types"
I have seen degradation in HashBuilderOperator performance due to this when DictionaryBlocks were pushed through the partitioned exchange (more details: https://github.com/trinodb/trino/issues/15216)
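A minimal sketch of why this hurts: when a hot loop's call site observes many `Block` implementations (plain arrays, dictionary-encoded, run-length-encoded, ...), the JIT treats it as megamorphic, stops inlining the accessor, and pays a virtual dispatch per element. The classes below are simplified stand-ins, not Trino's actual block hierarchy.

```java
// Simplified stand-ins for Trino's block types, to show the dispatch problem.
interface Block {
    long getLong(int position);
}

final class LongArrayBlock implements Block {
    private final long[] values;
    LongArrayBlock(long[] values) { this.values = values; }
    @Override
    public long getLong(int position) { return values[position]; }
}

final class DictionaryBlock implements Block {
    private final Block dictionary;
    private final int[] ids;
    DictionaryBlock(Block dictionary, int[] ids) {
        this.dictionary = dictionary;
        this.ids = ids;
    }
    // one extra indirection per element, on top of the virtual call
    @Override
    public long getLong(int position) { return dictionary.getLong(ids[position]); }
}

final class Sums {
    // If this call site sees LongArrayBlock, DictionaryBlock, RLE blocks, etc.,
    // it goes megamorphic: getLong() is no longer inlined and every element
    // pays a virtual dispatch instead of a direct array load.
    static long sum(Block block, int positionCount) {
        long sum = 0;
        for (int i = 0; i < positionCount; i++) {
            sum += block.getLong(i);
        }
        return sum;
    }
}
```

Common mitigations are type-specialized loop copies (one loop per concrete block type) or unwrapping encoded blocks before the hot loop, so each call site stays monomorphic.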
Suboptimal code generation for complex expressions and required null checks
Is there an issue that clarifies this topic further, so we can follow it? Are we still working on code generation based on the airlift/bytecode library, and on improving it?
Introduce abstractions and batch calling conventions to facilitate the implementation of functions and operators that can leverage SIMD instructions via Java's new Vector API, and, in the future, possibly GPUs via OpenCL or CUDA
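To make "batch calling conventions" concrete: instead of one virtual call per value, the operator is invoked once per batch of primitives, which removes per-element call overhead and hands the JIT (or, later, the Vector API) a straight-line loop it can vectorize. This is a hedged sketch under assumed interface names, not Trino's actual abstraction.

```java
// Hypothetical batch calling convention: one call processes a whole
// batch of values, rather than one virtual call per element.
interface BatchUnaryOperator {
    void apply(long[] input, long[] output, int count);
}

final class Negate implements BatchUnaryOperator {
    @Override
    public void apply(long[] input, long[] output, int count) {
        // tight loop over primitives with no per-element virtual calls:
        // a good candidate for JIT auto-vectorization or an explicit
        // Vector API implementation behind the same interface
        for (int i = 0; i < count; i++) {
            output[i] = -input[i];
        }
    }
}
```

The same interface could later be backed by a `jdk.incubator.vector` implementation (or, speculatively, a GPU kernel) without changing callers, since the batch boundary is where such backends plug in.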
Besides the SIMD instructions, can we consider introducing an operator and expression evaluation framework based on a native JIT engine (such as code generation through LLVM)?
Basically, we have two options.

Option 1: improve performance at the pipeline level. When a physical pipeline of operators is handed over to a Trino worker, we can first rewrite the Trino physical plan into a Substrait-based plan, then compile the Substrait plan into IR code through the LLVM API, execute the generated IR code against the Trino page input to produce results in the Arrow data format, and finally convert the Arrow results back into the Trino page format.

Option 2: improve performance at the expression and operator level. When operator::getOutput() is invoked, forward the request through JNI to a native operator, where the native operator is optimized via generated LLVM IR.
expression evaluation framework based on a native JIT engine (such as code generation through LLVM)
Trino already does that via the JVM's JIT compiler. It produces JVM bytecode that the JVM then turns into native CPU instructions.
Recent PR: https://github.com/trinodb/trino/pull/21465
Trino has had a columnar/vectorized evaluation engine since its inception in 2012. After the initial implementation and optimization, and once we were satisfied with the performance for the majority of the use cases, we focused our efforts in other areas. Although we've made further incremental performance improvements in the past few years, there is still room for further optimization.
We're starting Project Hummingbird with the goal of bringing Trino's columnar/vectorized evaluation engine to the next level. This includes improvements in areas such as filter, projection, aggregation and join evaluation, as well as any other potential improvements in areas we identify along the way. So far, we have the following list:
Tasks