risingwavelabs / risingwave

expr: avoid repeating the same scalar into an array #9052

Open BugenZhao opened 1 year ago

BugenZhao commented 1 year ago

For example, there's an EXTRACT(HOUR FROM col) in Nexmark Q14, where the HOUR is compiled to a literal VARCHAR expression. When evaluating the EXTRACT, we need to first repeat the same scalar "HOUR" 1024 times into an array, then evaluate the outer EXTRACT function. This is not efficient. https://github.com/risingwavelabs/risingwave/issues/8503#issuecomment-1465808412
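To make the cost concrete, here is a minimal sketch, not RisingWave's actual code. The `Value` enum, `expand`, and the toy `extract` over seconds-since-midnight are all hypothetical names chosen for illustration. It contrasts materializing the literal once per row with passing it along as a single scalar.

```rust
/// Hypothetical stand-in for the result of evaluating a sub-expression
/// over one chunk of rows.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    /// One value shared by every row of the chunk.
    Scalar(String),
    /// One value per row.
    Array(Vec<String>),
}

/// The naive path: allocate one copy of the literal per row.
fn expand(scalar: &str, rows: usize) -> Vec<String> {
    vec![scalar.to_string(); rows]
}

/// Toy EXTRACT over seconds-since-midnight (an assumption of this sketch).
fn extract_one(field: &str, secs: i64) -> Option<i64> {
    match field {
        "HOUR" => Some(secs / 3600),
        "MINUTE" => Some(secs % 3600 / 60),
        _ => None,
    }
}

/// Evaluates EXTRACT without forcing the field argument into an array.
fn extract(field: &Value, col: &[i64]) -> Vec<Option<i64>> {
    match field {
        // Scalar argument: reuse the same string for every row, no allocation.
        Value::Scalar(f) => col.iter().map(|&s| extract_one(f, s)).collect(),
        // Array argument: pair each row with its own field value.
        Value::Array(fs) => fs
            .iter()
            .zip(col)
            .map(|(f, &s)| extract_one(f, s))
            .collect(),
    }
}
```

Both paths produce the same result; the scalar path simply skips the per-row string allocations (1024 of them for a full chunk in the example above).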

Possible solutions:

xxchan commented 1 year ago

> This seems hard to do under the current architecture, as we always use static types for arrays, so introducing a wrapper requires a lot of changes.

Can you elaborate on this? How is "static type" a problem, and how dynamic is ConstantArray/RunArray?

BugenZhao commented 1 year ago

> This seems hard to do under the current architecture, as we always use static types for arrays, so introducing a wrapper requires a lot of changes.

> Can you elaborate on this? How is "static type" a problem, and how dynamic is ConstantArray/RunArray?

For example, arrays in arrow and arrow2 are all trait objects, so they can introduce a RunArray wrapper easily without exposing it to any callers. However, in our type system, we would need to write a lot of things like MaybeRun<Utf8Array> or MaybeRun<ArrayImpl>. 🤔
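The contrast can be sketched roughly as follows. This is an illustration only: the `Array` trait, the struct fields, and both `total_bytes_*` helpers are hypothetical and do not mirror the real arrow or RisingWave definitions.

```rust
/// Trait-object style: callers only ever see `&dyn Array`.
trait Array {
    fn len(&self) -> usize;
    fn get(&self, idx: usize) -> &str;
}

struct Utf8Array(Vec<String>);

/// A run of one repeated value; callers cannot tell it apart from `Utf8Array`.
struct RunArray {
    value: String,
    len: usize,
}

impl Array for Utf8Array {
    fn len(&self) -> usize { self.0.len() }
    fn get(&self, idx: usize) -> &str { &self.0[idx] }
}

impl Array for RunArray {
    fn len(&self) -> usize { self.len }
    // Every index maps to the same value (bounds checks omitted in this sketch).
    fn get(&self, _idx: usize) -> &str { &self.value }
}

/// Signature stays the same whichever implementation is passed in.
fn total_bytes_dyn(arr: &dyn Array) -> usize {
    (0..arr.len()).map(|i| arr.get(i).len()).sum()
}

/// Statically typed style: the wrapper leaks into every signature.
enum MaybeRun<A> {
    Plain(A),
    Run { value: String, len: usize },
}

/// Each caller must name `MaybeRun<Utf8Array>` and handle both cases itself.
fn total_bytes_static(arr: &MaybeRun<Utf8Array>) -> usize {
    match arr {
        MaybeRun::Plain(a) => a.0.iter().map(|s| s.len()).sum(),
        MaybeRun::Run { value, len } => value.len() * len,
    }
}
```

With the trait object, adding `RunArray` touches no caller. With the static wrapper, every function that accepts an array has to change its parameter type and add a match arm.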

xxchan commented 1 year ago

How is the situation now after #9049?

BugenZhao commented 1 year ago

> How is the situation now after #9049?

I guess the ultimate solution would be to allow Value::Scalar to be passed directly among different executors and even remote actors, as described in https://github.com/risingwavelabs/risingwave/pull/9733#issuecomment-1543658669. But yes, it appears that #9049 has accomplished everything we can do without introducing a significant refactor. 😄

xxchan commented 1 year ago

FWIW this looks similar 👀 https://github.com/apache/arrow-rs/issues/1047

kwannoel commented 5 months ago

Wonder if we can further generalize this into some compact encoding for multiple repeated datums. It could potentially optimize join performance, since the datums in the join key don't need to be expanded inline.

kwannoel commented 5 months ago

Specifically, for a high-amplification join, when building the new chunk, the probe side's record only needs to convert its scalar values into constant arrays; we can then concatenate those with the build side's columns to form the new stream chunk.
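A rough sketch of that idea, assuming a hypothetical `Column` enum with a `Constant` variant and `i64`-only columns (neither is RisingWave's real chunk representation):

```rust
/// Hypothetical column type with a compact encoding for a repeated datum.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    /// One value per row.
    Plain(Vec<i64>),
    /// `value` logically repeated `len` times, stored once.
    Constant { value: i64, len: usize },
}

impl Column {
    fn len(&self) -> usize {
        match self {
            Column::Plain(v) => v.len(),
            Column::Constant { len, .. } => *len,
        }
    }

    fn get(&self, idx: usize) -> i64 {
        match self {
            Column::Plain(v) => v[idx],
            // Every row reads the same stored value (no bounds check here).
            Column::Constant { value, .. } => *value,
        }
    }
}

/// Builds the output columns for one probe row joined with all of its
/// matching build-side rows. `build_cols` holds the matched rows column by
/// column and is assumed to have equal-length columns.
fn join_one_probe_row(probe_row: &[i64], build_cols: &[Vec<i64>]) -> Vec<Column> {
    let matches = build_cols.first().map_or(0, |c| c.len());
    // Probe side: one constant column per datum, never expanded inline.
    let mut out: Vec<Column> = probe_row
        .iter()
        .map(|&value| Column::Constant { value, len: matches })
        .collect();
    // Build side: appended as ordinary columns.
    out.extend(build_cols.iter().cloned().map(Column::Plain));
    out
}
```

The probe-side cost is then proportional to the number of probe columns rather than to the number of matched rows.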

BugenZhao commented 5 months ago

> Wonder if we can further generalize this into some compact encoding for multiple repeated datums.

Yes. Are you referring to...

BugenZhao commented 2 months ago

Just FYI: eval_v2, introduced in #9049, is not adopted by all proc-macro-generated function impl any more.