st1page opened this issue 1 year ago
It seems we can construct a workload with no repeated keys to infer the maximum potential improvement in the micro-benchmark, and another workload where every row has the same key to infer the maximum potential regression 🤔
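For illustration, here is a minimal sketch of the two extreme key distributions (plain Rust with made-up helper names, not RisingWave's chunk or key types):

```rust
use std::collections::HashSet;

/// Every row gets a distinct key: the current mask-per-group path degenerates
/// into one `apply_batch` call per row, so this should bound the maximum
/// potential improvement of batching.
fn all_distinct_keys(chunk_size: usize) -> Vec<i64> {
    (0..chunk_size as i64).collect()
}

/// Every row shares one key: the current path already makes a single
/// `apply_batch` call per chunk, so any extra buffering overhead shows up
/// here as the maximum potential regression.
fn all_same_key(chunk_size: usize) -> Vec<i64> {
    vec![42; chunk_size]
}

fn main() {
    let improvement_case = all_distinct_keys(1024);
    let regression_case = all_same_key(1024);
    let improvement_groups: HashSet<i64> = improvement_case.iter().copied().collect();
    let regression_groups: HashSet<i64> = regression_case.iter().copied().collect();
    println!(
        "distinct groups: {} vs {}",
        improvement_groups.len(),
        regression_groups.len()
    );
}
```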
> It seems we can construct a workload with no repeated keys to infer the maximum potential improvement in the micro-benchmark
I think that could be very harsh for RisingWave :hot_face: if the number of distinct groups is very high, the miss rates of our LRU cache and other caches could be high too.
Before this, we must do https://github.com/risingwavelabs/risingwave/issues/9301 and make sure that rebuilding the chunk does not introduce too much overhead.
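As a rough illustration of what "rebuilding the chunk" involves (a sketch with illustrative types, not RisingWave's `StreamChunk` API), compaction is one pass over the rows plus an allocation for the new dense buffer:

```rust
/// Copy only the visible rows into a new dense buffer. Illustrative types,
/// not RisingWave's chunk compaction; the point is that the cost is a full
/// scan of the chunk plus the allocation of the rebuilt buffer.
fn compact_rows<T: Clone>(rows: &[T], visibility: &[bool]) -> Vec<T> {
    rows.iter()
        .zip(visibility.iter())
        .filter_map(|(row, vis)| if *vis { Some(row.clone()) } else { None })
        .collect()
}

fn main() {
    let rows = vec![10, 20, 30, 40];
    let visibility = vec![true, false, true, false];
    assert_eq!(compact_rows(&rows, &visibility), vec![10, 30]);
}
```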
Currently, `StreamHashAgg` constructs a mask on the visibility bitmap to process each group's data:

https://github.com/risingwavelabs/risingwave/blob/8052dec77bc71a0a5895553dcb3022a494d915ad/src/stream/src/executor/hash_agg.rs#L423

This hurts performance, especially when a chunk contains a large number of distinct groups: it regresses our vectorized processing even though we implement the `apply_batch` interface on the aggregators. For example, if the chunk size is 1024 and there are 128 distinct group keys in it, we still need to call the virtual function (`Box<dyn>`) 128 times. Also, some previous analyses showed that the aggregators work better with compacted chunks: https://github.com/risingwavelabs/risingwave/issues/9148, https://github.com/risingwavelabs/risingwave/issues/9150.

We can delay `apply_batch` until `get_output()` is called. In other words, we can batch the same group's data across chunks within the same epoch.

Pros:
- `apply_batch` receives input with higher selectivity, and the number of `apply_batch` calls is reduced.

Cons:
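A minimal sketch of the batching idea under these assumptions; the names (`Aggregator`, `SumAgg`, `BatchedHashAgg`, `apply_chunk`, `flush_and_get_outputs`) are made up for illustration and are not the actual `StreamHashAgg` or aggregator APIs:

```rust
use std::collections::HashMap;

/// Stand-in for a dynamically dispatched aggregator; illustrative only.
trait Aggregator {
    /// Apply a whole batch of input values for one group at once.
    fn apply_batch(&mut self, values: &[i64]);
    fn get_output(&self) -> i64;
}

#[derive(Default)]
struct SumAgg {
    sum: i64,
}

impl Aggregator for SumAgg {
    fn apply_batch(&mut self, values: &[i64]) {
        self.sum += values.iter().sum::<i64>();
    }
    fn get_output(&self) -> i64 {
        self.sum
    }
}

/// Buffers each group's input across the chunks of one epoch and defers the
/// (virtual) `apply_batch` call until output time, so it is issued once per
/// group per epoch instead of once per group per chunk.
#[derive(Default)]
struct BatchedHashAgg {
    buffers: HashMap<i64, Vec<i64>>,
    states: HashMap<i64, Box<dyn Aggregator>>,
}

impl BatchedHashAgg {
    /// Called for every input chunk: rows are only buffered per group key;
    /// no aggregator (virtual) call happens here.
    fn apply_chunk(&mut self, keys: &[i64], values: &[i64]) {
        for (key, value) in keys.iter().zip(values) {
            self.buffers.entry(*key).or_default().push(*value);
        }
    }

    /// Called once per epoch (e.g. on the barrier): flush every group's
    /// buffered rows with a single `apply_batch` call, then emit the outputs.
    fn flush_and_get_outputs(&mut self) -> Vec<(i64, i64)> {
        for (key, buffered) in self.buffers.drain() {
            let state = self
                .states
                .entry(key)
                .or_insert_with(|| Box::new(SumAgg::default()) as Box<dyn Aggregator>);
            state.apply_batch(&buffered);
        }
        self.states
            .iter()
            .map(|(key, state)| (*key, state.get_output()))
            .collect()
    }
}

fn main() {
    let mut agg = BatchedHashAgg::default();
    // Two chunks arriving in the same epoch; group 1 appears in both.
    agg.apply_chunk(&[1, 2, 1], &[10, 20, 30]);
    agg.apply_chunk(&[1, 3], &[5, 7]);
    let mut outputs = agg.flush_and_get_outputs();
    outputs.sort();
    // Group 1 saw values 10, 30 and 5, but only one `apply_batch` call.
    assert_eq!(outputs, vec![(1, 45), (2, 20), (3, 7)]);
}
```

The buffered rows of each group effectively form a fully compacted, high-selectivity batch, and the number of dynamic-dispatch calls drops from one per group per chunk to one per group per epoch.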