Optimize case of high join amplification

risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.

https://go.risingwave.com/slack

Apache License 2.0

7.07k stars 582 forks source link

Optimize case of high join amplification #16679

Open kwannoel opened 6 months ago

kwannoel commented 6 months ago

In streaming workload, high join amplification can lead to barrier latency spike.

So far our approaches to mitigate these have been more indirect, e.g. decoupling sink, adaptive rate limit. It can isolate the slowness to the specific stream job.

I think we should also explore optimizing join itself.

[ ] Come up with a high cardinality join scenario, with few columns.
[ ] Come up with a high cardinality join scenario, with many columns.
[ ] Observe the flamegraph to see where it can be optimized further.
[ ] Investigate other systems to see what optimizations can be made in similar situations, e.g. cross join.

github-actions[bot] commented 4 months ago

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean. Don't worry if you think the issue is still valuable to continue in the future. It's searchable and can be reopened when it's time. 😄

kwannoel commented 3 months ago

One thing which @BugenZhao found is the old case where rpc connection can cause deadlock under high join amplification. There's a workaround provided:

kwannoel commented 3 months ago

docs how to address join amplification. Partitioning on non-join-column. (add to tsg)

kwannoel commented 2 months ago

Btw seems like many real-world usecases have inevitable join amplification. So observability alone is insufficient. We need some solution which can enable horizontal scaling to process the join amplification. i.e. when there's high join amp, increasing compute can help to quickly resolve it.

Additionally speeding up the join itself can also greatly help.

1M join amp is a good target to see if our system can reasonably handle it.