risingwavelabs / risingwave


enhancement(dynamic-filter): in-memory cache of outer side #10744

Open fuyufjh opened 1 year ago

fuyufjh commented 1 year ago

Is your feature request related to a problem? Please describe.

See https://github.com/risingwavelabs/rfcs/blob/main/rfcs/0033-dynamic-filter.md

Currently, the outer-side rows are not cached, which results in a table scan on every barrier.

As we discussed today, once the time to process a barrier exceeds the barrier interval, the whole streaming graph becomes completely filled with barriers and no actual data can be processed. Without caching, this can easily happen.

Describe the solution you'd like

Add a cache for outer-side rows.

A very primitive idea is to distinguish the case of a monotonically increasing variable (such as NOW() from NowExecutor) from the general case. For monotonically increasing variables, only values larger than the current one need to be cached. Otherwise, values around the current one need to be cached, because the value can go either up or down.

The caching policy seems complicated and may need an additional RFC.
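To make the monotonic case concrete, here is a minimal Rust sketch. It is not the actual executor code; `MonotonicOuterCache`, the integer key/row types, and the method names are all hypothetical. The idea is to keep only outer-side rows whose filter-column value is still above the last seen inner value, and to evict (and retract downstream) the rows that cross the threshold whenever the inner value advances, instead of rescanning the table on every barrier.

```rust
use std::collections::BTreeMap;

/// Hypothetical cache for a dynamic filter whose inner-side value
/// (e.g. NOW()) only ever increases.
struct MonotonicOuterCache {
    /// Outer-side rows keyed by the filter column; only keys still above the
    /// last seen inner value are retained.
    rows: BTreeMap<i64, Vec<i64>>, // key: filter column, value: row payloads
    last_inner: Option<i64>,
}

impl MonotonicOuterCache {
    fn new() -> Self {
        Self { rows: BTreeMap::new(), last_inner: None }
    }

    /// Insert an outer-side row, but only if it can still pass a future
    /// filter value (i.e. it is not already at or below the last inner value).
    fn insert(&mut self, key: i64, row: i64) {
        if self.last_inner.map_or(true, |v| key > v) {
            self.rows.entry(key).or_default().push(row);
        }
    }

    /// On a new (larger) inner value, drain the rows that now fall at or
    /// below the threshold; because the inner value is monotonic, they can
    /// never pass the filter again. (Ignores i64::MAX overflow; this is a
    /// sketch.)
    fn advance_inner(&mut self, new_inner: i64) -> Vec<i64> {
        debug_assert!(self.last_inner.map_or(true, |v| new_inner >= v));
        self.last_inner = Some(new_inner);
        let keep = self.rows.split_off(&(new_inner + 1)); // keys > new_inner
        let evicted = std::mem::replace(&mut self.rows, keep);
        evicted.into_values().flatten().collect()
    }
}
```

With this policy, the per-barrier table scan is replaced by draining the evicted entries, at the cost of holding the currently passing rows (or some bounded subset of them) in memory.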

Describe alternatives you've considered

No response

Additional context

No response

kwannoel commented 12 months ago

To summarize some internal/offline discussion: there are 3 levels of fixes, from simple to more complex, with increasing generality:

  1. We can have a single-row cache for monotonically increasing dynamic filters. See https://github.com/risingwavelabs/risingwave/pull/10895 for more details.
  2. Suggested by @hzxa21; a generalization of 1. Caching the "top" M rows < prev_outer_value and the "bottom" N rows > prev_outer_value sounds like a generic approach (see the sketch after this list). As a simple workaround, we can start with M=0 and N=1. This doesn't prevent further optimization and also doesn't require special logic for the timestamp data type.
  3. (Caveat: requires lots of work and investigation.) A general cache for range scans, implemented at the storage layer: https://db.cs.cmu.edu/papers/2018/mod601-zhangA-hm.pdf. Originally suggested by @chenzl25.
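A minimal sketch of the band cache from point 2, assuming an integer filter column. `BandCache`, `refill`, and `covers` are illustrative names, not RisingWave APIs, and the scan input is a stand-in for reading the outer-side state table.

```rust
use std::collections::BTreeMap;

/// Sketch of the "top M below / bottom N above" policy: keep the M largest
/// outer keys below the previous inner value and the N smallest outer keys
/// above it, so small movements of the inner value stay inside the cached
/// band and need no table scan. M=0, N=1 degrades to the single-row cache
/// of point 1.
struct BandCache {
    m: usize,
    n: usize,
    below: BTreeMap<i64, i64>, // the M largest outer keys < prev inner value
    above: BTreeMap<i64, i64>, // the N smallest outer keys > prev inner value
}

impl BandCache {
    fn new(m: usize, n: usize) -> Self {
        Self { m, n, below: BTreeMap::new(), above: BTreeMap::new() }
    }

    /// Rebuild the band around `inner` from a scan over the outer-side rows.
    /// (How a boundary key equal to `inner` is treated depends on whether the
    /// actual comparator is strict; it is skipped here for simplicity.)
    fn refill(&mut self, inner: i64, scan: impl Iterator<Item = (i64, i64)>) {
        self.below.clear();
        self.above.clear();
        for (k, row) in scan {
            if k < inner {
                self.below.insert(k, row);
                if self.below.len() > self.m {
                    self.below.pop_first(); // keep only the M largest below
                }
            } else if k > inner {
                self.above.insert(k, row);
                if self.above.len() > self.n {
                    self.above.pop_last(); // keep only the N smallest above
                }
            }
        }
    }

    /// Whether a move of the inner value from `old_inner` to `new_inner`
    /// stays inside the cached band, i.e. every row whose filter result flips
    /// is already cached and no table scan is needed. An empty side is
    /// treated conservatively as a miss.
    fn covers(&self, old_inner: i64, new_inner: i64) -> bool {
        if new_inner >= old_inner {
            self.above.keys().next_back().map_or(false, |&h| new_inner <= h)
        } else {
            self.below.keys().next().map_or(false, |&l| new_inner >= l)
        }
    }
}
```

When `covers` returns false, the executor would fall back to a range scan and refill the band around the new inner value; because each side keeps the rows closest to the previous inner value, any keys between the old and new inner value are guaranteed to be in the cache whenever `covers` returns true.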
fuyufjh commented 10 months ago

Is this completed?

kwannoel commented 10 months ago

> Is this completed?

With WatermarkCache I think the underlying issue is somewhat mitigated: https://github.com/risingwavelabs/risingwave/issues/11320.

Mentioning the slack discussion here for further context: https://risingwave-labs.slack.com/archives/C04NK8HD44R/p1690340513855229?thread_ts=1690331890.873379&cid=C04NK8HD44R.

Still planned, but I haven't worked on it recently.

github-actions[bot] commented 6 days ago

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean. Don't worry if you think the issue is still valuable to continue in the future. It's searchable and can be reopened when it's time. 😄