risingwavelabs / risingwave

SQL stream processing, analytics, and management. We decouple storage and compute to offer efficient joins, instant failover, dynamic scaling, speedy bootstrapping, and concurrent query serving.
https://www.risingwave.com/slack
Apache License 2.0

Discussion: find a way to let users share states across multiple streaming queries #9892

Open chenzl25 opened 1 year ago

chenzl25 commented 1 year ago

Some users might maintain their streaming queries in this way.

  1. Create some sources.
  2. Create some views on those sources.
  3. Create some sinks or materialized views on those views.

If we create those sinks or materialized views one by one, the result is duplicated state, since a view can be used by more than one of them. Currently, we support sharing state within a single query, but we are unable to share state across multiple streaming queries.
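To make the duplication concrete, here is a minimal toy model in Python. It is only a sketch, not RisingWave's planner: the view, job, and operator names are all hypothetical, and it assumes that a plain view is inlined into every query that references it.

```python
from collections import Counter

# Hypothetical catalog: view name -> stateful operators in its definition.
VIEWS = {"orders_enriched": ["join(orders, users)"]}

# Hypothetical downstream jobs: name -> (views referenced, own stateful operators).
JOBS = {
    "mv_daily_revenue": (["orders_enriched"], ["agg(day)"]),
    "mv_top_users": (["orders_enriched"], ["agg(user_id)"]),
    "sink_alerts": (["orders_enriched"], []),
}


def state_copies(views, jobs):
    """Count how often each stateful operator is instantiated when
    every job is planned independently (assumption: views are inlined)."""
    copies = Counter()
    for referenced, own_ops in jobs.values():
        for view in referenced:
            copies.update(views[view])  # the view body is repeated per job
        copies.update(own_ops)
    return copies


copies = state_copies(VIEWS, JOBS)
print(copies["join(orders, users)"])  # 3: one copy of the join state per job
```

In this model the join state is built three times even though all three jobs read the same view, which is the cross-query duplication described above.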

One possible solution is to optimize multiple queries at the same time so that we have a bird's-eye view. Obviously, this requires end-to-end interfaces for creating streaming queries in batches (e.g. in the optimizer, meta, and scheduler).

Another possible solution is to let users create intermediate materialized views instead of views in the second step. After users finish the third step, we can provide a way to truncate each intermediate materialized view so that it no longer materializes its input, and finally make it invisible to users.
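The second proposal can be sketched with another toy model in Python. Everything here is hypothetical (the class, its fields, and the `truncate` method are not an existing RisingWave API), and it assumes that truncating drops only the MV's own result table while the shared operator state keeps feeding downstream jobs.

```python
from dataclasses import dataclass, field


@dataclass
class IntermediateMV:
    name: str
    stateful_ops: list          # operator state, built once and shared
    materialized: bool = True   # whether the MV's own result table is stored
    visible: bool = True        # whether users can see it in the catalog
    downstream: list = field(default_factory=list)

    def attach(self, job):
        # A downstream MV or sink reads from this MV instead of re-planning it.
        self.downstream.append(job)

    def stored_tables(self):
        # One table per stateful operator, plus the result table if kept.
        return len(self.stateful_ops) + (1 if self.materialized else 0)

    def truncate(self):
        # Stop materializing the result and hide the MV from users.
        self.materialized = False
        self.visible = False


mv = IntermediateMV("orders_enriched", ["join(orders, users)"])
for job in ("mv_daily_revenue", "mv_top_users", "sink_alerts"):
    mv.attach(job)

before = mv.stored_tables()
mv.truncate()
after = mv.stored_tables()
print(before, after)  # 2 1
```

Under these assumptions the join state exists once regardless of how many jobs are attached, and truncation removes the extra result table that the intermediate MV would otherwise keep.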

lmatz commented 1 year ago

I imagine that MVs are created one after another:

  1. Some demand is generated.
  2. Data Engineer figures out how to implement it in RW SQL.
  3. Test the MV by batch queries or by running it on some data of a smaller scale, and then create MV in the production RW cluster.
  4. After a couple of days, back to step 1.

My worry is that once MVs are created, users may hesitate to change, drop, or otherwise modify them in order to share intermediate state with new MVs that are about to be created. This suggests that we may not have the luxury of optimizing multiple queries at the same time.

But I do think that truncating the intermediate materialized view is quite useful as a standalone optimization. Sometimes the user realizes only after a while that the MV they really want is a further transformation of an existing MV, which may make the existing one obsolete. Dropping the old one and completely rebuilding the new one could be slow. Rebuilding from the source may not even be possible in some cases because of the data retention limit in the upstream source.

BugenZhao commented 1 year ago

After users finish the third step, we can provide a way to truncate the intermediate materialized view and let them never materialize its input anymore and finally, make them invisible to users.

I was thinking of exactly this approach months ago!

However, after some offline discussions, we found that, in practice, more materialized views will inevitably be needed as the business grows, and "providing all queries at the same time" seems too idealistic. If we also want to apply the state-reuse optimization to these new materialized views, we have to find a way to do it incrementally, in a patch-like way.

If this really gets implemented, it could be a superset of the solution proposed in this issue: once we can optimize incrementally, it also works for creating materialized views one by one... 🤔

chenzl25 commented 1 year ago

I agree that, in practice, there will be more and more materialized views as the business grows. In any case, it seems we can provide a conversion between views and materialized views. Converting a materialized view to a view seems to be exactly the "truncate the materialized view" operation mentioned before.

github-actions[bot] commented 1 year ago

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

fuyufjh commented 1 year ago

Any further updates?
