Closed lmatz closed 1 month ago
One immediate thought is to decompose the big complex query automatically into groups of operators (limited to the dependency) and backfill each group one by one
One point to note is probably that, by "backfilling" it means that the result sets of these groups have to be materialized, so extra overhead is also introduced.
Also link to https://github.com/risingwavelabs/risingwave/issues/12771 cuz it sounds like an opposite to the idea proposed here, that is, encourage users to create multiple materialized views together based on change logs and avoid backfilling as much as possible.
Anyway, let me make it into the documentation as a suggestion first.
In terms of the extra materialization, I remember there was an issue that discussed the idea of keeping the query but removing the results in the MV.
This issue has been open for 60 days with no activity.
If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity
label.
You can also confidently close this issue as not planned to keep our backlog clean. Don't worry if you think the issue is still valuable to continue in the future. It's searchable and can be reopened when it's time. 😄
With serverless backfilling enabled on RW Cloud, this issue becomes less significant.
it may be related to #14559, just tag it here for visiblity.
One observation is that users often write a large complex query that has many CTE/subqueries and then build an MV by this query.
This is arguably not the best way to organize queries, but it does appear often.
Maybe because it is just the way how users like to organize their queries, or maybe because the queries are migrated from existing OLAP systems in tech stacks, or perhaps users are given the impression that this is the most efficient way of building an MV (e.g. maybe in terms of latency, maybe in terms of not having extra storage cost by having intermediate MVs).
Anyway, the reasons can be various and generally unknown to us. And, without users approaching us, we may not have the chance to give better suggestions.
However, it is not an efficient way of building an MV.
The reason is intuitive: running a query with 10 CTE/subqueries is similar to running 10 different queries at the same time)
There can be serious competition for resources:
We remark that the competition for resources is not just implying that each of the 10 queries gets 1/10 of total resources, so each of them is expected to deliver 1/10 of the performance, but the fact that extra overhead leads to even worse performance than 1/10.
One immediate thought is to decompose the big complex query automatically into groups of operators (limited to the dependency) and backfill each group one by one (Of course, if the resources are more than enough, we can parallelize the execution of multiple groups).
It can also be seen as inserting implicit MVs into this big complex query but these MVs are transparent to users.
Also, aggregations can produce many intermediate changes, it is simply better if we can build the state of subsequent operators and (implicit)MVs on top of the final results of backfilling instead of processing unnecessary intermediate changes.