risingwavelabs / risingwave


discussion: backfill subqueries/CTEs "one by one" in a big complex query automatically #17404

Open lmatz opened 3 weeks ago

lmatz commented 3 weeks ago

One observation is that users often write a large, complex query with many CTEs/subqueries and then build an MV from this query.
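For illustration, a hypothetical query of this shape (the table and column names below are made up, not from any real workload):

```sql
CREATE MATERIALIZED VIEW big_mv AS
WITH paid_orders AS (
    -- First stage: filter the raw stream.
    SELECT user_id, amount
    FROM orders
    WHERE status = 'paid'
),
user_totals AS (
    -- Second stage: aggregate per user.
    SELECT user_id, SUM(amount) AS total_amount
    FROM paid_orders
    GROUP BY user_id
)
-- Final stage: TopN over the aggregation.
SELECT user_id, total_amount
FROM user_totals
ORDER BY total_amount DESC
LIMIT 100;
```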

This is arguably not the best way to organize queries, but it does appear often.

Perhaps it is simply how users like to organize their queries; perhaps the queries were migrated from existing OLAP systems in their tech stacks; or perhaps users are under the impression that this is the most efficient way of building an MV (e.g., in terms of latency, or in terms of avoiding the extra storage cost of intermediate MVs).

In any case, the reasons vary and are generally unknown to us, and unless users approach us, we may not have the chance to suggest better alternatives.

However, it is not an efficient way of building an MV.

The reason is intuitive: running a query with 10 CTEs/subqueries is similar to running 10 different queries at the same time.

There can be serious competition for resources:

  1. Memory. We have learned from many recent PoC cases that high remote IOPS induced by cache misses is one of the most important factors leading to bad performance.
  2. CPU. The block cache/meta cache will be very busy bringing in and evicting data, and the data needs to be deserialized before operators can use it, which consumes extra CPU.
  3. Remote I/O. Due to (1), RW may hit the object storage's rate limit more frequently.

We note that the competition for resources does not just mean that each of the 10 queries gets 1/10 of the total resources and is therefore expected to deliver 1/10 of the performance; the extra overhead described above makes the performance even worse than 1/10.

One immediate thought is to automatically decompose the big complex query into groups of operators (subject to the dependencies between them) and backfill each group one by one (of course, if resources are more than sufficient, we can parallelize the execution of multiple groups).

It can also be seen as inserting implicit MVs into this big complex query, with these MVs being transparent to users.
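Concretely, what the proposal would do implicitly can be sketched by hand today with the hypothetical query above: materialize each stage as its own MV and create the next one only after the previous has finished backfilling (names are again made up):

```sql
-- Step 1: materialize the first CTE and let it finish backfilling.
CREATE MATERIALIZED VIEW paid_orders_mv AS
SELECT user_id, amount
FROM orders
WHERE status = 'paid';

-- Step 2: build the aggregation on top of the completed MV.
CREATE MATERIALIZED VIEW user_totals_mv AS
SELECT user_id, SUM(amount) AS total_amount
FROM paid_orders_mv
GROUP BY user_id;

-- Step 3: the final TopN on top of the aggregation.
CREATE MATERIALIZED VIEW top_users_mv AS
SELECT user_id, total_amount
FROM user_totals_mv
ORDER BY total_amount DESC
LIMIT 100;
```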

Also, aggregations can produce many intermediate changes; it is simply better if we can build the state of downstream operators and (implicit) MVs on top of the final results of backfilling instead of processing unnecessary intermediate changes.
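As a toy illustration of the intermediate-change problem (hypothetical table and values), consider backfilling three rows for the same user through a SUM aggregation; an operator stacked directly on the aggregation has to digest every intermediate update, while one built after backfilling completes only sees the final row:

```sql
-- Hypothetical rows scanned during backfill:
--   ('alice', 10), ('alice', 20), ('alice', 30)
CREATE MATERIALIZED VIEW totals AS
SELECT user_id, SUM(amount) AS total
FROM payments
GROUP BY user_id;
-- Changelog emitted while backfilling:
--   INSERT ('alice', 10)
--   UPDATE ('alice', 10) -> ('alice', 30)
--   UPDATE ('alice', 30) -> ('alice', 60)
-- An operator created after backfilling only needs the final ('alice', 60).
```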

BugenZhao commented 3 weeks ago

One immediate thought is to automatically decompose the big complex query into groups of operators (subject to the dependencies between them) and backfill each group one by one

One point to note is that "backfilling" here means the result sets of these groups have to be materialized, so extra overhead is also introduced.

Also linking to https://github.com/risingwavelabs/risingwave/issues/12771, since it sounds like the opposite of the idea proposed here: it encourages users to create multiple materialized views together based on change logs and to avoid backfilling as much as possible.

lmatz commented 2 days ago

Anyway, let me add it to the documentation as a suggestion first.

In terms of the extra materialization, I remember there was an issue discussing the idea of keeping the query but removing the results in the MV.

#12771 can also be reasonable, depending on cluster utilization and workload.