risingwavelabs / risingwave


Discussion: collect statistics of internal state tables, indexes and MVs #7317

Closed lmatz closed 1 year ago

lmatz commented 1 year ago

Although a cost-based optimizer is not a near-term goal as discussed in #7014, we have some other motivations to collect statistics:

  1. Users can use these statistics (e.g. selectivity, cardinality) to better plan their queries themselves in the future.
  2. Explainability. For example, if the cache hit rate is low, there could be multiple reasons, or it may simply be that there are too many distinct keys. Such statistics give users good evidence for deciding whether to allocate more resources.
  3. In the future, we may either Expose Materialized Views with Data Lakes or Embed 3rd-party Query Engine. Then the statistics become useful for other engines.
  4. ......

One difference between these statistics and the metrics shown on the Grafana dashboard is that these statistics need to be durable and be kept somewhere within the kernel.

fuyufjh commented 1 year ago

For the cost-based optimizer (CBO), my concern is that it would help very little for streaming queries, because:

  1. Streaming queries may also involve sources, which definitely have no statistics available.
  2. Some tables may be empty or almost empty when a materialized view is created; at that time, these statistics can even mislead the optimizer.
  3. Even if the statistics get updated, we can hardly update the existing streaming query to make it perform better.

In summary, I think CBO is good for OLAP queries but might not be helpful for streaming queries and serving.

lmatz commented 1 year ago

Yes, CBO is not worth it.

neverchanje commented 1 year ago

selectivity, cardinality

Could we first expose them as metrics?

lmatz commented 1 year ago

selectivity, cardinality

Could we first expose them as metrics?

As long as it does not rely on, and is not stored in, another third-party system.

fuyufjh commented 1 year ago

cc @hzxa21 What do you think?

hzxa21 commented 1 year ago

In summary, I think CBO is good for OLAP queries but might not be helpful for streaming queries and serving.

+1.

Setting CBO aside, I am also skeptical about the benefits of introducing statistics. I think we need to be clearer about which statistics are actually useful and how to maintain them.

  1. Users can use these statistics (e.g. selectivity, cardinality) to better plan their queries themselves in the future.
  2. Explainability. For example, if the cache hit rate is low, there could be multiple reasons, or it may simply be that there are too many distinct keys. Such statistics give users good evidence for deciding whether to allocate more resources.

Maintaining cardinality makes sense to me, but I am not sure whether streaming join selectivity needs to be durable. IMO, a user who cares about streaming join selectivity cares more about the trend of changes in selectivity, to see whether there is anything wrong with the data or the streaming SQL. This looks more like a metric shown in Grafana and doesn't need to be stored as a durable statistic.

  1. In the future, we may either Expose Materialized Views with Data Lakes or Embed 3rd-party Query Engine. Then the statistics become useful for other engines

True. But this is more about queryable state like MVs, not internal table state. We are still exploring this topic and don't have a mature idea of how to allow efficient OLAP queries from a 3rd-party engine or data lakes. We may need a columnar storage format, and in addition to per-table statistics, we may also need finer-grained per-table-per-data-file statistics. Since the focus right now is more on the serving and streaming side, I think we can postpone the discussion on this point for now.

Several ideas on how to maintain statistics:

  1. Create associated MVs for different statistics (#5529). In other words, use a streaming query to get statistics.
  2. Don't maintain statistics; just use an ad-hoc query to calculate them on demand, since the use cases are more about explainability and observability, not CBO. In other words, use a batch query to get statistics (a SQL sketch of options 1 and 2 follows after this list).
  3. Maintain per-table-per-data-file statistics (e.g. # of distinct keys for table A in file 1). This is mostly useful for OLAP queries to conduct more aggressive pruning and pre-aggregation.
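
For concreteness, here is a minimal sketch of options 1 and 2 in SQL, using a hypothetical `orders` table and `customer_id` column (not from the discussion). Whether an exact `COUNT(DISTINCT ...)` is cheap to maintain incrementally depends on the state size, so treat this as an illustration rather than a recommendation:

```sql
-- Option 1: maintain the statistics as an associated MV.
-- The MV is updated incrementally, so the numbers stay fresh and are durable by construction.
CREATE MATERIALIZED VIEW orders_stats AS
SELECT
    count(*)                    AS row_count,
    count(DISTINCT customer_id) AS customer_cardinality
FROM orders;

-- Option 2: keep no extra state and compute the same numbers on demand with a batch query.
SELECT
    count(*)                    AS row_count,
    count(DISTINCT customer_id) AS customer_cardinality
FROM orders;
```
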
lmatz commented 1 year ago

Let's hold this for a while and I will draft a doc for further discussion.

As a late and quick response: we all agree that a cost-based optimizer is not worth it. But the responsibility and complexity of query optimization are (partially) transferred to the users and are still there. This means that if a statistic would be helpful to a CBO, it should still be beneficial for users who want to alter the plan manually, although it may be beyond an average user's understanding and domain knowledge.

Streaming queries may also involve sources, which definitely have no statistics available.

True, we can only collect statistics for a while first and then benefit the queries executed in the future.

Some tables may be empty or almost empty when a materialized view is created; at that time, these statistics can even mislead the optimizer.

True. But if users choose to intervene in the optimization, they can ignore unreliable statistics. We can show how much data was collected to compute each statistic, e.g. something like a confidence interval.

Even if the statistics get updated, we can hardly update the existing streaming query to make it perform better.

It depends on how much better we can make it and how important the query is. Currently, we have no feedback from users, but it could turn out to be a killer feature. We can have a new plan running concurrently and replace the old one once it catches up, although the implementation is definitely non-trivial. One observation is that a streaming query that cares about low latency mostly focuses on data within a recent period of time, so there is not much historical data to recompute. We will collect more feedback on this in the future.

needs to be durable.

If the statistics only need to be visible to users, Grafana is indeed enough. Whether they must be durable depends on whether they will be used internally or exposed to third-party systems in the future, or whether the user only has access to psql in the terminal, which I suppose is a rare case. If we create associated MVs for different statistics, they are automatically durable, although multiple associated MVs may introduce extra overhead beyond the cost of calculating the statistics themselves.

Don't maintain statistics; just use an ad-hoc query to calculate them on demand

I think this is good enough, since it is also what traditional DBs do, e.g. Postgres's ANALYZE (or VACUUM ANALYZE), which collects statistics and stores them in pg_catalog.pg_statistic.
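
As a concrete illustration of that approach in PostgreSQL (the `orders` table name is hypothetical):

```sql
-- ANALYZE samples the table and writes the results into pg_catalog.pg_statistic.
ANALYZE orders;

-- The collected statistics can then be read through the pg_stats view,
-- e.g. the estimated number of distinct values and the most common values per column.
SELECT attname, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'orders';
```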

github-actions[bot] commented 1 year ago

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.