fengguangyuan opened this issue 2 years ago
This shouldn't be implemented as an optimizer rule. Instead, the Iceberg connector should support `applyAggregation`.
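For reference, a rough sketch of what such a hook could look like. The method signature is meant to mirror `ConnectorMetadata#applyAggregation` from the Trino SPI, but the surrounding class and the decision logic are illustrative only, not the actual Iceberg implementation.

```java
import io.trino.spi.connector.AggregateFunction;
import io.trino.spi.connector.AggregationApplicationResult;
import io.trino.spi.connector.ColumnHandle;
import io.trino.spi.connector.ConnectorSession;
import io.trino.spi.connector.ConnectorTableHandle;

import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch only: in a real connector this would be an override in the
// connector's ConnectorMetadata implementation (e.g. IcebergMetadata).
public class AggregationPushdownSketch
{
    public Optional<AggregationApplicationResult<ConnectorTableHandle>> applyAggregation(
            ConnectorSession session,
            ConnectorTableHandle handle,
            List<AggregateFunction> aggregates,
            Map<String, ColumnHandle> assignments,
            List<List<ColumnHandle>> groupingSets)
    {
        // Illustrative guard: only consider a global aggregation (a single,
        // empty grouping set); anything else is left to the engine.
        if (groupingSets.size() != 1 || !groupingSets.get(0).isEmpty()) {
            return Optional.empty();
        }
        // A real implementation would inspect `aggregates` (e.g. a lone count(*)),
        // fold whatever it can answer from file metadata into a new table handle,
        // and return an AggregationApplicationResult describing the substitution.
        // Returning empty tells the engine to keep the original plan.
        return Optional.empty();
    }
}
```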
Things to keep in mind:

- ideally we should support "mixed mode" where some files are skipped (e.g. we have `max` for a given column) and some files are still processed (we don't have `max`). Today this would require implementing the aggregations on the connector side, but it would be nice if the SPI allowed the connector and the engine to cooperate. For example, a connector would tell the SPI it's accepting the pushdown, but will return a "partial aggregation" of some sort. cc @sopel39 @martint @losipiuk
- `min` and `max` are not guaranteed to be exact:
  - `varchar` values can be truncated
  - `timestamp` values are rounded up or down to millisecond precision (in the case of ORC)
- `count` is the easiest, as it doesn't suffer from exactness doubts, but it's also probably the least important. `count(*)` requires reading no columns, so only file metadata is touched. Still, there is room for improvement.

cc @alexjo2144 @homar
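To make the "mixed mode" idea in the first bullet above concrete, here is a hypothetical sketch: the connector splits files into those whose metadata can answer the aggregation exactly and those that still need to be scanned, and hands the engine a partial result plus the remaining files. Everything in it (`FileStats`, `PartialResult`, `plan`) is made up for illustration; nothing like this exists in the SPI today.

```java
import java.util.List;
import java.util.OptionalLong;

// Hypothetical illustration of "mixed mode": files whose metadata carries an
// exact max contribute to a partial result computed up front, the rest still
// have to be scanned. FileStats and PartialResult are made-up types.
public final class MixedModeSketch
{
    public record FileStats(String path, OptionalLong exactMax) {}

    public record PartialResult(OptionalLong maxFromMetadata, List<FileStats> filesToScan) {}

    private MixedModeSketch() {}

    public static PartialResult plan(List<FileStats> files)
    {
        OptionalLong maxFromMetadata = files.stream()
                .filter(file -> file.exactMax().isPresent())
                .mapToLong(file -> file.exactMax().getAsLong())
                .max();
        List<FileStats> filesToScan = files.stream()
                .filter(file -> file.exactMax().isEmpty())
                .toList();
        // The engine would combine maxFromMetadata with the max computed by
        // scanning filesToScan -- this is the connector/engine cooperation that
        // the SPI would need a way to express ("partial aggregation").
        return new PartialResult(maxFromMetadata, filesToScan);
    }
}
```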
> ideally we should support "mixed mode" where some files are skipped (e.g. we have `max` for a given column) and some files are still processed (we don't have `max`). Today this would require implementing the aggregations on the connector side, but it would be nice if the SPI allowed the connector and the engine to cooperate. For example, a connector would tell the SPI it's accepting the pushdown, but will return a "partial aggregation" of some sort.
See also https://github.com/trinodb/trino/pull/10964, which could benefit from a similar concept.
@findepi Thanks for your reply. Yes, indeed, the points you mentioned are the key ones, and I couldn't agree more; that's why I mentioned the limitations.
I think there are two possible ways to do the optimization:

1. A rule like `ShowStatsRewrite`, which just extracts the aggregated values from the stats, if the stats are not `NULL` or `NaN` and the column stats are reliable.
2. A local aggregation in each split (simply regarded as a data file), so that whether or not a file has stats, accurate values can be computed or extracted from each data file adaptively. This needs more work on the SPI side.

Both approaches rest on the same assumption: if the expected stats are not NULLs or NaNs, they should be correct, at least for the Iceberg connector; otherwise they are not reliable and should be calculated from the real data.
Considering the implementation complexity, we simply implemented the easier approach, which skips rewriting the plan once it finds unreliable stats, or finds min/max on non-numeric columns (except the timestamp type), because Trino only carries double ranges. In those cases the plan still aggregates the real data, so correctness is guaranteed at both the table level and the column level.

After all, the rule approach is the cheapest, while the mixed-mode approach is the ideal one, but it has many more nuts to crack. :)
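As a side note on "Trino only carries double ranges": a minimal, illustrative helper over `io.trino.spi.statistics.ColumnStatistics` shows why such a rule has to skip min/max for non-numeric columns. The only range the SPI exposes is a `DoubleRange`, and even that is an estimate. The helper class itself is made up for illustration.

```java
import io.trino.spi.statistics.ColumnStatistics;
import io.trino.spi.statistics.DoubleRange;

import java.util.Optional;

// Illustrative helper: read a min/max pair out of ColumnStatistics.
// The SPI only carries a DoubleRange, and the values are estimates, so a rule
// built on this has to bail out for non-numeric columns and unreliable stats.
public final class StatsRange
{
    private StatsRange() {}

    public record MinMax(double min, double max) {}

    public static Optional<MinMax> minMaxFromStats(ColumnStatistics stats)
    {
        Optional<DoubleRange> range = stats.getRange();
        if (range.isEmpty()) {
            // No range recorded (e.g. varchar columns): fall back to scanning the data.
            return Optional.empty();
        }
        DoubleRange r = range.get();
        return Optional.of(new MinMax(r.getMin(), r.getMax()));
    }
}
```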
> which skips rewriting the plan once it finds unreliable stats, or finds min/max on non-numeric columns (except the timestamp type), because Trino only carries double ranges
Oh, you mean base the logic on `io.trino.spi.statistics.ColumnStatistics`? That's not the right API, as statistics are defined to allow them to be inexact, also for numeric types.

Per #18, the API to use for aggregation pushdown is `ConnectorMetadata#applyAggregation`.
> Oh, you mean base the logic on `io.trino.spi.statistics.ColumnStatistics`? That's not the right API, as statistics are defined to allow them to be inexact, also for numeric types.
Yep, this optimization is only for queries, and is based on `ColumnStatistics`.
> Per https://github.com/trinodb/trino/issues/18, the API to use for aggregation pushdown is `ConnectorMetadata#applyAggregation`.
Thanks for the tips, but I think the mixed-mode implementation involves far more than this one interface.

So you guys prefer the mixed-mode implementation? :)
@findepi can you please share any update on this? Are we planning to use the min/max, if present, for query optimization? Thanks.
I would probably start with an implementation of `applyAggregation` for `count(*)`. Min/max are going to have the problem of making sure that Parquet/ORC's sorting is the same as what we want for Trino sorting, but count should be a bit simpler. As long as all files report row count stats in the manifest, we don't have to read any data files (as long as there are no unenforced predicates).
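A rough sketch of the metadata-only `count(*)` idea, expressed against the Iceberg API rather than Trino code: sum `recordCount()` over the planned data files. It assumes there are no delete files and no unenforced predicates; a real implementation would have to check both before trusting the number.

```java
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch: summing record counts from data-file metadata, the way a count(*)
// pushdown could avoid reading data files. Assumes no deletes and no
// unenforced predicates.
public final class ManifestRowCount
{
    private ManifestRowCount() {}

    public static long totalRecords(Table table)
    {
        long total = 0;
        try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
            for (FileScanTask task : tasks) {
                total += task.file().recordCount();
            }
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```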
> Min/max are going to have the problem of making sure that Parquet/ORC's sorting is the same as what we want for Trino sorting
These must not be different, otherwise predicate pushdown would be totally wrong.
But min/max can be inaccurate (varchars truncated, timestamps rounded).
> As long as all files report row count stats in the manifest, we don't have to read any data files.
When some files have row count in the manifest, but some do not, microplans could be useful -- https://github.com/trinodb/trino/issues/13534
> These must not be different, otherwise predicate pushdown would be totally wrong.
IIRC in Delta we had to skip using min/max values for pushdown of Double types in certain situations.
It's not strictly because of ordering (as in `ORDER BY`). Rather, it's because of the semantics of double comparisons with NaN (`5 < NaN` and `5 >= NaN` are both false). This is behind a `Domain`'s refusal to handle NaNs (these would need to be handled explicitly, just like NULLs).
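A tiny, self-contained demonstration of the comparison semantics being referred to:

```java
public class NanComparisonDemo
{
    public static void main(String[] args)
    {
        double nan = Double.NaN;
        System.out.println(5 < nan);   // prints false
        System.out.println(5 >= nan);  // prints false
        // Both comparisons are false, so a [min, max] range derived from file
        // statistics says nothing about whether NaN values are present; they
        // have to be accounted for explicitly, much like NULLs.
    }
}
```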
@findepi @alexjo2144 even if only an aggregation like `count` can be pushed down and use the metrics (if available), it will be very useful. So, shall we start with supporting aggregate pushdown for Iceberg and handle the `count` case?
> I would probably start with an implementation of `applyAggregation` for `count(*)`. Min/max are going to have the problem of making sure that Parquet/ORC's sorting is the same as what we want for Trino sorting, but count should be a bit simpler. As long as all files report row count stats in the manifest, we don't have to read any data files (as long as there are no unenforced predicates).
@alexjo2144 I have started working on a PR for `count(*)` support. Thanks!
Hi. I am new to Iceberg and I was also thinking along similar lines, though from a different perspective. Currently Spark allows Dynamic Partition Pruning (DPP), and the underlying data sources return the filter columns only if they are partitioned. If we allow non-partition columns to also participate in DPP, then the DPP query becomes expensive and is not worth it. I am wondering, if the DPP query were lightweight and approximate (needing only the min/max values for the non-partition column used as the join key), whether we would see a benefit.
Quick update: I'm refining some of the work locally, then will open a WIP PR, and will continue to add test cases, etc.
> Hi. I am new to Iceberg and I was also thinking along similar lines, though from a different perspective. Currently Spark allows Dynamic Partition Pruning (DPP), and the underlying data sources return the filter columns only if they are partitioned. If we allow non-partition columns to also participate in DPP, then the DPP query becomes expensive and is not worth it. I am wondering, if the DPP query were lightweight and approximate (needing only the min/max values for the non-partition column used as the join key), whether we would see a benefit.
@ahshahid Sorry, I didn't get everything. Are you asking whether we can use Iceberg metrics/metadata min/max for DPP or joins? Maybe you can share an example here? Thanks.
@osscm Well, what I meant was that if this PR is able to provide min/max support at the Iceberg level using the stats (at least in simple scenarios), then it may be possible to leverage it to make Spark's Dynamic Partition Pruning (DPP) mechanism work for non-partition columns too. Right now, when I tried to make use of DPP for a non-partition column (by modifying the Iceberg code), the performance degraded because the cost of the DPP query is too high. But if the min/max gets evaluated using the stats of the manifest files, then possibly the cost of the DPP query for non-partition columns can be brought down.
@ahshahid I'm afraid not; I think you are talking about the output of `SHOW STATS FOR <tbl>`. This PR will target queries like `SELECT count(*) FROM <tbl> WHERE part1=1`, and subsequent PRs will cover min/max as well.
Right now the stats output looks like this (after running `ANALYZE`); I'm not sure why the low/high values for some of the columns come back as NULL.
show stats FOR sample_partitionedv2;

 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value  | high_value
-------------+-----------+-----------------------+----------------+-----------+------------+------------
 userid      | NULL      |                   3.0 |            0.0 | NULL      | -3         | -1
 country     |     332.0 |                   2.0 |            0.0 | NULL      | NULL       | NULL
 event_date  | NULL      |                   2.0 |            0.0 | NULL      | 2022-01-01 | 2022-11-01
 city        |     332.0 |                   1.0 |            0.0 | NULL      | NULL       | NULL
Started an issue to target `count(*)` first: https://github.com/trinodb/trino/issues/15745
> Quick update: I'm refining some of the work locally, then will open a WIP PR, and will continue to add test cases, etc.

Thanks for taking the first sub-task of this BIG issue.

Anyway, I will post a rough implementation from my branch soon, for mixed min/max/count queries, and I hope it helps us work out the possibilities and impossibilities of this idea. :)
Just FYI, I have started a PR for count agg pushdown.
Regards, Manish
Thanks, that's great. It's my pleasure to know that you have been working on `count`. I will try to figure out the possibilities for the other aggregations.
@osscm What is the status of Iceberg aggregate pushdown? Are you still working on count pushdown? Are other aggregate functions like min/max also being worked on by someone?
@atifiu there is a simpler proposal for `count(*)` handling here: https://github.com/trinodb/trino/pull/19303

The min/max case has potential correctness issues due to the bounds not being exact min/max values.
OK, got it. I am aware of that issue. Thanks.
FWIW, https://github.com/trinodb/trino/pull/19303 should improve performance for `count(*)` queries (without a filter, or when filtering over partition values).
Purpose

This issue aims to add a basic optimizer rule for min/max/count queries on connectors that have accurate table/partition/column statistics, like Iceberg backed by ORC/Parquet files.

Reason

Nowadays, most storage engines and self-describing file formats store table-level/partition-level/column-level statistics to enable more efficient data retrieval, e.g. Iceberg and Hive.

We know that Iceberg currently supports ORC/Parquet files, whose table metrics are aggregated from each data file, so its table metrics are trustworthy for calculating min(T)/max(T)/count(T)/count(*), no matter whether the stored data was written by Trino or Spark. Hence, for queries with only min/max/count aggregations, we can construct the results directly from metadata.

For example, for the query `select count(x) from test`, if column x has precomputed statistics of 2 total rows, 0 null values and a [0, 9] range, the query could be rewritten to `select 2`, where 2 is the difference between the total row count and the null count.
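To make that rewrite concrete, here is a minimal sketch of the computation in terms of the SPI statistics types (`io.trino.spi.statistics.TableStatistics` / `ColumnStatistics`). It assumes the estimates are exact, which, as discussed earlier in the thread, the SPI does not actually guarantee; the class and method names are made up for illustration.

```java
import io.trino.spi.statistics.ColumnStatistics;
import io.trino.spi.statistics.Estimate;
import io.trino.spi.statistics.TableStatistics;

import java.util.OptionalLong;

// Illustrative only: derive count(x) = total rows - null rows from statistics,
// bailing out whenever either estimate is unknown.
public final class StatsBasedCount
{
    private StatsBasedCount() {}

    public static OptionalLong countFromStats(TableStatistics tableStats, ColumnStatistics columnStats)
    {
        Estimate rowCount = tableStats.getRowCount();
        Estimate nullsFraction = columnStats.getNullsFraction();
        if (rowCount.isUnknown() || nullsFraction.isUnknown()) {
            return OptionalLong.empty();
        }
        double nonNullRows = rowCount.getValue() * (1.0 - nullsFraction.getValue());
        return OptionalLong.of(Math.round(nonNullRows));
    }
}
```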
Conclusion

Trino should supply an optimizer rule that rewrites such queries from metadata, doing the same kind of thing as HIVE-2847. Obviously, this rule only applies to simple queries without complex syntax such as group by, distinct, join, etc.

We now have a basic implementation of this and have tested it on the Iceberg connector (rather than on all connectors, considering that their statistics may be inaccurate). If Trino would welcome this improvement, please let me know, and it would be my pleasure to open a PR.