trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.15k stars 2.93k forks source link

Add support for lowValue and highValue statistics for TIMESTAMP type in engine #5860

Open losipiuk opened 3 years ago

losipiuk commented 3 years ago

Currenly lowValue, highValue statistics for TIMESTAMP type are not supported. See: https://github.com/prestosql/presto/blob/3c8cc38ce567cba6550b80c761e39fbe90a8df05/presto-main/src/main/java/io/prestosql/cost/StatsUtil.java#L40-L65

ericxiao251 commented 2 years ago

If no one is working on this, I would love to take a stab at it :).

ericxiao251 commented 2 years ago

Hey team I would like to work on this issue, but I would like some guidance on where to start.

I started by looking at the tests for similar column stats (date column stat tests) and I noticed that we do not have any tests for the timestamp column even though we do calculate some stats here for the timestamp.

I also noticed we do not have related data classes for timestamp columns, we have metastore.api.Date and metastore.api.DateColumnStatsData but there is no metastore.api.TimestampColumnStatsData class.


I think the first thing to address would to be add tests for the existing timestamp stats which require also adding a new class metastore.api.TimestampColumnStatsData?

losipiuk commented 2 years ago

The date tests you reference above are just for testing conversion between Hive's and Trino's statistics model. Hive uses DateColumnStatsData for that. For TIMESTAMP columns though Hive uses LongColumnStatsData (see here). Unit tests for that class are here.

We also have high-level tests regarding exposing statistics for TIMESTAMP columns for Hive. Those are here

The work on the feature would involve updating TestHiveTableStatistics. It should be minimalistic, I would expect just updating assertions, that actual min/max stats for timestamp columns are collected/returned. Then you need to add mapping both directions Trino->Hive, Hive->Trino. The mapping must be in-line with logic in StatsUtil. I would suggest looking how it is done for some other type and follow the pattern.