timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

Add support for chunk column statistics tracking #7020

Open nikkhils opened 3 weeks ago

nikkhils commented 3 weeks ago

Allow users to specify that ranges (min/max values) be tracked for a specific column using the enable_column_stats() API. We store such min/max ranges in a new TimescaleDB catalog table, _timescaledb_catalog.chunk_column_stats. For now, min/max range tracking is supported for the smallint, int, bigint, serial, bigserial, date, timestamp, and timestamptz data types. Support for other statistics (bloom filters etc.) will be added in the future.

We add an entry of the form (ht_id, invalid_chunk_id, col, -INF, +INF) into this catalog to indicate that min/max values need to be calculated for this column for chunks of a given hypertable. We also iterate through existing chunks and add -INF, +INF entries for them in the catalog. This allows these chunks to be selected by default, since no min/max values have been calculated for them yet.
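The default-selection behavior of the sentinel entries can be sketched as a toy Python model (the tuple layout and function names here are illustrative only, not the actual catalog schema or C implementation):

```python
import math

# Toy model of _timescaledb_catalog.chunk_column_stats rows: a chunk whose
# range is still (-inf, +inf) can never be excluded, so it is scanned by
# default until real min/max values are computed for it.
catalog = [
    # (hypertable_id, chunk_id, column, range_start, range_end, valid)
    (1, 101, "device_id", -math.inf, math.inf, True),
    (1, 102, "device_id", -math.inf, math.inf, True),
]

def may_contain(row, query_lo, query_hi):
    """A chunk may match unless its [start, end) range misses the query range."""
    _ht, _chunk, _col, start, end, valid = row
    if not valid:  # invalidated entry (partial chunk): never exclude
        return True
    return query_lo < end and query_hi >= start

# With sentinel ranges, every chunk is selected regardless of the predicate.
assert all(may_contain(r, 0, 10) for r in catalog)
```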

The actual min/max start/end range is calculated later; for now, one of the entry points is compression. The range is stored in start (inclusive) and end (exclusive) form. If DML happens on a compressed chunk, then as part of marking it as partial we also mark the corresponding catalog entries as "invalid", so partial chunks are no longer excluded. When recompression happens, we get the new min/max ranges from the uncompressed portion and reconcile the ranges in the catalog based on these new values. This is safe to do for INSERTs and UPDATEs. For DELETEs, since rows are removed, the actual min/max ranges may shrink, but as of now we err on the side of caution and retain the earlier values, which can be wider than the actual range.
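The conservative reconciliation described above can be sketched as follows (a toy Python model of the behavior, not the actual C code; integer ranges assumed for the exclusive-end adjustment):

```python
def reconcile(old_start, old_end, new_min, new_max):
    """Widen a stored [start, end) range with min/max values seen in freshly
    (re)compressed data. Conservative: the range only grows, never shrinks,
    which is always safe for INSERT/UPDATE; after a DELETE the stored range
    may overestimate the true one."""
    start = min(old_start, new_min)
    end = max(old_end, new_max + 1)  # end is exclusive, hence +1 for integers
    return start, end

# New data extends the range on both sides.
assert reconcile(10, 20, 5, 30) == (5, 31)
# New data entirely inside the old range: unchanged.
assert reconcile(10, 20, 12, 15) == (10, 20)
```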

We can thus store the min/max values for such columns in this catalog table at the per-chunk level. Note that these min/max ranges do not participate in partitioning of the data. They are used for chunk pruning when the WHERE clause of a SQL query specifies ranges on such a column.
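The pruning step amounts to an interval-overlap test between the query's range and each chunk's tracked [start, end) range. A minimal sketch, with hypothetical chunk IDs and device_id ranges:

```python
def prune_chunks(chunks, query_lo, query_hi):
    """Keep only chunks whose tracked [start, end) range overlaps the
    WHERE-clause range [query_lo, query_hi]. Illustrative only."""
    return [cid for cid, (start, end) in chunks.items()
            if query_lo < end and query_hi >= start]

# device_id range per chunk (hypothetical values)
chunks = {101: (0, 100), 102: (100, 200), 103: (200, 300)}

# WHERE device_id BETWEEN 120 AND 180 -> only chunk 102 needs scanning
assert prune_chunks(chunks, 120, 180) == [102]
```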

A "DROP COLUMN" on a column with statistics tracking enabled removes all relevant entries from the catalog tables.

A "decompress_chunk" on a compressed chunk removes its entries from the "chunk_column_stats" catalog table, since the chunk is now available for DML.

A new "disable_column_stats" API has also been introduced to allow removal of min/max entries from the catalog for a specific column.

svenklemm commented 3 weeks ago

Hmmm, I'm not a fan of the "dimension" wording. Did we have a discussion yet about how we want to name this thing?

codecov[bot] commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 82.90441% with 93 lines in your changes missing coverage. Please review.

Project coverage is 81.80%. Comparing base (59f50f2) to head (f7bfc0a). Report is 241 commits behind head on main.

| Files | Patch % | Lines |
| --- | --- | --- |
| src/ts_catalog/chunk_column_stats.c | 84.48% | 35 Missing and 30 partials :warning: |
| src/hypertable_restrict_info.c | 66.03% | 17 Missing and 1 partial :warning: |
| src/dimension.c | 0.00% | 5 Missing :warning: |
| src/chunk.c | 92.59% | 0 Missing and 2 partials :warning: |
| src/process_utility.c | 90.47% | 0 Missing and 2 partials :warning: |
| src/chunk_adaptive.c | 88.88% | 0 Missing and 1 partial :warning: |

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #7020      +/-   ##
==========================================
+ Coverage   80.06%   81.80%   +1.73%
==========================================
  Files         190      201      +11
  Lines       37181    37900     +719
  Branches     9450     9845     +395
==========================================
+ Hits        29770    31005    +1235
+ Misses       2997     2943      -54
+ Partials     4414     3952     -462
```


nikkhils commented 3 weeks ago

> Hmmm, I'm not a fan of the "dimension" wording. Did we have a discussion yet about how we want to name this thing?

@svenklemm no further discussion. I am trying to name it for what it does, track a column's min/max range. The word dimension in English means a measurable value.

akuzm commented 3 weeks ago

> Hmmm, I'm not a fan of the "dimension" wording. Did we have a discussion yet about how we want to name this thing?

In my imagination, the Correct Technical Term for this is "chunk-level column metadata", complementary to the "batch-level column metadata" in native compression and the "Parquet row group metadata" in CNS. Not sure it's a good user-facing term though.

The word "range" itself is probably not good, because it's very likely that we're going to add the bloom filters.

Arguably, "metadata" is too generic and can mean anything at all. Maybe "chunk-level column statistics", but this clashes with the postgres statistics used for planning.

For batch-level, we can call it a "sparse index" which conveys the function of excluding the compressed batches very well, but this name doesn't really translate to chunks...

nikkhils commented 3 weeks ago

> Arguably, "metadata" is too generic and can mean anything at all. Maybe "chunk-level column statistics", but this clashes with the postgres statistics used for planning.
>
> For batch-level, we can call it a "sparse index" which conveys the function of excluding the compressed batches very well, but this name doesn't really translate to chunks...

@svenklemm @akuzm How about chunk_restrictinfo as the name for the catalog instead of dimension_range? Essentially, we are trying to add metadata which helps in eventual chunk exclusion. But then the API changes to add_chunk_restrictinfo or remove_chunk_restrictinfo. Not sure if that's too much to expose to a user.

Or maybe, chunk_excludeinfo

To keep it simple, we could even call the catalog chunk_column_stats. That keeps it very generic and user friendly IMO.

nikkhils commented 2 weeks ago

FINAL decided names:

- catalog: chunk_column_stats
- APIs: enable_column_stats / disable_column_stats

nikkhils commented 1 day ago

@erimatnor @akuzm @svenklemm @mkindahl Added a bunch of additional tests in the latest force push.