prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Add support for incremental update of NDV values in Iceberg #21591

Open tdcmeehan opened 6 months ago

tdcmeehan commented 6 months ago

With #20993, we now have the ability to write NDVs according to the Iceberg spec. However, we are not leveraging an important feature of the storage of NDVs, which is their ability to be incrementally updated.

Expected Behavior or Use Case

Users of Iceberg should be able to update their NDV statistics without having to perform ANALYZE again after table updates.

Presto Component, Service, or Connector

Iceberg Connector

Context

The main purpose for using Theta sketches in the NDV counts in Iceberg is the ability to incrementally update the sketches when updates occur. It is far more convenient to maintain existing statistics than to re-run ANALYZE when needed.
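To illustrate why sketch-based NDVs can be maintained incrementally, here is a minimal sketch of a K-Minimum-Values (KMV) estimator, a simplified relative of Theta sketches. It is not the Apache DataSketches implementation and not Presto's or Iceberg's code; the class `KMVSketch`, its methods, and the choice of SHA-256 as the hash are all assumptions made for illustration. The point it shows is that merging the sketch of the existing table with a sketch of only the newly appended rows gives the same result as sketching the whole table again, assuming both sketches use the same `k`.

```python
import hashlib
import heapq


class KMVSketch:
    """Illustrative K-Minimum-Values distinct-count sketch (hypothetical, not a library API)."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []         # max-heap (stored negated) of the retained hashes
        self._members = set()   # same hashes, for de-duplication

    @staticmethod
    def _hash(value):
        # Map a value to a 64-bit integer; SHA-256 is an arbitrary choice here.
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def _offer(self, h):
        # Keep only the k smallest distinct hashes seen so far.
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.remove(evicted)
            self._members.add(h)

    def update(self, value):
        self._offer(self._hash(value))

    def merge(self, other):
        # Union: the k smallest hashes of the combined retained sets.
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._offer(h)
        return merged

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct values: exact count
        kth_smallest = -self._heap[0] / 2**64  # k-th smallest hash, scaled to [0, 1)
        return (self.k - 1) / kth_smallest


# Existing table statistics, computed once (e.g. by an earlier ANALYZE).
existing = KMVSketch()
for v in range(5000):
    existing.update(v)

# Sketch built only from newly appended rows (overlapping values are fine).
appended = KMVSketch()
for v in range(4000, 9000):
    appended.update(v)

# Incremental update: merge instead of rescanning the table.
merged = existing.merge(appended)

# Reference: a sketch built from scratch over all rows.
full = KMVSketch()
for v in range(9000):
    full.update(v)

print(merged.estimate(), full.estimate())
```

Because the merged sketch retains exactly the same hashes as the from-scratch sketch, the two estimates are identical. Note that this only covers appended data; handling deletes is a separate problem that this sketch does not address.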

CC: @ZacBlanco

ZacBlanco commented 6 months ago

Theoretically, if we're going to support incrementally updating NDVs, supporting other stats like min and max should also be trivial. IMO whatever design we come up with for this should probably consider incrementally updating all available statistics (especially histograms, when they become available!)
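As a rough illustration of why min and max are easy to maintain on append, the following sketch merges per-column bounds from the existing table with bounds computed over only the new data. The `ColumnBounds` type and its `merge` method are hypothetical names for this example, not part of Presto or Iceberg, and the example assumes append-only changes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnBounds:
    """Hypothetical holder for a column's min/max; None means no non-null values seen."""
    minimum: Optional[int]
    maximum: Optional[int]

    def merge(self, other: "ColumnBounds") -> "ColumnBounds":
        # Combine bounds from two disjoint sets of rows, ignoring missing sides.
        mins = [v for v in (self.minimum, other.minimum) if v is not None]
        maxs = [v for v in (self.maximum, other.maximum) if v is not None]
        return ColumnBounds(
            min(mins) if mins else None,
            max(maxs) if maxs else None,
        )


table_bounds = ColumnBounds(minimum=10, maximum=500)    # existing statistics
appended_bounds = ColumnBounds(minimum=3, maximum=120)  # computed from new files only
updated = table_bounds.merge(appended_bounds)
print(updated)
```

The merge never needs to look at the existing rows again, which is the same property the NDV sketches provide.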

ethanyzhang commented 5 months ago

@tdcmeehan Do we have plans to get to this issue any time soon?

tdcmeehan commented 5 months ago

@yzhang1991 Not really. This is low priority because, in lakehouse settings, I believe only Trino actually writes NDVs. Ideally, Spark would support this (and write them), which would greatly improve their availability.