Open tdcmeehan opened 6 months ago
Theoretically, if we're going to support incrementally updating NDVs, updating other stats like min and max should also be trivial to add support for. IMO whatever design we come up with for this should probably consider incrementally updating all available statistics (especially histograms, when they become available!)
@tdcmeehan Do we have plans to get to this issue any time soon?
@yzhang1991 not really. This is low priority because, in Lakehouse settings, I believe only Trino actually writes the NDVs. Ideally, Spark would also support this (and write them), which would greatly improve the availability of NDVs.
With #20993, we now have the ability to write NDVs according to the Iceberg spec. However, we are not leveraging an important feature of the storage of NDVs, which is their ability to be incrementally updated.
Expected Behavior or Use Case
Users of Iceberg should be able to update their NDV statistics without having to perform ANALYZE again after table updates.
Presto Component, Service, or Connector
Iceberg Connector
Possible Implementation
Example Screenshots (if appropriate):
Context
The main purpose for using Theta sketches in the NDV counts in Iceberg is the ability to incrementally update the sketches when updates occur. It is far more convenient to maintain existing statistics than to re-run ANALYZE when needed.
CC: @ZacBlanco
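To illustrate the incremental-update property mentioned under Context, here is a toy sketch of the idea. It is a simplified K-Minimum-Values sketch (the idea underlying Theta sketches), not the Apache DataSketches implementation and not Presto's code; `K`, `build_sketch`, `union`, and `estimate` are hypothetical names chosen for this example. It shows that a sketch of the existing data can be merged with a sketch of only the newly written rows, without rescanning the table.

```python
import hashlib

K = 1024  # hypothetical sketch size; larger K gives a tighter estimate


def _hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def build_sketch(values, k=K):
    """Keep the k smallest distinct hash values, sorted ascending."""
    return sorted({_hash01(v) for v in values})[:k]


def union(a, b, k=K):
    """Merge two sketches: the k smallest hashes of the combined set."""
    return sorted(set(a) | set(b))[:k]


def estimate(sketch, k=K):
    """Estimate the number of distinct values summarized by the sketch."""
    if len(sketch) < k:
        return float(len(sketch))  # fewer than k distinct hashes: exact count
    return (k - 1) / sketch[k - 1]  # standard KMV estimator


# Sketch stored from a previous ANALYZE over the existing rows.
existing = build_sketch(range(0, 50_000))
# Sketch computed only over rows appended since then (overlapping values).
appended = build_sketch(range(40_000, 80_000))

# Incremental update: merge the sketches instead of rescanning everything.
updated = union(existing, appended)
full_rescan = build_sketch(range(0, 80_000))
```

In this toy model, `updated` is identical to `full_rescan`, which is the property that makes maintaining the statistic cheaper than re-running ANALYZE. Note that this only covers appended data; handling deletes is a separate design question that the sketch above does not address.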