Open jchappelow opened 3 months ago
Ok, this is a lot to digest, but I'll do my best to provide feedback and thoughts. Overall, this sounds like what we need, and I can 100% see how we could compute selectivity using a logical query plan.
Most of my concerns are regarding refreshing and persisting statistics deterministically, as well as the ability for intra-procedure updates.
I totally understand why we can't have a full statistics refresh after every block; however, I am unsure how we can ensure deterministic refresh with snapshots. Say we have a network of two nodes that prunes every 100 blocks. One node has been a validator for the entirety of the network, while the other joins at snapshot 100 and only begins playing transactions at block 101.
Say a schema has a statistics refresh at block 98, and then has some data written in block 99. How will the new node (which is starting at snapshot height 100) get the same statistics? I would normally assume that the persisted statistics are part of consensus (and thus replicated within snapshots), but is this possible given that they are not atomically committed with the rest of the block?
Total conjecture, but is there a way to handle this via the Postgres SPI, such that statistics are computed prior to being committed? Even if possible, this would obviously be a non-trivial departure from what you have now, but I'm just spit-balling.
Unfortunately, I think it is probably inevitable that we will need both intra-block and intra-procedure statistics, since there is an obvious attack vector if we don't. We don't need them right now (this is still a major improvement over the non-existent protection we currently have), but I do think it is inevitable. No further comments other than this; just getting something working with fixed statistics for the lifetime of a block is adequate for now.
Related to https://github.com/kwilteam/kwil-db/issues/410, this issue describes the main tasks related to table statistics.
Purpose
The goal of statistics in a DB engine's query planner is to find an optimal (i.e. faster) plan in terms of the cost of the candidate plans. In Kwil, the goal is to price the execution of a transaction (or part of a transaction) fairly and defensively. Thus, we balance competing objectives:
- Defensiveness: a query should never be priced far below the work it actually requires, so expensive queries cannot be executed cheaply.
- Fairness: a query that processes or returns far fewer rows than a full table scan should not be charged as if it touched the whole table.
Statistics and Selectivity
Statistics on the data in user datasets are necessary for obtaining selectivity estimates for conditional clauses (e.g. `WHERE x < $1`) by providing the cost estimation code with a more accurate count of the number of rows that pass through any logical plan that has the conditional as an input plan in the tree. Considered together with the expressions and data types, this should reflect the cost of execution. For more information, see https://www.postgresql.org/docs/current/row-estimation-examples.html

Selectivity Overview
The general idea of selectivity is `rows = rel_cardinality * selectivity`, where `rel_cardinality` is the number of rows in a table, and we wish to know the number, `rows`, that would be selected by some condition with `selectivity`, some number on [0,1]. For example, a table with 10,000 rows and a condition with selectivity 0.05 yields an estimate of 500 rows. Selectivity is thus based on some knowledge of the distribution of values in a column, and some provided value (e.g. `$1` in the less-than comparison above).

NOTE: if we only wanted to achieve the first goal in the Purpose section (defensiveness), only the table's total row count (`rel_cardinality`) would need to be considered. Providing a fair cost for a query that processes or returns far fewer rows requires estimating selectivity and maintaining statistics to do that.

Statistics
To produce meaningful selectivity estimates, we are defining column statistics with several key pieces of information: the column's minimum and maximum values, a set of most common values (MCVs) with their frequencies, and a histogram of the remaining values. The capacity of the MCV set and histogram plays a role comparable to `default_statistics_target` with postgres. The MCVs and histogram together account for 100% of all rows observed.
MCVs
The most common value (MCV) set consists of two parallel arrays:
- `MCVals []any` (or `[]T` with generics), which contains values of the column's type, in ascending order
- `MCFreqs []int`, which contains the corresponding occurrence count of each of the values in `MCVals`

These have a hard capacity limit. In a full table scan, this capacity may be reached well before all values have been observed. I toyed with a multi-pass scan approach, but it was costly and complex. The approach used by PostgreSQL internally is to spill new values into a histogram when the MCV set is at capacity.
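To make the spill behavior concrete, here is a minimal Go sketch for an int64 column; the package, type, and names (`stats`, `ColumnStats`, `mcvCapacity`, `observe`) are illustrative assumptions, not the actual kwild code:

```go
package stats

import "sort"

// mcvCapacity is a hypothetical hard cap on the MCV set size, comparable in
// spirit to postgres's default_statistics_target.
const mcvCapacity = 100

// ColumnStats is an illustrative container for one int64 column's MCV set;
// the real kwild types are generic over the column's type.
type ColumnStats struct {
	MCVals  []int64 // most common values, kept in ascending order
	MCFreqs []int   // occurrence count of each value in MCVals
}

// observe records one scanned value in the MCV set, reporting false when the
// set is at capacity and the value must spill into the histogram instead.
func (cs *ColumnStats) observe(v int64) bool {
	i := sort.Search(len(cs.MCVals), func(j int) bool { return cs.MCVals[j] >= v })
	if i < len(cs.MCVals) && cs.MCVals[i] == v {
		cs.MCFreqs[i]++ // already tracked: bump its count
		return true
	}
	if len(cs.MCVals) >= mcvCapacity {
		return false // full: the caller spills v into the histogram
	}
	// insert v at position i to keep MCVals sorted
	cs.MCVals = append(cs.MCVals, 0)
	copy(cs.MCVals[i+1:], cs.MCVals[i:])
	cs.MCVals[i] = v
	cs.MCFreqs = append(cs.MCFreqs, 0)
	copy(cs.MCFreqs[i+1:], cs.MCFreqs[i:])
	cs.MCFreqs[i] = 1
	return true
}
```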
Histogram
A histogram is an array of counts, where each count is the number of times a value within a bin was observed. A bin has boundaries in the units of the column's type. PostgreSQL builds histograms where each bin has the same number of counts (bins have different widths), allowing it to omit the counts array. We are constrained by a single pass scan and the need to perform updates continually, so we use a traditional histogram that has uniform widths and different counts per bin. Rebalancing of the histogram is costly and only done in a full refresh of the statistics.
NOTE: "the histogram does not include the portion of the column population represented by the MCVs"
Defining a histogram's bounds involves interpolation for every supported type of value. The full range of the histogram is based on the values observed up to that point in the scan (i.e. when MCVs begin to spill). The first and last bins are catch-all bins for out-of-range values.
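Continuing the illustrative package above, a rough sketch of a uniform-width histogram with catch-all end bins; the layout and names are assumptions for illustration only:

```go
// Histogram is an illustrative fixed-width histogram for an int64 column.
// Interior bins evenly divide [Lo, Hi); Counts[0] and Counts[len-1] are
// catch-all bins for values outside the range observed when bounds were set.
type Histogram struct {
	Lo, Hi int64 // range covered by the interior bins
	Counts []int // per-bin observation counts (len >= 3)
}

// binFor returns the index of the bin that holds v.
func (h *Histogram) binFor(v int64) int {
	n := len(h.Counts)
	if v < h.Lo {
		return 0 // below the observed range
	}
	if v >= h.Hi {
		return n - 1 // above the observed range
	}
	width := (h.Hi - h.Lo) / int64(n-2) // uniform interior bin width
	if width == 0 {
		width = 1
	}
	i := 1 + int((v-h.Lo)/width)
	if i > n-2 {
		i = n - 2 // clamp rounding at the upper edge
	}
	return i
}

// add records one observation of v in its bin.
func (h *Histogram) add(v int64) { h.Counts[h.binFor(v)]++ }
```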
Computing Selectivity
It is best to read https://www.postgresql.org/docs/current/row-estimation-examples.html
In short, given a condition like `WHERE x < 42`, the combined observations in the MCVs and histogram give a reduced row count. Since the `MCVals` array is sorted, the corresponding `MCFreqs` are summed up to the location determined by the condition. However, the MCVs list may account for, say, 40% of all values in the table, while the histogram accounts for the rest, in this example 60%. The histogram selectivity and its fraction are then combined with the MCV selectivity.

This is a basic example. The condition can be (in)equality, or a combination of conditions such as a range, etc. See the docs page linked above for more on other cases.
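To make the combination concrete, here is a hedged sketch of estimating selectivity for a `col < c` predicate using the illustrative types above. The arithmetic follows the general approach in the postgres row-estimation docs (MCV fraction plus the non-MCV fraction scaled by histogram selectivity), but it omits within-bin interpolation and is not the actual kwild cost code:

```go
// selectivityLT estimates the fraction of a table's rows with value < c by
// combining the MCV frequencies with the histogram for the non-MCV remainder.
func selectivityLT(cs *ColumnStats, h *Histogram, totalRows int64, c int64) float64 {
	if totalRows == 0 {
		return 0
	}
	// MCV part: fraction of all rows whose value is an MCV satisfying v < c,
	// plus the total fraction of rows covered by the MCV set.
	var mcvTotal, mcvMatch int64
	for i, v := range cs.MCVals {
		mcvTotal += int64(cs.MCFreqs[i])
		if v < c {
			mcvMatch += int64(cs.MCFreqs[i])
		}
	}
	mcvSel := float64(mcvMatch) / float64(totalRows)

	// Histogram part: fraction of histogram counts in bins entirely below c
	// (within-bin interpolation omitted for brevity).
	var histTotal, histMatch int64
	cBin := h.binFor(c)
	for i, n := range h.Counts {
		histTotal += int64(n)
		if i < cBin {
			histMatch += int64(n)
		}
	}
	histSel := 0.0
	if histTotal > 0 {
		histSel = float64(histMatch) / float64(histTotal)
	}
	// Scale the histogram selectivity by the share of rows not in the MCV set.
	histFrac := 1 - float64(mcvTotal)/float64(totalRows)

	return mcvSel + histFrac*histSel
}
```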
Building and Maintaining Statistics
Full Scan
A full scan of a table (i.e. `SELECT * FROM ... ORDER BY pk, ...`) is a method for building statistics from scratch. These are the "ground truth" statistics for a table. We may consider a postgres extension that uses SPI to do this and avoid the protocol overhead incurred by doing it in the application.

When a new dataset is created, its tables are empty and this is not needed. Where a full scan is applicable is the periodic refresh/rebuild of table statistics, and perhaps recovery. See the next section on Updates for an explanation of why periodic refresh is needed.
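For illustration, a sketch of what the full-scan rebuild could look like from the application side using database/sql and the types above; the query shape, identifier handling, and the omitted spill/bounds logic are assumptions, and an SPI-based extension would look quite different:

```go
import (
	"context"
	"database/sql"
	"fmt"
)

// rebuildColumnStats rescans a single int64 column in primary-key order and
// rebuilds its MCV set from scratch. Identifiers are assumed to be validated
// elsewhere; the spill into the histogram once the MCV set fills (and setting
// the histogram bounds at that point) is omitted.
func rebuildColumnStats(ctx context.Context, db *sql.DB, table, pk, col string) (*ColumnStats, error) {
	cs := &ColumnStats{}
	rows, err := db.QueryContext(ctx,
		fmt.Sprintf(`SELECT %s FROM %s ORDER BY %s`, col, table, pk))
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	for rows.Next() {
		var v int64
		if err := rows.Scan(&v); err != nil {
			return nil, err
		}
		_ = cs.observe(v) // false means the value would spill to the histogram
	}
	return cs, rows.Err()
}
```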
Continuous Updates
All changes (insert/update/delete) to a table are captured in the logical replication monitor, and these change sets are used to continuously update a table+column's statistics.
Over time, the statistics can become poorer representations of the table than if a full rescan were performed. Some simple examples: the fixed MCV set may no longer contain the values that are actually most common, and the histogram's range and bin boundaries, which were set from values observed during the last full scan, may no longer reflect the current distribution (new out-of-range values pile up in the catch-all end bins).
Continuous updates are relatively fast and cheap, but may lead to stale statistics.
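A sketch of how a captured change set might be folded into the statistics; the `ChangeOp` shape is invented for illustration and is not the logical replication monitor's actual type:

```go
// ChangeOp is an invented, simplified view of one row change for a single
// int64 column as captured by logical replication: Old is nil for an INSERT,
// New is nil for a DELETE, and both are set for an UPDATE.
type ChangeOp struct {
	Old *int64
	New *int64
}

// applyChange folds one change into the column's statistics. Note that a
// deleted value that is not in the MCV set can only decrement a histogram
// bin, which is one source of the drift corrected by periodic refresh.
func applyChange(cs *ColumnStats, h *Histogram, op ChangeOp) {
	if op.Old != nil {
		decrement(cs, h, *op.Old)
	}
	if op.New != nil {
		increment(cs, h, *op.New)
	}
}

func increment(cs *ColumnStats, h *Histogram, v int64) {
	if !cs.observe(v) { // MCV set full: count it in the histogram instead
		h.add(v)
	}
}

func decrement(cs *ColumnStats, h *Histogram, v int64) {
	for i, mv := range cs.MCVals {
		if mv == v {
			if cs.MCFreqs[i] > 0 {
				cs.MCFreqs[i]--
			}
			return
		}
	}
	if b := h.binFor(v); h.Counts[b] > 0 {
		h.Counts[b]--
	}
}
```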
Periodic Refresh
To address the above issues with continuous updates leading to stale statistics, datasets are scheduled for periodic refresh of the statistics.
The scheduling is presently based on: (1) current height, (2) the dataset's dbid, and (3) a constant modulo value e.g. 20. This determines if a dataset should have its table statistics rebuilt at a given height. This staggers the rescans, rather than rebuilding all datasets' statistics at the same time. It is also deterministic.
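A hedged sketch of that kind of deterministic, staggered check; the hash choice and interval constant are assumptions, not the actual implementation:

```go
import "hash/fnv"

// refreshInterval is an illustrative stagger period in blocks.
const refreshInterval = 20

// shouldRefresh reports whether a dataset's statistics are due for a full
// rebuild at the given height. Hashing the dbid spreads datasets across
// different heights while remaining deterministic on every node.
func shouldRefresh(height int64, dbid string) bool {
	h := fnv.New64a()
	h.Write([]byte(dbid))
	return uint64(height)%refreshInterval == h.Sum64()%refreshInterval
}
```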
NOTE: PostgreSQL does periodic statistics rebuilds as part of its "autovacuum" process. AFAIK, it does no continuous updates. The frequency of autovacuum is affected by many factors, and is not predictable between deployments.
Persistence of Statistics
Since the statistics are not rebuilt at every height, and because continuous updates can lead to a drift in the statistics from the ground truth, a node restart cannot simply begin by rebuilding statistics with a full scan. We must store each table's current statistics so that a node will start with exactly the same statistics it had on shutdown (or crash). All nodes will have the same view of statistics (and thus query cost) in this event.
Statistics are polymorphic. This makes it very cumbersome to store the stats in the relational DB. In postgresql, there is a concept of an `anyarray`, which is used e.g. in the MCVs array. To applications, this concept can only be used as a function arg, not a column type. I considered several possible solutions with tables, like creating a stats schema and a table for each column like `ds_<dbid>_stats.rel_col` (min T, max T, mc_vals T[], mc_freqs int[], ...). Ultimately I could not discern any benefit to that effort, only complexity and possible interference with the DB's intended purpose.

Implemented. Given the above, I opted for a very simple and effective approach: implementing `encoding.BinaryMarshaler` and `encoding.BinaryUnmarshaler` for all data types used to represent column statistics, and then using the `encoding/gob` package to create a blob containing the statistics that is only understood (or useful) to the `kwild` application. This can be stored as a `bytea` in a special statistics table with the fully-qualified table name as the primary key. For now, I'm writing this blob to a file on disk. It is stored at the end of every block. The main downside is that its storage is not atomic with the consensus DB commits, so there are consequences of an ill-timed crash that would require a state rebuild.
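A minimal sketch of the gob round trip, reusing the illustrative `ColumnStats` type from earlier; the real kwild code implements `BinaryMarshaler`/`BinaryUnmarshaler` per statistics type, which this glosses over:

```go
import (
	"bytes"
	"encoding/gob"
)

// marshalStats encodes a table's per-column statistics into a single blob
// that only kwild needs to understand. The blob can be written to a bytea
// column keyed by the fully-qualified table name, or to a file on disk.
func marshalStats(stats map[string]*ColumnStats) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(stats); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// unmarshalStats restores a blob at startup so the node resumes with exactly
// the statistics it had when it stopped.
func unmarshalStats(blob []byte) (map[string]*ColumnStats, error) {
	var stats map[string]*ColumnStats
	err := gob.NewDecoder(bytes.NewReader(blob)).Decode(&stats)
	return stats, err
}
```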
Updates within DB transaction (between statements)
This needs investigation. I have a list of ideas including audit tables. I'm working on all the above first.
Computing cost for statements within procedures
The query planner, access to statistics, and the cost model that combines both need to be usable within a procedure.
As a starting point, we can compute cost for all of a procedure's SQL statements prior to executing the procedure. Ultimately, if we want to account for cascading effects from one statement to the next, we need a better solution.
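As a rough illustration of that starting point, a sketch that sums the estimated cost of each statement before the procedure runs; the `CostEstimator` interface is a placeholder, not an actual kwild API:

```go
// CostEstimator is a placeholder for the combination of query planner,
// statistics, and cost model described above.
type CostEstimator interface {
	EstimateCost(sqlStmt string) (int64, error)
}

// procedureCost prices a procedure by summing the estimated cost of each of
// its SQL statements prior to execution. It cannot account for the cascading
// effects of one statement on the next; that requires intra-procedure
// statistics updates, which are future work.
func procedureCost(est CostEstimator, stmts []string) (int64, error) {
	var total int64
	for _, s := range stmts {
		c, err := est.EstimateCost(s)
		if err != nil {
			return 0, err
		}
		total += c
	}
	return total, nil
}
```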