
Constraint Support and Optimizations #16413

Closed simmend closed 2 years ago

simmend commented 3 years ago

This GitHub issue describes the design of the constraint support for PrestoDB. The original document is at https://docs.google.com/document/d/1h9C7dJck2PFPtvhksUCB74082zIn9sAZG1jNlQ7kqaU/edit. For readers' convenience, the content is copied here.

1. Background

Constraints are rules that apply to rows and columns of a database schema to ensure the integrity of the data stored under the schema [1]. For example, a predicate check constraint might ensure that each employee record has a salary exceeding a certain minimum threshold. A primary key constraint might ensure that all employee records have a unique employee identifier. In practice, a database schema defines various types of constraints. These include predicate checks, primary keys, unique keys, foreign keys, and functional dependencies. 

Query optimization strategies that exploit constraints can improve query execution plan performance by orders of magnitude [2]. For example, unique key constraint optimizations can eliminate redundant distinct operations [3] or a redundant distinct attribute from an aggregation function that might otherwise require an inefficient mark-distinct execution plan. Foreign key constraint optimizations can help to eliminate redundant joins [4] or detect a lossless join through which a top N or aggregation operation can be pushed [5]. A modern query optimizer must have constraint optimization capabilities, as queries in practice are often generated by tools or naive users and can therefore be unnecessarily complicated and poorly written. Consequently, the optimizer must be relied upon to eliminate redundancies and other inefficiencies.

A database system might enforce constraints when data is inserted into a table or updated. A query optimizer can always trust enforced constraints [6]. However, enforced constraints add validation overhead to insert, update, and delete operations. Conversely, an application might enforce constraints itself and then declare them to the database system as informational constraints so that the optimizer can exploit them [7]. Virtually all modern database systems support enforced constraints, informational constraints, or both. Hive 3.1.2 catalog capabilities allow for the definition of informational primary key and unique constraints on Hive tables [8,9].

The PrestoDB optimizer needs general-purpose support for constraint optimization. This support should manifest as an extensible optimization framework for associating logical properties with the result table produced by a plan node, which optimization rules then exploit to generate more efficient plans. These logical properties might initially derive from primary key and unique key constraints defined on Hive tables. Logical properties derived from key constraints would then be augmented and refined by the grouping, limiting, predicate application, and other operations performed by plan nodes. New optimization rules would exploit these logical properties to discover and remove unnecessary distincts, sorts, and other redundant query operations. Future work would extend this optimization framework with support for logical properties derived from referential integrity constraints, functional dependencies, order dependencies, and other types of constraints. This future work would also extend the optimizer with new optimization rules that exploit these properties.

2. Externals

The primary goal of this feature is to extend PrestoDB with constraint optimization capabilities. Hence, this feature adds read-only support for constraints defined directly on Hive tables and defers the ability to define table constraints via PrestoDB to follow-on work. That follow-on work requires careful consideration of how to define and validate constraints that generalize across a wide variety of backend systems. Section 5.4 discusses the various considerations. This feature extends PrestoDB with two small external changes: 1) Explain is enhanced to display the key constraints defined for an accessed table; 2) a new session variable is added that allows constraint optimization to be enabled and disabled. The remainder of this section provides a brief description of how Hive defines constraints. It then briefly describes these new externals.

[Figure 1: Hive key constraint definition syntax, illustrated with an employee table] [Figure 2: HiveMetaStoreClient methods for retrieving table constraints]

2.1 Constraints in Hive Metastore

Hive defines key constraints as illustrated in figure 1. In general, constraints are defined either in-line with a column definition or out-of-line after all column definitions. Out-of-line constraint definitions can include an optional name. Constraints can have an optional set of properties that define validation and query optimization properties. The ENFORCED and NOT ENFORCED attributes are part of the ANSI SQL standard and determine, respectively, whether Hive enforces the constraints or not. The ENFORCED option is currently not supported by Hive. The RELY and NORELY attributes indicate, respectively, whether an optimizer can reliably exploit the constraints or not. These attributes are extensions to the standard supported in some form by various database vendors. The ENABLE, DISABLE, VALIDATE, NOVALIDATE attributes provide for Oracle compatibility. The combination of ENABLE and VALIDATE is equivalent to ENFORCED. The PrestoDB optimizer will exploit constraints where either ENFORCED or RELY is true. As is illustrated by the employee table definition in figure 1, there can be a single primary key constraint and multiple unique constraints defined per table. Since both key constraints include the RELY attribute, they are eligible for constraint optimization.
The Hive Metastore client methods shown in figure 2 [9] are used at compile time to retrieve constraints defined for a table. These methods are available as of version 3.0. Retrieved constraints map to an internal representation added to ConnectorTableMetadata. This internal representation is subsequently passed as an argument to TableScanNode, propagated as LogicalProperties, and exploited by new optimization rules, as described in subsequent sections.

[Figure 3: Explain output showing the key constraints of the employee table in the TableScan node]

2.2 Explain Extensions

Explain displays constraints defined for a table via the TableScan node output. Figure 3 illustrates Explain output for a query that accesses the employee table in figure 1. Providing table constraint visibility via SHOW CREATE TABLE will be deferred until the CREATE and ALTER TABLE work is undertaken in follow-on work, as discussed in section 5.4.

2.3 Disabling Constraint Optimization

This feature extends the optimizer with a new logical properties framework and optimization rules for constraint optimization. A new session variable exploit_constraints allows logical property propagation, and hence constraint optimization, to be enabled and disabled. The default value for this session variable is true. The statement set session exploit_constraints = false can be issued to disable constraint optimization.

3. High-Level Design

This section provides a high-level overview of the logical properties framework added in support of constraint optimization. It also overviews the logical properties and optimization rules added in support of key constraints. Section 4 provides additional design details.

3.1 Logical Properties Optimization Framework

Figure 4 illustrates the logical properties optimization framework added in support of constraint optimization. 

[Figure 4: Logical properties optimization framework, including pseudo-code for logical property computation on Memo insertion and for an optimization rule that exploits the properties]

The framework is exclusively associated with iterative optimization. Logical properties are computed for each plan node when it is inserted into a memo group. The computation of logical properties effectively occurs in a bottom-up fashion as plan nodes are inserted into the memo structure. The logical properties of table access nodes derive from constraints defined on stored tables. Other nodes propagate and augment the properties of their source nodes. For example, a distinct node would add a unique key to the logical properties propagated from its source. In general, the logical properties of the result table produced by a plan node are a function of the logical properties of its source and the particular operation performed by the plan node.

All sub-plans of a memo group can share the same logical properties as they produce an equivalent result table. Hence, the first plan node added to a group computes and caches the logical properties for the group. The current iterative optimizer implementation does not yet keep alternative plans for a group. However, as cost-based optimization capabilities evolve, a group will indeed have multiple members. The high-level design for constraint optimization aligns with this direction. A plan node is inserted into the memo structure as a group reference. Logical properties of the group are referred to via this reference. Consequently, a parent node can find the logical properties of its source nodes via the group reference, as can optimization rules that exploit these properties. Logical property computation is enabled for a given optimizer by passing a logical properties provider. The logical properties provider includes plan node-specific property computation methods. Figure 4 provides pseudo-code for logical property computation that illustrates how the Filter node in the example computes its logical properties upon insertion into group 2. The node first retrieves its source properties from group 1 via the group reference and invokes the Filter-specific logical properties provider method on those source properties. Logical property computations analyze a plan node's source properties and argument expressions. Section 4 gives more information on the logical properties provider and the various analyses performed to propagate logical properties associated with key constraints.
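
As a concrete illustration of this pattern, here is a minimal Java sketch using simplified, hypothetical stand-in types (`LogicalProps`, `LogicalPropsProvider`, integer group ids, a string predicate) rather than the actual Presto classes:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for Presto's Memo groups and logical properties.
interface LogicalProps {}

interface LogicalPropsProvider {
    // Derives the properties of a filter's result table from its source
    // properties and the applied predicate.
    LogicalProps getFilterProperties(LogicalProps sourceProps, String predicate);
}

final class MemoSketch {
    // Each group caches one LogicalProps instance shared by all of its members.
    private final Map<Integer, LogicalProps> groupProps = new HashMap<>();

    void putLeafProps(int group, LogicalProps props) {
        groupProps.put(group, props);
    }

    // Inserting a filter node into a group: look up the source group's cached
    // properties via its group reference, compute the filter's properties, and
    // cache them for the new group. Since children are inserted before their
    // parents, properties are effectively computed bottom-up.
    void insertFilter(int group, int sourceGroup, String predicate, LogicalPropsProvider provider) {
        LogicalProps sourceProps = groupProps.get(sourceGroup);
        groupProps.put(group, provider.getFilterProperties(sourceProps, predicate));
    }
}
```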

Iterative optimizer rules use logical properties to determine their applicability and result transformation. The properties supported in the first phase of this work derive from key constraints. Optimizer rules can exploit these properties to remove redundant distinct, sort, top, and other operations. Figure 4 provides illustrative pseudo-code for an optimization rule that consults logical properties to find an unnecessary distinct operation for removal. This rule uses logical properties to determine when the grouping key attributes of the aggregation node of group 5 are already a key in the table produced by the node's source. If this is the case, the aggregation node can be removed.

3.2 Logical Properties Supporting Key Constraints

The three logical properties added in support of key constraints are the key property, maxcard property, and equivalence class property. This section provides a high-level overview of these properties and their relationships. Section 4 gives further details on how these properties are maintained and propagated through various plan operations.

3.2.1 Key Property

A key constraint or key is a set of attributes that uniquely identify a record in a stored table or the result table produced by a plan node. A table can have more than one key. For example, employees might be uniquely distinguishable by either their employee id or social security number. In this case, the corresponding attributes of a table that holds employee records would each be declared as a primary or unique key of the table. Keys can have more than one attribute. The key property maintains the set of keys that hold for an associated result table.

The key property determines when a key requirement is satisfied. A key requirement is a set of attributes for which a key constraint must hold for a plan node result table. Key requirements can arise in various query processing contexts. For example, a distinct keyword is added to a query to ensure that a result table is free of duplicate values. The selected query attributes form a key requirement that must be enforced by the query execution plan. Enforcement might require the execution plan to perform an expensive distributed sorting or hashing operation. If the optimizer detects that the key requirement is already satisfied by an existing key, it can avoid adding this costly operation. In general, a key in a key property satisfies a key requirement if its attributes are a subset of the key requirement's attributes. As a simple example, the distinct operation in the query “select distinct c1, c2, c3 from T” is redundant and need not be enforced by an execution plan if the attributes c1, c2 are declared as a unique key.
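
The subset test itself is simple once properties are in canonical form. A sketch, under the simplifying assumptions that attributes are plain strings and normalization (section 3.2.3) has already been applied:

```java
import java.util.List;
import java.util.Set;

final class KeySatisfactionSketch {
    // A key requirement is satisfied when some key's attributes are a subset
    // of the requirement's attributes. For "select distinct c1, c2, c3 from T"
    // with unique key {c1, c2}: {c1, c2} is a subset of {c1, c2, c3}, so the
    // distinct operation is redundant.
    static boolean satisfies(List<Set<String>> keys, Set<String> keyRequirement) {
        return keys.stream().anyMatch(keyRequirement::containsAll);
    }
}
```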

The key property is initialized straightforwardly from constraints declared for stored tables. Plan nodes that enforce a distinct or that perform aggregation contribute additional keys. Propagation of the key property requires sophisticated analysis of the effects of predicate application, projections, joins, aggregations, and other query operations that comprise an execution plan. Section 4.4 provides further details on how the key property is maintained and propagated through such query operations.

3.2.2 Maxcard Property

The maxcard property maintains the maximum provable cardinality of rows in a result table. The maxcard property is either set to a whole number or is unknown. The maxcard property is related to the key property. It is set to one when all attributes of a key become bound to constants through predicate application. When this one record condition is detected, the key property becomes trivial and its keys are removed, saving further maintenance overhead. Optimizer rules can exploit a one record condition to eliminate sorts, aggregations, and other operations that become trivially redundant when there is at most one record in a result table. Limit or Values nodes can also set the maxcard property. In these cases, it can take on a value greater than one. Maxcard always maintains the smallest upper bound to which a result set's cardinality is provably constrained. Section 4.4 provides further details on how the maxcard property is maintained and propagated through various plan node types.

3.2.3 Equivalence Class Property

The equivalence class property maintains subsets of attribute references and constants that are made equivalent through predicate application. Equivalence classes are central to logical property maintenance and exploitation. Recall that a key satisfies a key requirement if its attributes are a subset of the key requirement's attributes. The test would fail for the key requirement {C, D} and key {A, B}. However, the test should succeed after conjunct predicates A=B and B=C are applied. Equivalence classes facilitate this determination. The process involves substituting each attribute of the key and key requirement with their equivalence class head before performing the subset test. The equivalence class head is a representative member of an equivalence class. Assume that attribute A is the equivalence class head of equivalence class {A, B, C} that results from applying the predicates. The key would be rewritten as {A, A} after equivalence class head substitution and more succinctly as {A} after removing the duplicate attribute reference. Furthermore, the key requirement becomes {A, D} after equivalence class head substitution. It is trivial to determine that the key satisfies the key requirement via a straightforward subset test after both are written in this canonical form via equivalence class head substitution.

As illustrated by the previous discussion, equivalence classes help enable the efficient representation of the key property as duplicate attributes can be readily detected and removed. It is possible to simplify keys further by removing attributes that become bound to constants via predicate application. For example, if attributes {A, B} form a key, and the equality conjunct B=10 is applied, then attribute {A} is sufficient to determine uniqueness after predicate application as all values of B are identical and of no use in distinguishing uniqueness. Hence, removing B further simplifies the key. Equivalence class head substitution facilitates the detection of bound attributes as constants are consistently chosen as equivalence class head. A key can become empty after the removal of constant and duplicate attributes. This situation occurs when all attributes of a key become bound to constants, signaling a one record condition. The maxcard property is set to one and the key property is emptied when this occurs.

After the application of predicates and removal of duplicate and bound attributes, one or more keys in the key property can become redundant relative to others. For example, assume a key property contains key {A} and key {B, C}. Assume further that conjunct predicate A=B is applied. After predicate application, key {B, C} can be rewritten as {A, C}, making it redundant relative to key {A}, as the latter satisfies any key requirement the former does. The process of substituting attributes of keys in the key property with their equivalence class heads and removing duplicate attributes, constants, and superfluous keys is called normalization. Normalization is central to efficient key property maintenance and also to key property exploitation. A key property and key requirements are both transformed into canonical representations using normalization before performing the test to determine if any key is a subset of a key requirement. Equivalence classes will continue to play a central role in constraint optimization, as referential integrity constraints, functional dependencies, and other logical properties will leverage them to achieve a concise canonical representation.
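
A sketch of normalization under simplifying assumptions: attributes are strings, an eqClassHead map returns the class head (a constant or a representative attribute) for each member, and keys are already restricted to interesting attributes. These names are hypothetical, not the actual implementation:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class KeyNormalizationSketch {
    // Substitute equivalence class heads, drop constants and duplicates, then
    // drop keys that are supersets of another key. An empty key afterwards
    // signals a one record condition (maxcard = 1).
    static List<Set<String>> normalize(
            List<Set<String>> keys,
            Map<String, String> eqClassHead, // attribute -> equivalence class head
            Set<String> constants)           // heads known to be constants
    {
        List<Set<String>> rewritten = new ArrayList<>();
        for (Set<String> key : keys) {
            Set<String> canonical = new LinkedHashSet<>(); // set semantics drop duplicates
            for (String attr : key) {
                String head = eqClassHead.getOrDefault(attr, attr);
                if (!constants.contains(head)) {
                    canonical.add(head); // attributes bound to constants are dropped
                }
            }
            rewritten.add(canonical);
        }
        List<Set<String>> normalized = new ArrayList<>();
        for (Set<String> key : rewritten) {
            // A key is superfluous if some other, distinct key is a subset of it.
            boolean superfluous = rewritten.stream()
                    .anyMatch(other -> !other.equals(key) && key.containsAll(other));
            if (!superfluous) {
                normalized.add(key);
            }
        }
        return normalized;
    }
}
```

For the example above, normalizing keys [{A}, {B, C}] after A=B is applied rewrites {B, C} to {A, C}, which is then dropped as superfluous relative to {A}.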

3.3 Optimization Rules Exploiting Key Constraints

Various classes of optimization rules can exploit key constraints. One class of rules eliminates redundancies that can occur in generated or poorly written queries. This section describes a few such optimizations.

3.3.1 Distinct Elimination

An aggregation node that performs duplicate removal can be removed from an execution plan if its grouping key attributes are already a key in the table produced by its source operator. Figure 4 illustrates pseudo-code for such a rule. The rule first casts the source node to a group reference and retrieves its logical properties. It then uses the logical properties to test if the grouping key is already a key. If so, the rule returns the source operator as output. The optimization can yield orders of magnitude more efficient execution plans, as distributed aggregation is an expensive batch operation, possibly requiring a data shuffle. The sub-plan input to the aggregation operator might be quite complicated. The logical property implementation employs sophisticated techniques to propagate key constraints through joins, projections, filters, and other operations. Section 4 discusses property propagation in greater detail.
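
A minimal sketch of this rule's shape, with hypothetical simplified types standing in for Presto's AggregationNode, GroupReference, and rule interfaces:

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical simplified types; the real rule consults LogicalProperties
// cached in the Memo through a GroupReference.
interface PlanNode { PlanNode getSource(); }
interface LogicalProps { boolean keySatisfied(Set<String> keyRequirement); }
interface GroupReference extends PlanNode { LogicalProps getLogicalProperties(); }

final class RemoveRedundantDistinctSketch {
    // Returns the source node (dropping the distinct-enforcing aggregation)
    // when the grouping keys are provably a key of the source's result table.
    Optional<PlanNode> apply(PlanNode aggregation, Set<String> groupingKeys) {
        GroupReference source = (GroupReference) aggregation.getSource();
        if (source.getLogicalProperties().keySatisfied(groupingKeys)) {
            return Optional.of(source); // distinct is redundant
        }
        return Optional.empty(); // rule does not apply
    }
}
```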

3.3.2 Distinct Aggregate Elimination

An aggregation node that performs a group by operation can specify the elimination of duplicate values from aggregation function calculations. This can add significant computational overhead to the function calculation, as is the case when a mark distinct execution strategy is needed to handle multiple such functions. Duplicate removal can be avoided for a particular aggregation function when the grouping key attributes and aggregate function argument attributes combine to form a key in the result table produced by the source of the aggregation node. This test is performed in turn for each aggregate function. It follows that if the grouping key attributes alone are already a key then duplicate elimination is unnecessary for all aggregate function calculations as there is guaranteed to be only a single row per group.
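
A sketch of the per-function test, assuming the same simplified, hypothetical LogicalProps interface as the earlier examples:

```java
import java.util.HashSet;
import java.util.Set;

final class DistinctAggregateCheckSketch {
    interface LogicalProps { boolean keySatisfied(Set<String> keyRequirement); }

    // Duplicate removal for one aggregate function is unnecessary when the
    // grouping keys plus that function's argument attributes together form a
    // key of the aggregation's source result table.
    static boolean distinctIsRedundant(
            LogicalProps sourceProps, Set<String> groupingKeys, Set<String> aggregateArgs) {
        Set<String> requirement = new HashSet<>(groupingKeys);
        requirement.addAll(aggregateArgs);
        return sourceProps.keySatisfied(requirement);
    }
}
```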

3.3.3 Limit, TopN Elimination

Limit and TopN operations can be removed from an execution plan when the maxcard of the source node’s result table is known and is less than or equal to the specified limit constraint.
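
For example, a minimal check (a hypothetical helper; the real rule consults the MaxCardProperty through LogicalProperties):

```java
import java.util.OptionalLong;

final class LimitEliminationSketch {
    // A Limit or TopN node is redundant when the provable maximum cardinality
    // of its source is known and already within the limit count.
    static boolean limitIsRedundant(OptionalLong sourceMaxCard, long count) {
        return sourceMaxCard.isPresent() && sourceMaxCard.getAsLong() <= count;
    }
}
```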

3.3.4 Batch Operation Elimination

Batch operations, such as those implemented by Sort, TopN, or DistinctLimit (see aside) nodes, can be removed from an execution plan when it can be proven that the result set produced by their source node has at most a single row. This is true when the logical properties of the source signal a one record condition.

Aside: Ideally, any distinct or limit operation redundancy would be dealt with separately, prior to the firing of the rule that merges the distinct and limit into a distinct-limit operation. Unfortunately, that rule sits ahead of the rule that translates original expressions to row expressions. The logical property optimization framework deals exclusively with row expressions and cannot be moved ahead of that rule. As the expression translation rule moves forward in query compilation, rule reordering will ensure that a distinct-limit operation is not formed when either the distinct or limit operation is redundant. For now, we catch the case where the distinct-limit operation is redundant due to a single row source rather than undo the merge of the distinct and limit operations.

4. Design Details

This section discusses various design details of the logical properties optimization framework that will assist a reviewer’s understanding of the implementation.

4.1 Logical Property Optimization Framework Interfaces and Classes

This section provides a brief overview of important classes and interfaces involved in the logical property optimization framework implementation. The associated Javadoc gives additional details.

4.1.1 LogicalProperties and LogicalPropertiesImpl

Logical properties encapsulate constraints that hold for a final or intermediate result table produced by a plan node. As illustrated in Figure 4, they are associated with a Memo group, as all sub-plans in a group produce a logically equivalent result table. In this first iteration of constraint optimization, the LogicalProperties interface exposes methods that can be queried by optimization rules in order to determine when redundant operations can be eliminated because of key constraints. Figure 4 shows pseudo-code for an optimization rule that queries the interface to determine when an AggregationNode that enforces uniqueness is unnecessary. The optimization rule tests the LogicalProperties interface of the node's source to determine if the attributes of its grouping key argument are already satisfied by a key in the source's result table. As another example, an optimization rule tasked with removing an unnecessary SortNode would query the interface to determine if a one record condition holds for its source's result table. As new types of constraints are supported, the LogicalProperties interface will expose additional methods that can be exploited by optimization rules. For example, when referential constraints are supported, the interface will expose methods that can be used by optimization rules concerned with pushing operations through lossless joins. Section 5 discusses such future extensions.
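
The shape of the interface, as a sketch; the method names and signatures below are illustrative assumptions rather than the exact SPI:

```java
import java.util.Set;

// Illustrative sketch of the query-style methods exposed to optimization rules.
interface LogicalPropertiesSketch {
    // True when the given variables already contain a key of the result
    // table, i.e. a distinct over them is provably redundant.
    boolean isDistinct(Set<String> keyVariables);

    // True when the result table provably contains at most n rows; n == 1 is
    // the one record condition exploited to drop sorts and similar operations.
    boolean isAtMost(long n);
}
```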

LogicalPropertiesImpl provides an implementation for the LogicalProperties interface. In this first phase of constraint optimization specific to key constraints, each instance encapsulates KeyProperty, MaxCardProperty, and EquivalenceClassProperty instances. Section 3.2 gave an overview of these properties and their relationships. LogicalPropertiesImpl also supplies a set of builders for propagating these properties through predicate application, projection, join, and other query operations. Section 4.4 discusses the propagation of each of these properties in greater detail. LogicalProperties and LogicalPropertiesImpl are defined in different modules because of project dependencies. Section 4.2.1 discusses this particular issue in greater detail.

4.1.2 LogicalPropertiesProvider and LogicalPropertiesProviderImpl

The LogicalPropertiesProvider interface defines a suite of plan node-specific methods for the computation of logical properties. It specifies a default implementation that produces an empty set of logical properties and, additionally, a suite of plan node-specific overrides of the default implementation. The LogicalPropertiesProviderImpl class provides an implementation of this interface. It implements the plan node-specific methods for logical property computation using the previously described property propagation builders supplied by LogicalPropertiesImpl. The LogicalPropertiesProvider mechanism enables a plan node to receive its logical property computation capabilities via dependency injection. Section 4.2.1 discusses the rationale for this design decision.
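
A sketch of the provider's shape, with hypothetical simplified node and property types; the real interface enumerates the actual plan nodes:

```java
// Hypothetical simplified types.
interface Props {}
interface FilterNodeSketch {}
interface AggregationNodeSketch {}

interface LogicalPropertiesProviderSketch {
    // Default: plan nodes without a specific override contribute no properties.
    default Props getDefaultProperties() {
        return new Props() {};
    }

    // Plan node-specific overrides, implemented in presto-main where argument
    // expression analysis is possible.
    Props getFilterProperties(FilterNodeSketch node, Props sourceProperties);
    Props getAggregationProperties(AggregationNodeSketch node, Props sourceProperties);
}
```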

A plan node computes its logical properties upon insertion into a Memo group. Figure 4 illustrates how a FilterNode performs this task using the LogicalPropertiesProvider. The implementation of the FilterNode's computeLogicalProperties method invokes the getFilterProperties method of the LogicalPropertiesProvider, passing its source's logical properties as an argument. The logical properties of the source are retrieved by casting it to a GroupReference and retrieving the group's properties via that reference. As the FilterNode produces its result table by applying predicates to its source table, the implementation of getFilterProperties exploits builder classes in LogicalPropertiesImpl that determine the effects of predicate application on the KeyProperty, MaxCardProperty, and EquivalenceClassProperty. Computing logical properties upon Memo insertion effectively causes them to be computed in a bottom-up fashion, as the insertion of a parent occurs after the insertion of its child nodes. The LogicalPropertiesProvider interface and LogicalPropertiesImpl are defined in different modules due to project dependencies. Section 4.2.1 discusses this issue further.

4.1.3 KeyProperty, MaxCardProperty, EquivalenceClassProperty

Instances of the KeyProperty, MaxCardProperty, and EquivalenceClassProperty represent constraints that hold for a result table produced by a query plan operator. Section 3.2 provided an overview of these properties, their relationships, and their use in key constraint optimization. Section 4.4 details invariants maintained for these properties, as well as how they are initialized and propagated through various types of query operations.

4.2 Logical Properties Optimization Framework Considerations

This section discusses a few design details related to how the logical properties optimization framework is integrated into PrestoDB.

4.2.1 Dependency Injection

Constraint optimization computes logical properties that hold for the result table produced by a plan node. The computation involves the analysis of a plan node's argument expressions. PrestoDB distributes plan nodes across two modules: presto-spi and presto-main. Argument expression classes reside in presto-main. Classes in presto-spi are visible to those in presto-main; however, the inverse is not the case. Consequently, it is not feasible to put a plan node's logical property computation code in an instance method, as the expression classes referenced by that code are not visible to plan nodes in presto-spi. Instead, the logical property computation code is built in presto-main and injected into a plan node's instance method at optimization time. The LogicalPropertiesProvider and LogicalPropertiesProviderImpl described in section 4.1.2 serve this purpose. They supply a suite of plan node-specific methods for logical property computation. The LogicalPropertiesProvider interface sits in presto-spi as it includes methods for plan nodes located in both modules. The implementation LogicalPropertiesProviderImpl is built in presto-main as it comprises the logic for analyzing a plan node's argument expressions. It is supplied to a plan node's logical property computation method when that method is invoked at optimization time.
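
A sketch of the resulting wiring, with hypothetical names; the point is only which module each piece lives in:

```java
// In presto-spi (sketch): plan nodes see only the provider interface, so they
// need no compile-time dependency on presto-main expression classes.
interface ProviderSketch {
    Object getScanProperties(Object tableScanNode);
}

abstract class SpiPlanNodeSketch {
    // The provider implementation is injected at optimization time rather
    // than constructed here, inverting the module dependency.
    Object computeLogicalProperties(ProviderSketch provider) {
        return provider.getScanProperties(this);
    }
}

// In presto-main (sketch): the implementation can freely reference the
// expression classes needed to analyze plan node arguments.
final class ProviderImplSketch implements ProviderSketch {
    @Override
    public Object getScanProperties(Object tableScanNode) {
        return new Object(); // would derive properties from table constraints
    }
}
```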

4.2.2 Iterative Optimization Integration

As described in section 3.1, the logical properties optimization framework is integrated exclusively with the iterative optimizers. The framework leverages the iterative optimizer planning model to compute logical properties in a bottom-up fashion. It also leverages the Memo structure to associate logical properties with groups of equivalent plans. Visitor optimizers are not extended with a logical properties capability as they do not fit a uniform planning model. Moreover, a long-term design goal is to remove visitor optimizers over time in favor of iterative optimizers.

4.2.3 Expressions

Logical property propagation performs analysis of a plan node’s argument expressions. The argument expressions analyzed include assignments, predicates, grouping keys, and other types of arguments. The LogicalPropertyProviderImpl provides plan node-specific methods that do this analysis. The current implementation of these methods deals exclusively with row expressions. Consequently, plan nodes with argument expressions written in terms of original expressions will not have their logical properties propagated as expected. Therefore, an iterative optimizer instance with rules looking to exploit logical properties must follow the optimizer that translates original expressions to row expressions. Future work should focus on moving expression translation earlier into the query compilation process, as opposed to extending logical property propagation with support for original expressions.

4.4 Property Propagation

This section details how the KeyProperty, MaxCardProperty, EquivalenceClassProperty are propagated through various query plan operations. It describes propagation in terms of predicate application, projection, joins, and other query operations rather than in terms of plan nodes as the latter might perform one or more operations. For example, a table scan node might initialize properties, apply predicates, and perform a projection. The KeyProperty and EquivalenceClassProperty are constraints on a plan node’s result table attributes.

4.4.1 Invariants

The following invariants are maintained by the KeyProperty, MaxCardProperty, and EquivalenceClassProperty as they are propagated through various query operations.

KeyProperty Invariants

As described in section 3.2.3, the KeyProperty employs normalization to maintain a concise canonical form. Normalization enforces the following invariants:

- Key attributes are substituted with their equivalence class head.
- Constant and duplicate attributes are removed.
- Keys whose attributes are a superset of another key's attributes are removed.

The EquivalenceClassProperty is central to normalization. Consequently, operations like predicate application and projection that might change the EquivalenceClassProperty trigger normalization of the KeyProperty to ensure a canonical form. The KeyProperty maintains these additional invariants:

- Only interesting keys are included. A key is interesting if all of its attributes are referenced in a plan node's output. Hence, the projection of any of a key's attributes renders the entire key uninteresting, and the KeyProperty excludes uninteresting keys.
- The KeyProperty is emptied when MaxCard is set to one. This situation can occur when a key becomes bound to constants via normalization; however, it can also happen when a final limit or top operation restricts a result to a single record. Keys are trivial and superfluous information for a result set constrained to a single row.

EquivalenceClassProperty Invariants

The EquivalenceClassProperty is maintained in a concise canonical form that satisfies the following invariants:

- Constants are always chosen as the equivalence class head. If there are no constants in a given equivalence class, an arbitrary attribute is selected as the head.
- Singleton equivalence classes are not stored. Returning the equivalence class head of an attribute or constant of an unstored singleton class is an identity operation that returns the original attribute or constant.
- Only interesting non-constant attributes are stored; that is, uninteresting non-constant attributes are projected out of an equivalence class. Projection of the head causes a new head to be selected. Moreover, projection can reduce an equivalence class to a singleton class, which is then not stored.

MaxCardProperty Invariants

The MaxCardProperty maintains a single invariant: the tightest possible cardinality constraint is stored. For example, if a MaxCardProperty instance was set to one because a key became bound, it cannot later be set to a larger value by a limit or top operation.
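
A sketch of the invariant, assuming a simple OptionalLong representation where empty means unknown (a hypothetical simplification of the actual class):

```java
import java.util.OptionalLong;

final class MaxCardSketch {
    private OptionalLong value = OptionalLong.empty(); // unknown cardinality

    // Updates keep the tightest bound: a known value can only decrease.
    void update(long candidate) {
        if (!value.isPresent() || candidate < value.getAsLong()) {
            value = OptionalLong.of(candidate);
        }
    }

    OptionalLong get() {
        return value;
    }
}
```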

4.4.2 Property Initialization

Leaf plan nodes are responsible for initializing logical properties. A TableScanNode initializes logical properties from constraints defined in the accessed table's catalog. These constraints are provided via connector metadata and represent attributes in terms of column handles. In this first iteration of constraint optimization, each unique and primary key constraint defined for a table adds a key to the key property. The mapping of key constraints to keys involves translating column handles to variable reference expressions. A TableScanNode initializes the maxcard property to unknown and the equivalence class property to empty, which implies that all attributes and constants referenced in the query are each effectively in a singleton equivalence class. A ValuesNode initializes the MaxCard property to the number of rows in the table it represents and sets all other properties to empty.

4.4.3 Predicate Application

The application of conjunct predicates that equate attributes and constants can change the equivalence class property. When equivalence classes change, specifically when equivalence class heads change, the key property must be normalized so that its canonical form is maintained in alignment with these changes. Normalization of the key property can set the maxcard property to one, which occurs when it detects that a key has become fully bound.

4.4.4 Projection

Projection operations cause the elimination of uninteresting keys and equivalence class attributes, as per the invariants described in section 4.4.1. A projection can also cause an attribute context change that requires reassignment of key and equivalence class attributes. If a key attribute does not have an assignment in the new attribute context, it is mapped to the assignment of an equivalent attribute whenever possible. For example, assume A is a key attribute and there is no new assignment for A. Assume further that A and B are in the same equivalence class and there is an assignment from B to B'. Consequently, A can be assigned to B' rather than being projected away. Projection of the equivalence class property is followed by key property normalization, as projection can cause equivalence class heads to change. The maxcard property is not affected by projection operations.

4.4.5 Aggregation

Aggregation operations, such as those performed by an AggregationNode or DistinctLimitNode, form a new key from the attributes of their key arguments. This new key is added to the source key property, which is then normalized. Normalization can set the maxcard property to one. This situation occurs when all key argument attributes are bound to constants. Setting maxcard to one causes a one record condition, which empties the key property. Finally, an aggregation operation projects the key and equivalence class properties using its output attributes.

4.4.6 AssignUniqueId operations

An AssignUniqueId operation generates a unique key. This key is added to the key property of the source. All other source properties are propagated unchanged.

4.4.7 Limit

Final limit and final top N operations update the source maxcard property with the value of their count argument. This update will only change the source maxcard property if the value is smaller than the current property setting. A one record condition occurs if the count is one, which empties the key property.

4.4.8 Joins

Propagation of the source properties of the join requires a sophisticated analysis of the characteristics of the join.

Key and MaxCard Properties

An inner or left join propagates the key property and maxcard property of the left source if the join is n-to-1, meaning that each row of the left source matches at most one row of the right source. Determining that a join is n-to-1 involves forming a key requirement from the equi-join attributes of the right table and querying the logical properties of the right table to determine if those attributes form a unique key. Semi-joins are inherently n-to-1. Conversely, an inner or right join can propagate the key property and maxcard property of the right source if the join is 1-to-n. If an inner join is 1-to-1, which is the case when it is both n-to-1 and 1-to-n, then it follows from the above that the key property of the join result comprises the union of the left source keys and right source keys.

If an inner join is instead m-to-n, meaning that it is neither n-to-1 nor 1-to-n, the key property of the join is formed by concatenating the left source and right source key properties. Concatenating two key properties forms a new key for every possible combination of keys. For example, if key property KP1 has keys {A} and {B,C} and key property KP2 has keys {D} and {E}, then concatenating KP1 and KP2 yields a key property with keys {A,D}, {A,E}, {B,C,D}, and {B,C,E}. An m-to-n inner join or outer join propagates the product of the left and right MaxCardProperty values if both are known. Full outer joins do not propagate source key properties as they can inject null rows into the result.
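
A sketch of concatenation, using the same simplified string-attribute representation as the earlier examples:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class KeyConcatenationSketch {
    // Every pairing of a left key with a right key yields a key of an m-to-n
    // join result, e.g. [{A}, {B,C}] x [{D}, {E}] ->
    // [{A,D}, {A,E}, {B,C,D}, {B,C,E}].
    static List<Set<String>> concat(List<Set<String>> leftKeys, List<Set<String>> rightKeys) {
        List<Set<String>> result = new ArrayList<>();
        for (Set<String> left : leftKeys) {
            for (Set<String> right : rightKeys) {
                Set<String> combined = new LinkedHashSet<>(left);
                combined.addAll(right);
                result.add(combined);
            }
        }
        return result;
    }
}
```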

EquivalenceClass Property

The equivalence class property of an inner or left join adds the equivalence classes of the left source. The equivalence class property of an inner or right join adds the equivalence classes of the right source.

The equivalence class property of an inner join is then updated with any new equivalences resulting from the application of equi-join predicates, or equality conjuncts applied as filters. It follows from the above that inner joins combine the left and right source equivalence classes and that full outer joins do not propagate equivalence classes. Finally, the key property is normalized with the equivalence classes of the join, and both key and equivalence properties are projected with the join’s output attributes.

4.4.9 Other Operations

Not all plan operators propagate source properties or add new properties. A union all operator is an example of this plan operator class. This class of plan operators propagates empty properties, which operators above them in the plan might later augment. Empty property propagation is the default behavior for any plan operator that does not override the superclass method for computing logical properties. Another class of operators propagates its source properties unchanged. A sort operator is an example of this operator class. A semi-join propagates the properties of its non-filtering source.

5. Future Work

This feature extends PrestoDB with initial support for constraint optimization. The support manifests as a general-purpose optimization framework for associating logical properties with the result table produced by a plan node that optimization rules might exploit to generate more efficient plans. These logical properties initially derive from constraints defined on database tables that are further augmented and refined by grouping, limiting, predicate application, and other operations. The feature adds new optimization rules that exploit logical properties to eliminate redundant query operations. The framework introduced by this feature supports future extensions to its logical properties and optimization rules. The remainder of this section provides an overview of several possible extensions.

5.1 Foreign Key Constraints

A foreign key relates a child table to a parent table by referencing its primary key. A foreign key constraint specifies that the foreign key attributes can only contain values occurring in the referenced primary key. It represents an inclusion relationship wherein every combination of foreign key values is guaranteed to be present in the primary key. Foreign key relationships can be useful in a variety of optimizations. Lossless join optimizations represent one class of optimizations. A lossless join is a join between a child table and a parent table wherein all rows of the child table are guaranteed to join with a row in the parent table. Lossless join optimization can enable the elimination of a redundant join to the parent table [4], or can enable operations like a top n or aggregation operation to push through the join to the child table [5]. Future work will extend the logical property framework with properties and propagation logic for foreign key constraints. Moreover, additional extensions to the framework will add rules that perform lossless join transformations and other optimizations based on referential constraints.

5.2 Functional Dependencies

A functional dependency is a relationship between two sets of a table's attributes X and Y, such that if any two rows agree on the values of X they also agree on the values of Y. The relationship holds between the primary key attributes and all other attributes of a table. However, functional dependencies also arise from predicate application, function evaluation, and other query operations. A variety of optimizations exploit functional dependencies [10]. Order optimizations utilize functional dependencies to eliminate sorts or to reduce the number of sorting columns [11]. Optimizations that rewrite queries to use materialized views also exploit functional dependencies [12]. Future work will extend the logical property framework with support for a functional dependency property. That work will also include optimization rules that make use of that property, such as those that perform order optimizations.

5.3 Order Dependencies

An order dependency is a relationship between two sets of a table’s attributes X and Y, such that if the table is ordered on the values of X it is also ordered on the values of Y. This relationship holds when the values of Y are monotonically non-decreasing with respect to the values of X. Order dependencies arise naturally in databases. For example, one might exist between attributes month and quarter, or between attributes income and taxes. As with functional dependencies, sorting can be eliminated or optimized using order dependencies [13]. They can be described by integrity constraints or inferred from query operations. Straightforward extensions to the logical properties framework will enable order dependencies to be derived and propagated through various query operations. This extension will also add rules that perform order optimizations based upon order dependencies.

5.4 Constraint DDL

This work will extend PrestoDB DDL with the ability to define constraints for both managed and external tables. These language extensions will be abstracted to enable the definition of constraints for various backend systems whose constraint implementations might differ in terms of the aspects of the SQL standard they implement and the source-specific extensions they provide.

References 

[1] “Constraints - Oracle.” https://docs.oracle.com/cd/B19306_01/server.102/b14200/clauses002.htm
[2] “Constraints in Query Optimization - IBM.” https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5.0/com.ibm.db2.luw.admin.perf.doc/doc/c0055081.html
[3] “Distinct Elimination - Oracle.” https://docs.oracle.com/javadb/10.8.3.0/tuning/ctuntransform16279.html
[4] “Redundant Join Elimination - Oracle.” https://oracle-base.com/articles/misc/join-elimination
[5] “On saying ‘Enough already!’ in SQL.” https://dl.acm.org/doi/10.1145/253262.253302
[6] “Enforced Constraints - IBM.” https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/intro/src/tpc/db2z_db2enforcesreferentialconstraints.html
[7] “Informational Constraints - IBM.” https://www.ibm.com/support/producthub/db2/docs/content/SSEPGG_11.5.0/com.ibm.db2.luw.admin.dbobj.doc/doc/c0023324.html
[8] “Hive Data Definition Language.” https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
[9] “Class HiveMetaStoreClient.” https://hive.apache.org/javadocs/r3.1.2/api/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.html
[10] Tompa, F. “Exploiting Functional Dependence in Query Optimization.” https://www.semanticscholar.org/paper/Exploiting-functional-dependence-in-query-Tompa-Larson/c38559e636398a4bc14faa5554c0fb996256dcfc
[11] Simmen, D. “Fundamental Techniques for Order Optimization.” https://cs.uwaterloo.ca/~gweddell/cs798/p57-simmen.pdf
[12] “Advanced Query Rewrite for Materialized Views.” https://docs.oracle.com/database/121/DWHSG/qradv.htm#DWHSG08022
[13] Szlichta, J. “Fundamentals of Order Dependencies.” https://arxiv.org/pdf/1208.0084.pdf

Implementation

The work-in-progress PR is https://github.com/prestodb/presto/pull/16416. Note that it is not ready for review yet.

yingsu00 commented 3 years ago

cc @rongrong @kaikalur @rschlussel @arhimondr

yingsu00 commented 3 years ago

cc @biswapesh @oerling

mbasmanova commented 2 years ago

@simmend David, this work is very interesting. Curious if there are any follow-ups in progress.