prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
15.82k stars 5.31k forks source link

Cost based CTE materialization #21636

Open jaystarshot opened 7 months ago

jaystarshot commented 7 months ago

Currently only strategy CteMaterializationStrategy.ALL is supported i.e all ctes will be materialized. We need to support a cost based strategy which uses plan cost to determine materializable ctes.

The action items are the following

Extends https://github.com/prestodb/presto/issues/19744

jaystarshot commented 2 months ago

Trying to brainstorm some ideas on how to do a cost based implementation

Currently, the logicalCteOptimizer is situated high within the planning hierarchy, a level where historical statistics cannot be used since there is no tracking before that and the plan is a lot different than the point where historical stats are actually linked to execution. Historical statistics become accessible at this point in the planning process.

We cannot simple move the logicalCteOptimizer below this level as it would result in suboptimal planning for scenarios involving non-materialized cases. This is because optimizers would then be unable to push operations below the CTE reference nodes.

prototype for 1. link

  1. A simple use case could be to consider cost as a form of feedback. This would be to have the logicalCteOptimizer develop a CTE plan using a heuristic strategy and save the cte materialized plan in the session. Subsequently, we would execute planning for both the non-CTE and CTE-materialized plans up to the HistoricalStatisticsEquivalentPlanMarkingOptimizer. Once we have the accurate hbo hashes. we can assess the effectiveness of the CTE-materialized plan and, based on cost considerations, choose to either continue with it or fallback to the non-CTE plan. This method should be relatively simple to implement.

  2. Figure out a way to run the logicalCteOptimizer below the point where we have accurate historical stats (directly or indirectly)

cc: @feilong-liu