Caching stats information to optimise Query Planning time

agrawalreetika commented 1 year ago

Creating this issue around findings for optimization scope in Iceberg Connector -

Presto calls Iceberg Scan::planFiles (https://github.com/apache/iceberg/blob/apache-iceberg-1.3.1/api/src/main/java/org/apache/iceberg/Scan.java#L155) at multiple places which is one of the contributing factor for long planning time while running queries with Iceberg connector like,

When computing table stats during query planning
When generating splits during query execution

For example (Tpch Q4):

SELECT
  o.orderpriority,
  count(*) AS order_count
FROM
  iceberg.tpch.orders o
WHERE 
  o.orderdate >= DATE '1993-07-01'
  AND o.orderdate < DATE '1993-07-01' + INTERVAL '3' MONTH
  AND EXISTS (
    SELECT
      *
    FROM
      iceberg.tpch.lineitem l
    WHERE
      l.orderkey = o.orderkey
      AND l.commitdate < l.receiptdate
  )
GROUP BY
  o.orderpriority
ORDER BY
  o.orderpriority;

For this query Scan::planFiles is being called 13 times for above query. Specifically, ConnectorMetadata:getTableStatistics is getting called 11 times for the above query. 6 times for lineitems and 5 times for order tables. Here, IcebergMetadata::getTableStatistics in turn iterates over planFiles to compute table stats.

I think we can optimize these calls from getTableStatistics if we can Cache the stats and reuse them for tables.

ZacBlanco commented 1 year ago

@agrawalreetika re: your comment here in did you have any idea for how we would create a generic caching system for all of the lakehouse connectors? ATM I'm not sure it is very doable with having 3 separate connectors for hudi, delta, and iceberg. Considering the work I have going on in #20614, I'd like to implement the stats cache sooner rather than later. How critical is it to create a generic implementation versus an iceberg specific one?

I can probably put together an Iceberg-specific one pretty quickly. I have a prototype in another branch already

agrawalreetika commented 1 year ago

Hi @ZacBlanco, I don't have any design around a generic stats caching system. The idea was, I see in presto for Alluxio data caching there is a common implementation being leveraged for difference lakehosuse connector. So I was thinking if we could have a Cache provider in a similar common package and leverage it across. Great, if you already have the Iceberg-specific implementation already. Maybe you can put the details out with the changes and others can also provide input on this?

tdcmeehan commented 8 months ago

Is this issue still relevant with manifest caching and table caching recently implemented?

agrawalreetika commented 8 months ago

The part about reducing redundant metadata calls using Iceberg Table object is handled by recent implementation and we have already seen improvement in the planning time in the Iceberg Connector.

prestodb / presto

Caching stats information to optimise Query Planning time #20725