trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Reducing Iceberg TableScan::planFiles calls #11708

Open lxynov opened 2 years ago

lxynov commented 2 years ago

Creating this issue to share findings and as a reference thread for further discussion. cc @findepi @electrum @phd3


It'll be beneficial to reduce Iceberg TableScan::planFiles calls, because the operation is expensive: each call reads and processes the table's metadata files, and the calls run in a shared worker pool that isn't managed by Trino (more on both points below).

Trino calls TableScan::planFiles in several places; most notably, IcebergMetadata::getTableStatistics iterates over planFiles to compute table statistics during query planning.

I think we have a few ways to improve on the status quo. As a starting point, we can try reducing planFiles calls during query planning. Specifically, during planning, the cost-based optimizer calls IcebergMetadata::getTableStatistics to get table stats, which in turn iterates over planFiles to compute them. We have observed redundant planFiles calls. As an example, planning the query below calls planFiles 5 times on customer, 5 times on store_sales, and 4 times on store, whereas, in theory, a single call per table should suffice.

SELECT
  count(1)
FROM
  iceberg.tpcds_sf100000.customer,
  iceberg.tpcds_sf100000.store_sales,
  iceberg.tpcds_sf100000.store
WHERE
  c_customer_sk = ss_customer_sk
  AND ss_store_sk = s_store_sk;

The redundant getTableStatistics calls happen because:

  • CachingStatsProvider is created per optimizer Rule. So its cache is only per rule iteration.

findepi commented 2 years ago
  • CachingStatsProvider is created per optimizer Rule. So its cache is only per rule iteration.

It is, but it's linked to Memo, so the cache should be for the IterativeOptimizer run. It's still constructed many times over, as there are many IterativeOptimizer instances, but most of these instances never ask for stats. Certainly room for improvement still.

In the above query, how many calls to getTableStatistics do you see? Are the tables partitioned? Unless they are, there are no predicates to be inferred and pushed into the connector, so there is only one set of statistics per table that we could get, and we should be able to remove the redundant calls.

@lxynov Measuring is said to be the best way to improve things. Would you like to contribute a test that counts expensive IO or metadata-processing operations? Perhaps we can enhance TestIcebergMetadataFileOperations for this.
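For illustration, the kind of counting such a test needs can be as simple as wrapping the expensive operation behind a counter and asserting on the call count. This is a generic sketch only, not the actual TestIcebergMetadataFileOperations infrastructure:

import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Generic counting wrapper: delegates the call and records how many times it
// ran, so a test can assert an exact number of planFiles invocations.
final class CountingSupplier<T>
        implements Supplier<T>
{
    private final Supplier<T> delegate;
    private final AtomicLong calls = new AtomicLong();

    CountingSupplier(Supplier<T> delegate)
    {
        this.delegate = delegate;
    }

    @Override
    public T get()
    {
        calls.incrementAndGet();
        return delegate.get();
    }

    long calls()
    {
        return calls.get();
    }
}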

cc @alexjo2144 @phd3 @findinpath @homar

osscm commented 2 years ago

They are executed in a shared Iceberg worker pool. It's shared among queries so high concurrency of metadata-heavy queries can make things worse. And this thread pool is not managed by Trino so its memory usage is not tracked.

I have also noticed this. Note that in Iceberg the pool size is configurable only through a JVM system property, not through an Iceberg config:

https://github.com/apache/iceberg/blob/a540d9c56bc88f9ef3a1a4df9d9d7b1d85dcf91e/core/src/main/java/org/apache/iceberg/util/ThreadPools.java#L64-L74

If this could be configured at the catalog level (at least), would that help with metadata-heavy queries?

lxynov commented 2 years ago

Thanks @findepi.

Previous experiment

In the above query, how many calls to getTableStatistics do you see? Are the tables partitioned?

  • 5 times on iceberg.tpcds_sf100000.customer; 5 times on iceberg.tpcds_sf100000.store_sales; 4 times on iceberg.tpcds_sf100000.store
  • No, they're not partitioned.
  • FYI, these tables were generated by copying the TPCDS tables to Iceberg. The copying SQLs can be found here. SF100000 is used because the CBO's behavior is affected by table sizes.

Tests for measurement

Added two tests in #11858.

Small change that fixes stats/cost caching of group-referenced plan nodes

Added in #11858.

Future work

  • ReorderJoins doesn't seem to work at all with the Iceberg connector; from a cursory look, it requires "number of distinct values" stats.
  • Architecturally speaking, Iceberg tables' potentially huge metadata may pose a challenge to the current single-coordinator metadata processing model.

lxynov commented 2 years ago

if this can be configured at the catalog level (at least), can help for the heavy queries?

@osscm FYI you can also configure this by setting a JVM flag -Diceberg.worker.num-threads=10 when launching Trino.
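For anyone trying this: the flag has to be set at JVM launch (Iceberg's pool is created statically), so it goes into Trino's JVM config file, assuming a standard Trino deployment layout:

# etc/jvm.config (append to the existing flags)
-Diceberg.worker.num-threads=10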

findepi commented 2 years ago

FYI you can also configure this by setting a JVM flag -Diceberg.worker.num-threads=10 when launching Trino

We should have a way to control this via a Trino catalog config property. This would require the pool not to be statically initialized. @lxynov can you please file a separate issue about this? cc @danielcweeks @kbendick

let's just do this

Small change that fixes stats/cost caching of group-referenced plan nodes: added in https://github.com/trinodb/trino/pull/11858.

awesome!

As a matter of fact, ReorderJoins doesn't seem to work at all with Iceberg Connector. From a cursory look, it requires the "number of distinct values" i

i am working on this

Architecturally speaking, Iceberg tables' potentially huge metadata may pose a challenge to the current single-coordinator metadata processing model.

Agreed!

We have had some planning performance challenges with the Delta Lake connector. So far we have been able to mitigate them effectively, but we also considered re-architecting the connector to allow "parallel planning" (more precisely: deferring table scan planning to workers, so we can execute more things in parallel).

For Iceberg, it's too early to tell whether we will need something like this, but it's something we need to keep an eye on.

cc @claudiusli @alexjo2144 @ilfrin

osscm commented 2 years ago

FYI you can also configure this by setting a JVM flag -Diceberg.worker.num-threads=10 when launching Trino

We should have a way to control this via a Trino catalog config property. This would require the pool not to be statically initialized. @lxynov can you please file a separate issue about this?

@findepi @lxynov added this issue: https://github.com/trinodb/trino/issues/11920
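For context, here is a hypothetical sketch of what a non-static, per-catalog planning pool could look like. The class and the config property are invented for illustration; this is not an actual Trino or Iceberg API:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical: one pool per catalog, sized from a catalog config property
// (say, an invented "iceberg.planning-threads"), instead of Iceberg's
// static, JVM-wide worker pool.
public final class CatalogPlanningPool
        implements AutoCloseable
{
    private final ExecutorService executor;

    public CatalogPlanningPool(int threads)
    {
        this.executor = Executors.newFixedThreadPool(threads);
    }

    public ExecutorService executor()
    {
        return executor;
    }

    @Override
    public void close()
    {
        executor.shutdownNow();
    }
}

This would also let Trino track and bound the pool per catalog, instead of sharing one untracked pool across all queries.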

lxynov commented 2 years ago

@findepi

Iceberg metadata keeps track of distinct counts only at the data-file level (https://iceberg.apache.org/spec/). I think that makes sense, because maintaining table-level distinct counts is expensive for tables that are ingested daily or hourly.

So, as things stand, the CBO doesn't work well with Iceberg tables: it needs distinct counts to compute join output costs, etc.

Just to reiterate how expensive Iceberg TableScan::planFiles calls can be: when called by getTableStatistics, no predicate is passed, so a table's entire metadata files are downloaded and processed, even if the actual query has a predicate. And for frequently ingested fact tables, the total size of the metadata files can be huge. So maybe we should avoid Trino making getTableStatistics calls to Iceberg tables.
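To make the difference concrete, here is a sketch against the Iceberg API (column name borrowed from the query above; task processing elided):

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.io.CloseableIterable;

public class PlanFilesExample
{
    // Contrast a pruned scan with the unpruned scan that getTableStatistics
    // effectively performs.
    static void compareScans(Table table) throws Exception
    {
        // With a predicate, Iceberg can prune manifests and data files using
        // partition values and column bounds before reading them.
        TableScan pruned = table.newScan().filter(Expressions.equal("ss_store_sk", 42));
        try (CloseableIterable<FileScanTask> tasks = pruned.planFiles()) {
            for (FileScanTask task : tasks) {
                // inspect task
            }
        }

        // Without a predicate, every manifest is downloaded and processed.
        try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
            for (FileScanTask task : tasks) {
                // inspect task
            }
        }
    }
}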

raunaqmorarka commented 2 years ago

we should avoid Trino making getTableStatistics calls to Iceberg tables

Can't we already do that by setting iceberg.table-statistics-enabled to false?

more naive and less rigorous CBO that works without distinct counts

The connector can choose to skip providing NDVs, if that is an expensive operation, and provide only min/max, row count, and null counts. The CBO will still be able to reorder some joins with this info. Would it be a cheaper operation in Iceberg to provide those things and skip NDVs?

lxynov commented 2 years ago

Can't we already do that by setting iceberg.table-statistics-enabled to false ?

We can, but it requires people to do that explicitly. We could change the default value to false. However, that's still not perfect: when people set it to true, they expect the CBO to work well with Iceberg tables, whereas it doesn't.

The CBO will still be able to reorder some joins with this info.

I believe the current CBO requires NDVs to work. It needs them to calculate the output stats of a JoinNode. https://github.com/trinodb/trino/blob/fbe6079f37f943849542c2835950514dbeb1ddf7/core/trino-main/src/main/java/io/trino/cost/ComparisonStatsCalculator.java#L194-L196
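For reference, the estimate there is essentially the textbook equi-join cardinality formula (my paraphrase, not the exact Trino code):

  rows(R JOIN S ON R.a = S.b) ≈ rows(R) * rows(S) / max(NDV(a), NDV(b))

Without NDVs, the denominator is unknown, so there is no principled way to estimate the join output size.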

Would it be a cheaper operation in Iceberg to provide those things and skip NDVs?

It's still not a cheap operation in Iceberg. But Iceberg is able to provide those values, whereas it is unable to provide NDVs.

lxynov commented 2 years ago

@findepi @raunaqmorarka

What do you think of these two ideas?

  • Caching table statistics, so that redundant getTableStatistics calls become cheap.
  • A more naive and less rigorous CBO that works without distinct counts.

I can try prototyping them if they look good.

raunaqmorarka commented 2 years ago

I believe the current CBO requires NDVs to work. It needs them to calculate the output stats of a JoinNode.

See DetermineJoinDistributionType#getSizeBasedJoin; that should work in the absence of NDVs as well.

I don't like the idea of a more naive and less rigorous CBO that works without distinct counts. DetermineJoinDistributionType already fulfils that purpose to some extent. Any more guesswork than that could lead to wild estimates which create worse plans than the syntactic join order.
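For context, the size-based path is driven by session properties; a usage sketch (property names as in current Trino):

SET SESSION join_distribution_type = 'AUTOMATIC';
SET SESSION join_max_broadcast_table_size = '100MB';

With only row-count and size estimates, DetermineJoinDistributionType can pick between broadcast and partitioned joins and flip the build/probe sides.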

Since @findepi is already planning to add NDV stats to Iceberg in the near future, I would vote for caching statistics, or whatever else it takes to make getting statistics cheaper.

lxynov commented 2 years ago

See DetermineJoinDistributionType#getSizeBasedJoin; that should work in the absence of NDVs as well.

Got it. I believe DetermineJoinDistributionType is actually the "more naive and less rigorous CBO" that I was talking about. Previously, I was mostly looking at ReorderJoins.

add NDV stats to iceberg in near future

I don't think table-level NDV stats in Iceberg are a near-term thing. They require modifying the Iceberg spec. And for frequently ingested tables, it's expensive to recalculate NDV stats every time an ingestion happens.

I would vote for caching statistics or whatever else it takes to make getting statistics cheaper.

I'll work on it.
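A minimal sketch of the direction, assuming a per-query cache keyed by table (names invented; a real fix would need to respect snapshot identity and live at the right layer of Trino's planner):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical per-query cache: compute statistics once per table and reuse
// the result across optimizer rules within the same query.
final class PerQueryStatsCache<K, S>
{
    private final Map<K, S> cache = new ConcurrentHashMap<>();
    private final Function<K, S> loader;

    PerQueryStatsCache(Function<K, S> loader)
    {
        this.loader = loader;
    }

    S get(K tableKey)
    {
        return cache.computeIfAbsent(tableKey, loader);
    }
}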

findepi commented 2 years ago

For NDVs for Iceberg, see:

  • https://docs.google.com/document/d/1we0BuQbbdqiJS2eUFC_-6TPSuO57GXivzKmcTzApivY/edit
  • https://github.com/apache/iceberg-docs/pull/69
  • https://github.com/apache/iceberg/pull/4537

Admittedly, it's not as fast moving as I would want this to be, but hopefully it doesn't get stalled.

For the short term, disabling table stats is a good workaround. For the longer term, we should plan assuming we have table-level NDVs, so that we don't need a "more naive CBO" that "works" in the absence of NDVs.
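As a usage sketch of that workaround (file path assumes a standard Trino deployment):

# etc/catalog/iceberg.properties
connector.name=iceberg
iceberg.table-statistics-enabled=false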

lxynov commented 2 years ago

@findepi Awesome! Thank you for working on that. Looking forward to getting it in Iceberg!

I believe that, at least for two-way joins, DetermineJoinDistributionType is indeed the "more naive CBO" I was talking about. It also helps flip sides. https://github.com/trinodb/trino/blob/1e04833d9092767d87bec07b8b3429c42b8d50a2/core/trino-main/src/main/java/io/trino/sql/planner/iterative/rule/DetermineJoinDistributionType.java#L293-L315