trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Reducing Iceberg TableScan::planFiles calls #11708

Open lxynov opened 2 years ago

lxynov commented 2 years ago

Creating this issue to share findings and as a reference thread for further discussion. cc @findepi @electrum @phd3


It'll be beneficial to reduce Iceberg TableScan::planFiles calls, because the operation is expensive: each call reads and processes the table's metadata files, and the calls run in a shared worker pool that isn't managed by Trino (more on both points below).

Trino calls TableScan::planFiles in several places; most notably, IcebergMetadata::getTableStatistics iterates over planFiles to compute table statistics during query planning.

I think we have a few ways to improve on the status quo. As a starting point, we can try reducing planFiles calls during query planning. Specifically, during planning, the cost-based optimizer calls IcebergMetadata::getTableStatistics to get table stats, which in turn iterates over planFiles to compute them. We have observed redundant planFiles calls. As an example, planning the query below calls planFiles 5 times on customer, 5 times on store_sales, and 4 times on store, whereas, in theory, a single call per table should suffice.

SELECT
  count(1)
FROM
  iceberg.tpcds_sf100000.customer,
  iceberg.tpcds_sf100000.store_sales,
  iceberg.tpcds_sf100000.store
WHERE
  c_customer_sk = ss_customer_sk
  AND ss_store_sk = s_store_sk;

The redundant getTableStatistics calls happen because:

  • CachingStatsProvider is created per optimizer Rule. So its cache is only per rule iteration.

findepi commented 2 years ago
  • CachingStatsProvider is created per optimizer Rule. So its cache is only per rule iteration.

It is, but it's linked to Memo, so the cache should be for the IterativeOptimizer run. It's still constructed many times over, as there are many IterativeOptimizer instances, but most of these instances never ask for stats. Certainly room for improvement still.

In the above query, how many calls to getTableStatistics do you see? Are the tables partitioned? Unless they are, there are no predicates to be inferred and pushed into the connector, so there is only one set of statistics per table that we could get, and we should be able to remove the redundant calls.

@lxynov Measuring is said to be the best way to improve things. Would you like to contribute a test that counts expensive IO or metadata-processing operations? Perhaps we can enhance TestIcebergMetadataFileOperations for this.
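For illustration, the kind of counting such a test needs can be as simple as wrapping the expensive operation behind a counter and asserting on the call count. This is a generic sketch only, not the actual TestIcebergMetadataFileOperations infrastructure:

import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Generic counting wrapper: delegates the call and records how many times it
// ran, so a test can assert an exact number of planFiles invocations.
final class CountingSupplier<T>
        implements Supplier<T>
{
    private final Supplier<T> delegate;
    private final AtomicLong calls = new AtomicLong();

    CountingSupplier(Supplier<T> delegate)
    {
        this.delegate = delegate;
    }

    @Override
    public T get()
    {
        calls.incrementAndGet();
        return delegate.get();
    }

    long calls()
    {
        return calls.get();
    }
}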

cc @alexjo2144 @phd3 @findinpath @homar

osscm commented 2 years ago

They are executed in a shared Iceberg worker pool. It's shared among queries so high concurrency of metadata-heavy queries can make things worse. And this thread pool is not managed by Trino so its memory usage is not tracked.

I have also noticed this. Note that in Iceberg the pool size is configurable only through a JVM system property, not through an Iceberg config:

https://github.com/apache/iceberg/blob/a540d9c56bc88f9ef3a1a4df9d9d7b1d85dcf91e/core/src/main/java/org/apache/iceberg/util/ThreadPools.java#L64-L74

If this could be configured at the catalog level (at least), would that help with metadata-heavy queries?

lxynov commented 2 years ago

Thanks @findepi.

Previous experiment

In the above query, how many calls to getTableStatistics do you see? Are the tables partitioned?

  • 5 times on iceberg.tpcds_sf100000.customer; 5 times on iceberg.tpcds_sf100000.store_sales; 4 times on iceberg.tpcds_sf100000.store
  • No, they're not partitioned.
  • FYI, these tables were generated by copying the TPCDS tables to Iceberg. The copying SQLs can be found here. SF100000 is used because the CBO's behavior is affected by table sizes.

Tests for measurement

Added two tests in #11858.

Small change that fixes stats/cost caching of group-referenced plan nodes

Added in #11858.

Future work

  • ReorderJoins doesn't seem to work at all with the Iceberg connector; from a cursory look, it requires "number of distinct values" stats.
  • Architecturally speaking, Iceberg tables' potentially huge metadata may pose a challenge to the current single-coordinator metadata processing model.

lxynov commented 2 years ago

if this can be configured at the catalog level (at least), can help for the heavy queries?

@osscm FYI you can also configure this by setting a JVM flag -Diceberg.worker.num-threads=10 when launching Trino.
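For anyone trying this: the flag has to be set at JVM launch (Iceberg's pool is created statically), so it goes into Trino's JVM config file, assuming a standard Trino deployment layout:

# etc/jvm.config (append to the existing flags)
-Diceberg.worker.num-threads=10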

findepi commented 2 years ago

FYI you can also configure this by setting a JVM flag -Diceberg.worker.num-threads=10 when launching Trino

We should have a way to control this via a Trino catalog config property. This would require the pool not to be statically initialized. @lxynov can you please file a separate issue about this? cc @danielcweeks @kbendick

let's just do this

Small change that fixes stats/cost caching of group-referenced plan nodes: added in https://github.com/trinodb/trino/pull/11858.

awesome!

As a matter of fact, ReorderJoins doesn't seem to work at all with Iceberg Connector. From a cursory look, it requires the "number of distinct values" i

i am working on this

Architecturally speaking, Iceberg tables' potentially huge metadata may pose a challenge to the current single-coordinator metadata processing model.

Agreed!

We have had some planning performance challenges with the Delta Lake connector. So far we have been able to mitigate them effectively, but we also considered re-architecting the connector to allow "parallel planning" (more precisely: deferring table scan planning to workers, so we can execute more things in parallel).

For Iceberg, it's too early to tell whether we will need something like this, but it's something we need to keep an eye on.

cc @claudiusli @alexjo2144 @ilfrin

osscm commented 2 years ago

FYI you can also configure this by setting a JVM flag -Diceberg.worker.num-threads=10 when launching Trino

We should have a way to control this via a Trino catalog config property. This would require the pool not to be statically initialized. @lxynov can you please file a separate issue about this?

@findepi @lxynov added this issue: https://github.com/trinodb/trino/issues/11920
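For context, here is a hypothetical sketch of what a non-static, per-catalog planning pool could look like. The class and the config property are invented for illustration; this is not an actual Trino or Iceberg API:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical: one pool per catalog, sized from a catalog config property
// (say, an invented "iceberg.planning-threads"), instead of Iceberg's
// static, JVM-wide worker pool.
public final class CatalogPlanningPool
        implements AutoCloseable
{
    private final ExecutorService executor;

    public CatalogPlanningPool(int threads)
    {
        this.executor = Executors.newFixedThreadPool(threads);
    }

    public ExecutorService executor()
    {
        return executor;
    }

    @Override
    public void close()
    {
        executor.shutdownNow();
    }
}

This would also let Trino track and bound the pool per catalog, instead of sharing one untracked pool across all queries.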

lxynov commented 2 years ago

@findepi

Iceberg metadata keeps track of distinct counts only at the data-file level (https://iceberg.apache.org/spec/). I think that makes sense, because maintaining table-level distinct counts is expensive for tables that are ingested daily or hourly.

So, as things stand, the CBO doesn't work well with Iceberg tables: it needs distinct counts to compute join output costs, etc.

Just to reiterate how expensive Iceberg TableScan::planFiles calls can be: when called by getTableStatistics, no predicate is passed, so a table's entire metadata files are downloaded and processed, even if the actual query has a predicate. And for frequently ingested fact tables, the total size of the metadata files can be huge. So maybe we should avoid Trino making getTableStatistics calls to Iceberg tables.
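To make the difference concrete, here is a sketch against the Iceberg API (column name borrowed from the query above; task processing elided):

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.io.CloseableIterable;

public class PlanFilesExample
{
    // Contrast a pruned scan with the unpruned scan that getTableStatistics
    // effectively performs.
    static void compareScans(Table table) throws Exception
    {
        // With a predicate, Iceberg can prune manifests and data files using
        // partition values and column bounds before reading them.
        TableScan pruned = table.newScan().filter(Expressions.equal("ss_store_sk", 42));
        try (CloseableIterable<FileScanTask> tasks = pruned.planFiles()) {
            for (FileScanTask task : tasks) {
                // inspect task
            }
        }

        // Without a predicate, every manifest is downloaded and processed.
        try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
            for (FileScanTask task : tasks) {
                // inspect task
            }
        }
    }
}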

raunaqmorarka commented 2 years ago

we should avoid Trino making getTableStatistics calls to Iceberg tables

Can't we already do that by setting iceberg.table-statistics-enabled to false?

more naive and less rigorous CBO that works without distinct counts

The connector can choose to skip providing NDVs, if that is an expensive operation, and provide only min/max, row count, and null counts. The CBO will still be able to reorder some joins with this info. Would it be a cheaper operation in Iceberg to provide those things and skip NDVs?

lxynov commented 2 years ago

Can't we already do that by setting iceberg.table-statistics-enabled to false ?

We can, but it requires people to do that explicitly. We could change the default value to false. However, that's still not perfect: when people set it to true, they expect the CBO to work well with Iceberg tables, whereas it doesn't.

The CBO will still be able to reorder some joins with this info.

I believe the current CBO requires NDVs to work. It needs them to calculate the output stats of a JoinNode. https://github.com/trinodb/trino/blob/fbe6079f37f943849542c2835950514dbeb1ddf7/core/trino-main/src/main/java/io/trino/cost/ComparisonStatsCalculator.java#L194-L196
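For reference, the estimate there is essentially the textbook equi-join cardinality formula (my paraphrase, not the exact Trino code):

  rows(R JOIN S ON R.a = S.b) ≈ rows(R) * rows(S) / max(NDV(a), NDV(b))

Without NDVs, the denominator is unknown, so there is no principled way to estimate the join output size.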

Would it be a cheaper operation in Iceberg to provide those things and skip NDVs?

It's still not a cheap operation in Iceberg. But Iceberg is able to provide those values, whereas it is unable to provide NDVs.

lxynov commented 2 years ago

@findepi @raunaqmorarka

What do you think of these two ideas?

  • Caching table statistics, so that redundant getTableStatistics calls become cheap.
  • A more naive and less rigorous CBO that works without distinct counts.

I can try prototyping them if they look good.

raunaqmorarka commented 2 years ago

I believe the current CBO requires NDVs to work. It needs them to calculate the output stats of a JoinNode.

See DetermineJoinDistributionType#getSizeBasedJoin; that should work in the absence of NDVs as well.

I don't like the idea of a more naive and less rigorous CBO that works without distinct counts. DetermineJoinDistributionType already fulfils that purpose to some extent. Any more guesswork than that could lead to wild estimates which create worse plans than the syntactic join order.
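For context, the size-based path is driven by session properties; a usage sketch (property names as in current Trino):

SET SESSION join_distribution_type = 'AUTOMATIC';
SET SESSION join_max_broadcast_table_size = '100MB';

With only row-count and size estimates, DetermineJoinDistributionType can pick between broadcast and partitioned joins and flip the build/probe sides.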

Since @findepi is already planning to add NDV stats to Iceberg in the near future, I would vote for caching statistics, or whatever else it takes to make getting statistics cheaper.

lxynov commented 2 years ago

See DetermineJoinDistributionType#getSizeBasedJoin; that should work in the absence of NDVs as well.

Got it. I believe DetermineJoinDistributionType is actually the "more naive and less rigorous CBO" that I was talking about. Previously, I was mostly looking at ReorderJoins.

add NDV stats to iceberg in near future

I don't think table-level NDV stats in Iceberg are a near-term thing. They require modifying the Iceberg spec. And for frequently ingested tables, it's expensive to recalculate NDV stats every time an ingestion happens.

I would vote for caching statistics or whatever else it takes to make getting statistics cheaper.

I'll work on it.
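A minimal sketch of the direction, assuming a per-query cache keyed by table (names invented; a real fix would need to respect snapshot identity and live at the right layer of Trino's planner):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical per-query cache: compute statistics once per table and reuse
// the result across optimizer rules within the same query.
final class PerQueryStatsCache<K, S>
{
    private final Map<K, S> cache = new ConcurrentHashMap<>();
    private final Function<K, S> loader;

    PerQueryStatsCache(Function<K, S> loader)
    {
        this.loader = loader;
    }

    S get(K tableKey)
    {
        return cache.computeIfAbsent(tableKey, loader);
    }
}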

findepi commented 2 years ago

For NDVs for Iceberg, see:

  • https://docs.google.com/document/d/1we0BuQbbdqiJS2eUFC_-6TPSuO57GXivzKmcTzApivY/edit
  • https://github.com/apache/iceberg-docs/pull/69
  • https://github.com/apache/iceberg/pull/4537

Admittedly, it's not as fast moving as I would want this to be, but hopefully it doesn't get stalled.

For the short term, disabling table stats is a good workaround. For the longer term, we should plan assuming we have table-level NDVs, so that we don't need a "more naive CBO" that "works" in the absence of NDVs.
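As a usage sketch of that workaround (file path assumes a standard Trino deployment):

# etc/catalog/iceberg.properties
connector.name=iceberg
iceberg.table-statistics-enabled=false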

lxynov commented 2 years ago

@findepi Awesome! Thank you for working on that. Looking forward to getting it in Iceberg!

I believe that, at least for two-way joins, DetermineJoinDistributionType is indeed the "more naive CBO" I was talking about. It also helps flip sides. https://github.com/trinodb/trino/blob/1e04833d9092767d87bec07b8b3429c42b8d50a2/core/trino-main/src/main/java/io/trino/sql/planner/iterative/rule/DetermineJoinDistributionType.java#L293-L315