trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Iceberg queries stuck in the planning phase for a long time #23451

Closed: sriharshaj closed this issue 1 month ago

sriharshaj commented 1 month ago

Our Iceberg queries are getting stuck in the planning phase for 2 to 3 minutes, although they eventually run successfully. We are currently upgrading Trino from version 444 to 454. This issue has been occurring since Trino version 451 (we went back and tried different versions).

Our Iceberg stack:
- Storage: S3
- File format: Parquet
- Catalog: Hive

Query explain:

 Trino version: 454
 Fragment 0 [SINGLE]
     Output layout: [expr]
     Output partitioning: SINGLE []
     Output[columnNames = [_col0]]
     │   Layout: [expr:integer]
     │   Estimates: {rows: 1 (5B), cpu: 0, memory: 0B, network: 0B}
     │   _col0 := expr
     └─ Project[]
        │   Layout: [expr:integer]
        │   Estimates: {rows: 1 (5B), cpu: 5, memory: 0B, network: 0B}
        │   expr := integer '1'
        └─ Limit[count = 1]
           │   Layout: []
           │   Estimates: {rows: 1 (0B), cpu: 0, memory: 0B, network: 0B}
           └─ LocalExchange[partitioning = SINGLE]
              │   Layout: []
              │   Estimates: {rows: 1 (0B), cpu: 0, memory: 0B, network: 0B}
              └─ RemoteSource[sourceFragmentIds = [1]]
                     Layout: []

 Fragment 1 [SOURCE]
     Output layout: []
     Output partitioning: SINGLE []
     LimitPartial[count = 1]
     │   Layout: []
     │   Estimates: {rows: 1 (0B), cpu: 0, memory: 0B, network: 0B}
     └─ TableScan[table = table_name$data@2183018889107447028 constraint on [dt] LIMIT 1]
            Layout: []
            Estimates: {rows: 8499944 (0B), cpu: 0, memory: 0B, network: 0B}
            83:dt:varchar
                :: [[2024-08-30T13:00:00Z, 2024-08-30T14:00:00Z)]

(1 row)

Query 20240830_150248_35329_wgfii, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
3:23 [0 rows, 0B] [0 rows/s, 0B/s]

Query info: https://gist.github.com/sriharshaj/f26e655f233b84754da8216be2ae0172

wendigo commented 1 month ago

This could be related to https://github.com/trinodb/trino/issues/23384

wendigo commented 1 month ago

We will release a new Trino version this week. I'll postpone further investigation so we can check whether the fix in this new version helps.

sriharshaj commented 1 month ago

@wendigo We tested Trino version 458, and while the planning time has improved, it remains unusually high, ranging between 30 to 50 seconds.

wendigo commented 1 month ago

@sriharshaj do you know what accounts for this number? Metadata retrieval from storage? Metastore calls?

sriharshaj commented 1 month ago

Here are the optimizer summaries.

   "optimizerRulesSummaries": [
      {
        "rule": "io.trino.sql.planner.optimizations.AddExchanges",
        "invocations": 1,
        "applied": 1,
        "totalTime": 5120641837,
        "failures": 0
      },
      {
        "rule": "io.trino.sql.planner.iterative.rule.DetermineTableScanNodePartitioning",
        "invocations": 3,
        "applied": 1,
        "totalTime": 4029729665,
        "failures": 0
      },
      {
        "rule": "io.trino.sql.planner.iterative.rule.PushPredicateIntoTableScan",
        "invocations": 1,
        "applied": 1,
        "totalTime": 3586457350,
        "failures": 0
      },
      {
        "rule": "io.trino.sql.planner.iterative.rule.ExpressionRewriteRuleSet.FilterExpressionRewrite",
        "invocations": 58,
        "applied": 1,
        "totalTime": 2148574,
        "failures": 0
      },
      {
        "rule": "io.trino.sql.planner.iterative.rule.ExpressionRewriteRuleSet.ProjectExpressionRewrite",
        "invocations": 92,
        "applied": 0,
        "totalTime": 921214,
        "failures": 0
      },
      {
        "rule": "io.trino.sql.planner.optimizations.PredicatePushDown",
        "invocations": 7,
        "applied": 7,
        "totalTime": 442238,
        "failures": 0
      },
      {
        "rule": "io.trino.sql.planner.iterative.rule.PushLimitIntoTableScan",
        "invocations": 4,
        "applied": 1,
        "totalTime": 293083,
        "failures": 0
      },
      {
        "rule": "io.trino.sql.planner.iterative.rule.PruneTableScanColumns",
        "invocations": 7,
        "applied": 1,
        "totalTime": 273248,
        "failures": 0
      },
      {
        "rule": "io.trino.sql.planner.iterative.rule.PruneOutputSourceColumns",
        "invocations": 16,
        "applied": 1,
        "totalTime": 268922,
        "failures": 0
      },
      {
        "rule": "io.trino.sql.planner.iterative.rule.PruneProjectColumns",
        "invocations": 7,
        "applied": 4,
        "totalTime": 224337,
        "failures": 0
      }
    ],

Is there a way to analyze why the optimizers are taking so long? Additionally, where can I find details on metadata retrieval and Metastore calls?
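
For reference, a quick back-of-the-envelope check on the numbers above. It assumes the totalTime values are reported in nanoseconds (an assumption based on their magnitude, not confirmed from the Trino source):

    import java.util.List;

    final class OptimizerTimeSummary
    {
        public static void main(String[] args)
        {
            // totalTime of the three slowest rules from the summary above
            List<Long> slowestRulesNanos = List.of(5_120_641_837L, 4_029_729_665L, 3_586_457_350L);
            double seconds = slowestRulesNanos.stream().mapToLong(Long::longValue).sum() / 1e9;
            // Prints ~12.74, so these three rules alone account for a large share
            // of the 30-50 second planning times reported above.
            System.out.printf("slowest rules: %.2f s%n", seconds);
        }
    }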

wendigo commented 1 month ago

@sriharshaj You can enable tracing (OpenTelemetry) and capture what the cluster is doing.

sriharshaj commented 1 month ago

@wendigo Can I capture those metrics with JMX? We don't have OpenTelemetry set up.

wendigo commented 1 month ago

@sriharshaj JMX keeps aggregates, not individual events.

sriharshaj commented 1 month ago

@wendigo I installed Trino locally and ran the same query. The planning phase took approximately 15 seconds.

The ConnectorMetadata.getTableProperties method is taking around 1.5 to 2.0 seconds to retrieve the table metadata.

During query optimization in Trino, metadata is being fetched five times, and during the fragment generation phase, it’s being retrieved three additional times.
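
To see why the repeated fetches hurt, here is a minimal, hypothetical sketch of memoizing ConnectorMetadata.getTableProperties per table handle. This is not Trino's actual caching code; it only illustrates that eight uncached calls at 1.5 to 2.0 seconds each already explain roughly 12 to 16 seconds of planning time:

    import io.trino.spi.connector.ConnectorMetadata;
    import io.trino.spi.connector.ConnectorSession;
    import io.trino.spi.connector.ConnectorTableHandle;
    import io.trino.spi.connector.ConnectorTableProperties;

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical wrapper that caches getTableProperties results per table handle.
    final class MemoizingMetadata
    {
        private final ConnectorMetadata delegate;
        private final Map<ConnectorTableHandle, ConnectorTableProperties> cache = new ConcurrentHashMap<>();

        MemoizingMetadata(ConnectorMetadata delegate)
        {
            this.delegate = delegate;
        }

        ConnectorTableProperties getTableProperties(ConnectorSession session, ConnectorTableHandle table)
        {
            // Each cache miss pays the full metadata read (1.5-2.0 s here); without
            // caching, the 5 + 3 planner calls pay that cost every time.
            return cache.computeIfAbsent(table, handle -> delegate.getTableProperties(session, handle));
        }
    }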

sriharshaj commented 1 month ago

This issue occurs exclusively with Iceberg queries. For Hive, everything works as expected.

wendigo commented 1 month ago

@sriharshaj I believe that @raunaqmorarka recently added a cache for metadata files.

wendigo commented 1 month ago

What's the version you are using? @sriharshaj

sriharshaj commented 1 month ago

@wendigo We have been facing this issue since 451.

I traced down the issue to this change: https://github.com/trinodb/trino/pull/15712/files#diff-e1cb17efec6787989f9df9ee40c4f2809ff3fe946cd2ec721ff8932b131997b8R618.

Our schema contains a large number of nested fields, which results in every column being mapped to an IcebergColumnHandle. When I debugged a specific table, it was mapping approximately 2,260 nested columns to IcebergColumnHandle, which could be significantly impacting planning time.
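
To make the fan-out concrete, here is a small self-contained sketch. The Field record and all names are hypothetical, not the connector's IcebergColumnHandle code; it only shows how nested struct fields multiply into one handle per field when every field is mapped eagerly:

    import java.util.List;
    import java.util.stream.IntStream;

    // Toy model of a nested schema field.
    record Field(String name, List<Field> children)
    {
        static Field leaf(String name)
        {
            return new Field(name, List.of());
        }
    }

    final class ColumnHandleFanOut
    {
        // Counts every field, nested ones included, mirroring a mapping that
        // creates a handle for each of them up front.
        static int countHandles(List<Field> fields)
        {
            int total = 0;
            for (Field field : fields) {
                total += 1 + countHandles(field.children());
            }
            return total;
        }

        public static void main(String[] args)
        {
            // One top-level struct holding 10 nested structs of 15 leaves each
            // yields 161 handles; fourteen such columns exceed 2,250 handles,
            // in the ballpark of the ~2,260 observed on our table.
            List<Field> nested = IntStream.range(0, 10)
                    .mapToObj(i -> new Field("struct_" + i,
                            IntStream.range(0, 15).mapToObj(j -> Field.leaf("leaf_" + j)).toList()))
                    .toList();
            System.out.println(countHandles(List.of(new Field("payload", nested)))); // 161
        }
    }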

wendigo commented 1 month ago

@krvikash @raunaqmorarka can you take a look?

sriharshaj commented 1 month ago

Thank you, @wendigo, for your guidance in helping me identify the issue.

sriharshaj commented 1 month ago

@krvikash @raunaqmorarka Any updates?

krvikash commented 1 month ago

Hi @sriharshaj, I have not gotten a chance to look into this yet. I will take a look this week.

krvikash commented 1 month ago

Hi @sriharshaj, I opened https://github.com/trinodb/trino/pull/23586 to load only the required columns in the map. If you have a way to try this change before it is merged, please check whether it reduces the planning time for your query.
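
For context, a hedged sketch of the idea behind the fix (names are illustrative, not the connector's actual code): build the field-id-to-handle map only for the columns the query references, instead of materializing a handle for every field in the schema:

    import java.util.Map;
    import java.util.Set;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    final class RequiredColumnsOnly
    {
        record ColumnHandle(int fieldId, String name) {}

        // One handle per referenced field id, instead of one per schema field.
        static Map<Integer, ColumnHandle> requiredColumnsById(
                Set<Integer> requiredFieldIds,
                Function<Integer, ColumnHandle> loadHandle)
        {
            return requiredFieldIds.stream()
                    .collect(Collectors.toMap(id -> id, loadHandle));
        }
    }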

sriharshaj commented 1 month ago

@krvikash Sure, I will try this fix today.

sriharshaj commented 1 month ago

@krvikash I tried the fix and it worked. Planning time is now in milliseconds. Thank you for the fix.

krvikash commented 1 month ago

Great! Thanks, @sriharshaj, for reporting this and verifying the fix.