trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.26k stars 2.95k forks source link

Flaky `TestIcebergMinioOrcConnectorTest.testOptimizeTimePartitionedTable` #21736

Closed ebyhr closed 2 months ago

ebyhr commented 5 months ago
Error:  io.trino.plugin.iceberg.TestIcebergMinioOrcConnectorTest.testOptimizeTimePartitionedTable -- Time elapsed: 92.22 s <<< ERROR!
io.trino.testing.QueryFailedException: Failed to close manifest writer
    at io.trino.testing.AbstractTestingTrinoClient.execute(AbstractTestingTrinoClient.java:133)
    at io.trino.testing.DistributedQueryRunner.executeInternal(DistributedQueryRunner.java:501)
    at io.trino.testing.DistributedQueryRunner.executeWithPlan(DistributedQueryRunner.java:490)
    at io.trino.testing.QueryAssertions.assertDistributedUpdate(QueryAssertions.java:108)
    at io.trino.testing.QueryAssertions.assertUpdate(QueryAssertions.java:62)
    at io.trino.testing.AbstractTestQueryFramework.assertUpdate(AbstractTestQueryFramework.java:410)
    at io.trino.plugin.iceberg.BaseIcebergConnectorTest.testOptimizeTimePartitionedTable(BaseIcebergConnectorTest.java:5165)
    at io.trino.plugin.iceberg.BaseIcebergConnectorTest.testOptimizeTimePartitionedTable(BaseIcebergConnectorTest.java:5124)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
    at java.base/java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:194)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:507)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1[491](https://github.com/trinodb/trino/actions/runs/8868062958/job/24347265601#step:5:492))
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:2073)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:2035)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:187)
    Suppressed: java.lang.Exception: SQL: ALTER TABLE test_optimize_time_partitioned_timestamp6withtimezone_days EXECUTE optimize WHERE p >= DATE '2022-04-01'
        at io.trino.testing.DistributedQueryRunner.executeInternal(DistributedQueryRunner.java:508)
        ... 13 more
    at software.amazon.awssdk.services.s3.DefaultS3Client.putObject(DefaultS3Client.java:10220)
    at io.trino.filesystem.s3.S3OutputStream.flushBuffer(S3OutputStream.java:219)
    ... 19 more
    Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 1 failure: Unable to execute HTTP request: Timeout waiting for connection from pool
    Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 2 failure: Unable to execute HTTP request: Timeout waiting for connection from pool
    Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 3 failure: Unable to execute HTTP request: Timeout waiting for connection from pool
Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:316)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:282)
    at software.amazon.awssdk.http.apache.internal.conn.ClientConnectionRequestFactory$DelegatingConnectionRequest.get(ClientConnectionRequestFactory.java:92)
    at software.amazon.awssdk.http.apache.internal.conn.ClientConnectionRequestFactory$InstrumentedConnectionRequest.get(ClientConnectionRequestFactory.java:69)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at software.amazon.awssdk.http.apache.internal.impl.ApacheSdkHttpClient.execute(ApacheSdkHttpClient.java:72)
    at software.amazon.awssdk.http.apache.ApacheHttpClient.execute(ApacheHttpClient.java:254)
    at software.amazon.awssdk.http.apache.ApacheHttpClient.access$500(ApacheHttpClient.java:104)
    at software.amazon.awssdk.http.apache.ApacheHttpClient$1.call(ApacheHttpClient.java:231)
    at software.amazon.awssdk.http.apache.ApacheHttpClient$1.call(ApacheHttpClient.java:228)
    at software.amazon.awssdk.core.internal.util.MetricUtils.measureDurationUnsafe(MetricUtils.java:99)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.executeHttpRequest(MakeHttpRequestStage.java:79)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.execute(MakeHttpRequestStage.java:57)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.execute(MakeHttpRequestStage.java:40)
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:72)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:55)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:39)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
    ... 42 more

https://github.com/trinodb/trino/actions/runs/8868062958/job/24347265601

ebyhr commented 4 months ago

https://github.com/trinodb/trino/actions/runs/9211283607/job/25340360233

ebyhr commented 4 months ago

https://github.com/trinodb/trino/actions/runs/9222982265/job/25375323409?pr=22108

ebyhr commented 4 months ago

https://github.com/trinodb/trino/actions/runs/9231156441/job/25400573375

findepi commented 4 months ago

https://github.com/trinodb/trino/actions/runs/9255857623/job/25460613254?pr=22096

findepi commented 4 months ago

https://github.com/trinodb/trino/actions/runs/9260240275/job/25473698200?pr=22096

findepi commented 4 months ago

Timeout waiting for connection from pool looks like we may be leaking connections (https://github.com/trinodb/trino/pull/22138#issuecomment-2133623525, cc @wendigo )

findepi commented 4 months ago

https://github.com/trinodb/trino/actions/runs/9269652934/job/25501599155?pr=22096

findinpath commented 4 months ago

https://github.com/trinodb/trino/actions/runs/9294322023/job/25579300297

Error:  io.trino.plugin.iceberg.TestIcebergMinioOrcConnectorTest.testOptimizeTimePartitionedTable -- Time elapsed: 304.7 s <<< ERROR!
io.trino.testing.QueryFailedException: Failed to open file: s3://test-iceberg-orc-uefiw4zbt3/iceberg_data/tpch/test_optimize_time_partitioned_timestamp6withtimezone_days-c2e6f5bda0c34b46b66f809c6f5b7cd7/metadata/f9b587a8-44f3-4ceb-b4ec-18b19188888c-m0.avro
    at io.trino.testing.AbstractTestingTrinoClient.execute(AbstractTestingTrinoClient.java:133)
    at io.trino.testing.DistributedQueryRunner.executeInternal(DistributedQueryRunner.java:554)
    at io.trino.testing.DistributedQueryRunner.executeWithPlan(DistributedQueryRunner.java:543)
    at io.trino.testing.QueryAssertions.assertDistributedUpdate(QueryAssertions.java:108)
    at io.trino.testing.QueryAssertions.assertUpdate(QueryAssertions.java:62)
    at io.trino.testing.AbstractTestQueryFramework.assertUpdate(AbstractTestQueryFramework.java:410)
    at io.trino.plugin.iceberg.BaseIcebergConnectorTest.testOptimizeTimePartitionedTable(BaseIcebergConnectorTest.java:5163)
    at io.trino.plugin.iceberg.BaseIcebergConnectorTest.testOptimizeTimePartitionedTable(BaseIcebergConnectorTest.java:5122)
raunaqmorarka commented 3 months ago

https://github.com/trinodb/trino/actions/runs/9438056172/job/25994641052

findepi commented 3 months ago

https://github.com/trinodb/trino/actions/runs/9463902574/job/26070912130?pr=22302

ebyhr commented 3 months ago

The timeout happens at the following line in IcebergMetadata#finishOptimize: https://github.com/trinodb/trino/blob/8b2c12e83c039025c2cbac28987d5b9829560538/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java#L1528

The RewriteFiles rewriteFiles tries creating Math.max(2, Runtime.getRuntime().availableProcessors()) threads and leads to timeout because the max allowed connection is 2 in the test. https://github.com/apache/iceberg/blob/b6c949cd86c372a87a4d43a557c19dd310af80d8/core/src/main/java/org/apache/iceberg/SystemConfigs.java#L38-L43

I think this flaky test can be fixed by setting an executor like below:

        RewriteFiles rewriteFiles = transaction.newRewrite()
                .scanManifestsWith(...);
ebyhr commented 2 months ago

https://github.com/trinodb/trino/actions/runs/10004759950/job/27657617143?pr=22740