trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Flaky TestAlluxioFileSystem #23596

Open ebyhr opened 1 month ago

ebyhr commented 1 month ago

https://github.com/trinodb/trino/actions/runs/11073808274/job/30771180097

Error:  io.trino.filesystem.alluxio.TestAlluxioFileSystem -- Time elapsed: 90.72 s <<< ERROR!
java.util.concurrent.ExecutionException: org.testcontainers.containers.ContainerLaunchException: Container startup failed for image alluxio/alluxio:2.9.5
    at java.base/java.util.concurrent.CompletableFuture.wrapInExecutionException(CompletableFuture.java:345)
    at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:440)
    at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2117)
    at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:327)
    at org.testcontainers.junit.jupiter.TestcontainersExtension$StoreAdapter.start(TestcontainersExtension.java:276)
    at org.testcontainers.junit.jupiter.TestcontainersExtension$StoreAdapter.access$200(TestcontainersExtension.java:263)
    at org.testcontainers.junit.jupiter.TestcontainersExtension.lambda$null$4(TestcontainersExtension.java:83)
    at org.testcontainers.junit.jupiter.TestcontainersExtension.lambda$startContainers$5(TestcontainersExtension.java:83)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1597)
    at org.testcontainers.junit.jupiter.TestcontainersExtension.startContainers(TestcontainersExtension.java:83)
    at org.testcontainers.junit.jupiter.TestcontainersExtension.beforeAll(TestcontainersExtension.java:57)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:507)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1458)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:2034)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:189)
Caused by: org.testcontainers.containers.ContainerLaunchException: Container startup failed for image alluxio/alluxio:2.9.5
    at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:359)
    at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:330)
    at java.base/java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:831)
    at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:526)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1575)
Caused by: org.rnorth.ducttape.RetryCountExceededException: Retry limit hit with exception
    at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:88)
    at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:344)
    ... 6 more
Caused by: org.testcontainers.containers.ContainerLaunchException: Could not create/start container
    at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:563)
    at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:354)
    at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81)
    ... 7 more
Caused by: org.testcontainers.containers.ContainerLaunchException: Timed out waiting for log output matching '.*Primary started*
'
    at org.testcontainers.containers.wait.strategy.LogMessageWaitStrategy.waitUntilReady(LogMessageWaitStrategy.java:47)
    at org.testcontainers.containers.wait.strategy.AbstractWaitStrategy.waitUntilReady(AbstractWaitStrategy.java:52)
    at org.testcontainers.containers.GenericContainer.waitUntilContainerStarted(GenericContainer.java:909)
    at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:500)
    ... 9 more
ebyhr commented 1 month ago

https://github.com/trinodb/trino/actions/runs/11081935647/job/30794284771?pr=23419

ebyhr commented 1 month ago

https://github.com/trinodb/trino/actions/runs/11081955893/job/30794333467?pr=23598

ebyhr commented 1 month ago

https://github.com/trinodb/trino/actions/runs/11096912427/job/30827307239

ebyhr commented 1 month ago

https://github.com/trinodb/trino/actions/runs/11100241591/job/30835722630

jja725 commented 1 month ago

From the error log, it seems that the master is very slow to start in the GitHub environment, so it takes a long time for the master to be ready to serve RPCs. That's why we see the UNIMPLEMENTED error. We can probably set a longer timeout or try to allocate more resources for the test.
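
The "longer timeout" idea can be sketched with plain JDK code: keep probing until the master actually serves RPCs, and give up only after a configurable deadline. This is a minimal illustration, not the fix applied in Trino. `ReadinessPoller`, `awaitReady`, and the simulated probe are hypothetical names, and a real probe would attempt the handshake that currently fails with UNIMPLEMENTED.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: polls a readiness probe until it succeeds or a deadline passes.
final class ReadinessPoller
{
    private ReadinessPoller() {}

    // Returns the number of probe attempts that were needed.
    // A probe that throws is treated as "not ready yet" and retried.
    static int awaitReady(Callable<Boolean> probe, Duration timeout, Duration pollInterval)
            throws InterruptedException, TimeoutException
    {
        long deadline = System.nanoTime() + timeout.toNanos();
        int attempts = 0;
        Exception lastFailure = null;
        while (true) {
            attempts++;
            try {
                if (probe.call()) {
                    return attempts;
                }
            }
            catch (InterruptedException e) {
                throw e;
            }
            catch (Exception e) {
                // e.g. the service is listening but the RPC method is not registered yet
                lastFailure = e;
            }
            if (System.nanoTime() - deadline >= 0) {
                TimeoutException timeoutException =
                        new TimeoutException("Not ready after " + timeout + " (" + attempts + " attempts)");
                if (lastFailure != null) {
                    timeoutException.initCause(lastFailure);
                }
                throw timeoutException;
            }
            Thread.sleep(pollInterval.toMillis());
        }
    }

    public static void main(String[] args)
            throws Exception
    {
        // Simulated slow master: the first two probes fail as if the RPC service were missing.
        AtomicInteger calls = new AtomicInteger();
        int attempts = awaitReady(
                () -> {
                    if (calls.incrementAndGet() < 3) {
                        throw new IllegalStateException("UNIMPLEMENTED: service not registered yet");
                    }
                    return true;
                },
                Duration.ofSeconds(5),
                Duration.ofMillis(10));
        System.out.println("ready after " + attempts + " attempts");
    }
}
```

The trade-off is that a larger `timeout` only hides slow startups; if the CI runner is starved of resources, the test still gets slower, which is why allocating more resources is the other option mentioned above.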

alluxio.exception.status.UnavailableException: Failed to handshake with master localhost:19998 to load cluster default configuration values: UNIMPLEMENTED: Method not found: alluxio.grpc.meta.MetaMasterConfigurationService/GetConfiguration
    at 
ebyhr commented 1 month ago

@jja725 Could you run stress tests locally and send a PR? This test is very flaky. I hope you will fix it soon.

ebyhr commented 1 month ago

@jja725 Reopening, as it happened again. Please take another look.

Error:  io.trino.filesystem.alluxio.TestAlluxioFileSystem -- Time elapsed: 207.6 s <<< ERROR!
java.util.concurrent.ExecutionException: org.testcontainers.containers.ContainerLaunchException: Container startup failed for image alluxio/alluxio:2.9.5
    at java.base/java.util.concurrent.CompletableFuture.wrapInExecutionException(CompletableFuture.java:345)
    at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:440)
    at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2117)
    at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:327)
    at org.testcontainers.junit.jupiter.TestcontainersExtension$StoreAdapter.start(TestcontainersExtension.java:276)
    at org.testcontainers.junit.jupiter.TestcontainersExtension$StoreAdapter.access$200(TestcontainersExtension.java:263)
    at org.testcontainers.junit.jupiter.TestcontainersExtension.lambda$null$4(TestcontainersExtension.java:83)
    at org.testcontainers.junit.jupiter.TestcontainersExtension.lambda$startContainers$5(TestcontainersExtension.java:83)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1597)
    at org.testcontainers.junit.jupiter.TestcontainersExtension.startContainers(TestcontainersExtension.java:83)
    at org.testcontainers.junit.jupiter.TestcontainersExtension.beforeAll(TestcontainersExtension.java:57)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:507)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1458)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:2034)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:189)
Caused by: org.testcontainers.containers.ContainerLaunchException: Container startup failed for image alluxio/alluxio:2.9.5
    at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:359)
    at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:330)
    at java.base/java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:831)
    at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:526)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1575)
Caused by: org.rnorth.ducttape.RetryCountExceededException: Retry limit hit with exception
    at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:88)
    at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:344)
    ... 6 more
Caused by: org.testcontainers.containers.ContainerLaunchException: Could not create/start container
    at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:563)
    at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:354)
    at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81)
    ... 7 more
Caused by: org.testcontainers.containers.ContainerLaunchException: Timed out waiting for log output matching '.*Primary started*
'
    at org.testcontainers.containers.wait.strategy.LogMessageWaitStrategy.waitUntilReady(LogMessageWaitStrategy.java:47)
    at org.testcontainers.containers.wait.strategy.AbstractWaitStrategy.waitUntilReady(AbstractWaitStrategy.java:52)
    at org.testcontainers.containers.GenericContainer.waitUntilContainerStarted(GenericContainer.java:909)
    at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:500)
    ... 9 more

https://github.com/trinodb/trino/actions/runs/11157959013/job/31013228602

ebyhr commented 1 month ago

https://github.com/trinodb/trino/actions/runs/11204592382/job/31143156194?pr=23690

ebyhr commented 1 month ago

https://github.com/trinodb/trino/actions/runs/11208961823/job/31153442299

ebyhr commented 1 month ago

https://github.com/trinodb/trino/actions/runs/11317299889/job/31470574492

ebyhr commented 1 month ago

@jja725 This test is still flaky. Please take another look.

jja725 commented 1 month ago

@JiamingMai Do you mind taking a look, since I won't be available for a while? We can probably increase the timeout further, as in the previous fix.

ebyhr commented 4 weeks ago

https://github.com/trinodb/trino/actions/runs/11359681622/job/31596149268

ebyhr commented 2 weeks ago

https://github.com/trinodb/trino/actions/runs/11513454496/job/32050184041

ebyhr commented 2 weeks ago

https://github.com/trinodb/trino/actions/runs/11587724367/job/32260330705

ebyhr commented 1 week ago

https://github.com/trinodb/trino/actions/runs/11713967184/job/32627752858 @JiamingMai @jja725