opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] Flaky test in 2.x - org.opensearch.repositories.url.URLSnapshotRestoreIT.testUrlRepository #10827

Closed linuxpi closed 10 months ago

linuxpi commented 11 months ago

Describe the bug

Test was flaky in https://build.ci.opensearch.org/job/gradle-check/28581/

REPRODUCE WITH: ./gradlew ':modules:repository-url:internalClusterTest' --tests "org.opensearch.repositories.url.URLSnapshotRestoreIT.testUrlRepository" -Dtests.seed=274F20795072D986 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=mt-MT -Dtests.timezone=Canada/Saskatchewan -Druntime.java=17

peternied commented 11 months ago

Looking into this, the failure is an uncaught exception caused by the fail_stale_replica task being tripped. I highly suspect the exception was raised as the cluster started to shut down.

From the context of the integration test itself, it's completely unrelated to the source of this error.

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=50, name=opensearch[node_t0][generic][T#1], state=RUNNABLE, group=TGRP-URLSnapshotRestoreIT]
    at __randomizedtesting.SeedInfo.seed([274F20795072D986:8B650CDE4358A58F]:0)
Caused by: org.opensearch.core.concurrency.OpenSearchRejectedExecutionException: rejected execution of java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@278ce497[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@3182b92d[Wrapped task = [threaded] fail_stale_replica]] on org.opensearch.threadpool.Scheduler$SafeScheduledThreadPoolExecutor@6a9f5a57[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 433]
    at __randomizedtesting.SeedInfo.seed([274F20795072D986]:0)
    at app//org.opensearch.common.util.concurrent.OpenSearchAbortPolicy.rejectedExecution(OpenSearchAbortPolicy.java:67)
    at java.base@17.0.8/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:833)
    at java.base@17.0.8/java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340)
    at java.base@17.0.8/java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562)
    at app//org.opensearch.threadpool.ThreadPool.schedule(ThreadPool.java:481)
    at app//org.opensearch.common.util.concurrent.AbstractAsyncTask.rescheduleIfNecessary(AbstractAsyncTask.java:109)
    at app//org.opensearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:174)
    at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849)
    at java.base@17.0.8/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base@17.0.8/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base@17.0.8/java.lang.Thread.run(Thread.java:833)
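
The rejection in the trace above can be illustrated outside OpenSearch. The following is a minimal sketch, not the OpenSearch code path: it uses a plain JDK ScheduledThreadPoolExecutor (whose default handler aborts, much like OpenSearchAbortPolicy appears to here), and the class name RescheduleAfterShutdown and helper scheduleIsRejected are hypothetical names chosen for illustration. It only shows that scheduling a task on an executor that has already been shut down throws a RejectedExecutionException, which is the shape of failure AbstractAsyncTask.rescheduleIfNecessary seems to hit in the trace.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration class; not part of OpenSearch.
public class RescheduleAfterShutdown {

    /** Tries to schedule a no-op task and reports whether the executor rejected it. */
    public static boolean scheduleIsRejected(ScheduledThreadPoolExecutor scheduler) {
        try {
            scheduler.schedule(() -> {}, 10, TimeUnit.MILLISECONDS);
            return false;
        } catch (RejectedExecutionException e) {
            // The default AbortPolicy throws once the executor has been shut down.
            return true;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);

        // While the executor is running, scheduling succeeds.
        System.out.println("before shutdown, rejected: " + scheduleIsRejected(scheduler));

        // Simulate the node shutting down its scheduler.
        scheduler.shutdownNow();
        scheduler.awaitTermination(1, TimeUnit.SECONDS);

        // A late reschedule, like the one in the trace, is now rejected.
        System.out.println("after shutdown, rejected: " + scheduleIsRejected(scheduler));
    }
}
```

If the rejection is thrown on a pool thread and nothing catches it, the randomized testing framework reports it as an uncaught exception, regardless of what the test itself was asserting.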

peternied commented 10 months ago

I've closed out the associated PR. Through experiments, such as adding Thread.sleep(20) statements all over, I've been unable to find any reproduction. I'm going to close out the issue, as I'd rather find a strong reproduction in another test case than hold this up any longer.

./gradlew ':modules:repository-url:internalClusterTest' -Dtests.iters=100 --tests "org.opensearch.repositories.url.URLSnapshotRestoreIT.testUrlRepository" -Dtests.seed=274F20795072D986 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=mt-MT -Dtests.timezone=Canada/Saskatchewan