opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] IndexActionIT tests are flaky `Shard [test][0] is still locked after 5 sec waiting` #12408

Status: Open. peternied opened this issue 6 months ago.

peternied commented 6 months ago

Describe the bug

Seeing test failures due to `java.lang.AssertionError: Shard [test][0] is still locked after 5 sec waiting`

See test report: https://build.ci.opensearch.org/job/gradle-check/33973/testReport/

| Test Name | Duration | Age |
| --- | --- | --- |
| `org.opensearch.indexing.IndexActionIT.testAutoGenerateIdNoDuplicates {p0={"cluster.indices.replication.strategy":"SEGMENT"}}` | 4 min 8 sec | 1 |
| `org.opensearch.indexing.IndexActionIT.testInvalidIndexName {p0={"cluster.indices.replication.strategy":"SEGMENT"}}` | 5.4 sec | 1 |
| `org.opensearch.indexing.IndexActionIT.testCreateIndexWithLongName {p0={"cluster.indices.replication.strategy":"SEGMENT"}}` | 7.2 sec | 1 |
| `org.opensearch.indexing.IndexActionIT.testCreatedFlag {p0={"cluster.indices.replication.strategy":"SEGMENT"}}` | 9.2 sec | 1 |
| `org.opensearch.indexing.IndexActionIT.testCreatedFlagWithExternalVersioning {p0={"cluster.indices.replication.strategy":"SEGMENT"}}` | 9.4 sec | 1 |
| `org.opensearch.indexing.IndexActionIT.testCreateFlagWithBulk {p0={"cluster.indices.replication.strategy":"SEGMENT"}}` | 8.5 sec | 1 |
| `org.opensearch.indexing.IndexActionIT.testDocumentWithBlankFieldName {p0={"cluster.indices.replication.strategy":"SEGMENT"}}` | 7.6 sec | 1 |
| `org.opensearch.indexing.IndexActionIT.testCreatedFlagWithFlush {p0={"cluster.indices.replication.strategy":"SEGMENT"}}` | 8 sec | 1 |

Related component

Build

To Reproduce

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.indexing.IndexActionIT" -Dtests.method="testAutoGenerateIdNoDuplicates {p0={"cluster.indices.replication.strategy":"SEGMENT"}}" -Dtests.seed=6CB4AD2130F2C392 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en-PH -Dtests.timezone=Pacific/Yap -Druntime.java=21

Expected behavior

Tests are reliable

Additional Details

No response

mgodwan commented 6 months ago

@mch2 @dreamer-89 Could you please check this?

mch2 commented 5 months ago

Took a look here: this should be fixed by https://github.com/opensearch-project/OpenSearch/pull/11977. There is a race on shard shutdown that leaves some file handles open, and with WindowsFS this throws an error. I will clean up the PR and get it in ASAP.
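
For readers unfamiliar with this failure mode, here is a minimal sketch of how a shutdown that has not finished can surface as a "still locked" timeout. It is not OpenSearch code: `ShardLockSketch`, `tryLock`, and `unlock` are hypothetical names, a `Semaphore` stands in for the shard lock, and a latch stands in for the cleanup work (such as closing file handles) that the old shard instance still has to finish.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names come from OpenSearch.
public class ShardLockSketch {
    // One permit models a single owner of the shard at a time.
    private final Semaphore lock = new Semaphore(1);

    // Try to take the lock, waiting at most timeoutMillis; returns false on timeout.
    public boolean tryLock(long timeoutMillis) throws InterruptedException {
        return lock.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    // Release the lock; in this sketch it is called once shutdown work is done.
    public void unlock() {
        lock.release();
    }

    public static void main(String[] args) throws Exception {
        ShardLockSketch shard = new ShardLockSketch();
        CountDownLatch cleanupDone = new CountDownLatch(1);

        // The "old" shard instance owns the lock while it shuts down.
        shard.tryLock(0);
        Thread closer = new Thread(() -> {
            try {
                cleanupDone.await(); // shutdown is blocked on unfinished cleanup
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            shard.unlock();
        });
        closer.start();

        // A waiter with a short timeout gives up while cleanup is still pending.
        boolean acquiredEarly = shard.tryLock(50);

        // Once cleanup completes, the lock is released and the waiter succeeds.
        cleanupDone.countDown();
        closer.join();
        boolean acquiredLate = shard.tryLock(50);

        System.out.println("acquired while shutdown pending: " + acquiredEarly);
        System.out.println("acquired after shutdown finished: " + acquiredLate);
    }
}
```

In this sketch the first attempt times out and the second succeeds, which mirrors the assertion in the report only loosely: the real test waits 5 seconds, and the real cause described above is a race that leaves file handles open.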