opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] sporadic concurrent_modification_exception during query in 2.14 #14032

Closed janheise closed 1 month ago

janheise commented 1 month ago

Describe the bug

As the screenshot shows, a ConcurrentModificationException is thrown during the query.

[Screenshot 2024-06-06 at 15 11 43]

Graylog users who started using OpenSearch 2.14 noticed this problem in their queries, so we started to investigate.

The output of an msearch that hits this exception looks like this:

"failed":1,"failures":[{"shard":0,"index":"graylog_0","node":"jxRdA49HT4uuwWu7VVGyjw","reason":{"type":"concurrent_modification_exception","reason":null}}]}

So there is no stack trace and no log output at all.

While trying to reproduce the problem, I was lucky to have a debugger attached; it caught the exception, which produced the screenshot above.
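The `"reason":null` in the response is consistent with how JDK collections raise this exception: they throw it without a message. The following is a minimal sketch, not the OpenSearch code path. It assumes the exception comes from a plain `java.util.HashMap` (an assumption; the issue does not confirm which collection is involved), and the class name `CmeMessageDemo` is made up for illustration. It provokes the exception deterministically in a single thread to show that `getMessage()` returns `null`.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not part of OpenSearch.
final class CmeMessageDemo {

    // Provokes a ConcurrentModificationException by structurally modifying
    // a HashMap while iterating over its key set. Returns the caught
    // exception, or null if none was thrown.
    static ConcurrentModificationException provoke() {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("a", 1);
        counts.put("b", 2);
        try {
            for (String key : counts.keySet()) {
                // Adding a new key changes the map's modification count,
                // so the iterator fails on its next step.
                counts.put(key + "-copy", 0);
            }
        } catch (ConcurrentModificationException e) {
            return e;
        }
        return null;
    }

    public static void main(String[] args) {
        ConcurrentModificationException e = provoke();
        System.out.println("type:    " + e.getClass().getSimpleName());
        System.out.println("message: " + e.getMessage());
    }
}
```

In the real bug the modification presumably comes from another thread rather than from the iterating thread itself, which would explain why it only shows up sporadically.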

The exception is thrown by the following line in private void updateStaleCountOnCacheInsert(CleanupKey cleanupKey): https://github.com/opensearch-project/OpenSearch/blob/5b93f2e429996f6324248cb0bc5e2dd3a2150dbb/server/src/main/java/org/opensearch/indices/IndicesRequestCache.java#L571

If I'm correct, that line was introduced with https://github.com/opensearch-project/OpenSearch/pull/12707, which would also mean that the error could (and probably should) already have occurred in 2.13?

The error condition seems to be a bit awkward to reproduce:

A Graylog instance with a random message generator running, against which I ran the attached script/query, reproduced the error quite consistently every 2.5k to 3k queries against OpenSearch 2.14 in Docker.

When running OpenSearch via ./gradlew run with the debugger attached, it takes about 40k to 50k queries until the error shows up.

msearch3-loop.sh.txt

msearch3.req.txt

The query stays identical but fails at some point. I think there needs to be some traffic on the index so that the query is evaluated every time and not cached.
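For readers without the attachments, the reproduction loop described above could be approximated as follows. This is a sketch under assumptions, not the attached msearch3-loop.sh. The class name `MsearchReproLoop`, the command-line arguments, and the regex-based failure check are invented for illustration, and the request body must be supplied by the caller (for example the contents of msearch3.req.txt). It posts the same `_msearch` body repeatedly and stops at the first response that reports a shard failure.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenSearch or Graylog.
final class MsearchReproLoop {

    // Crude regex check (not a JSON parser): captures the "type" of the
    // first "reason" object that follows a "failures" array.
    private static final Pattern FAILURE_TYPE = Pattern.compile(
        "\"failures\"\\s*:\\s*\\[.*?\"reason\"\\s*:\\s*\\{\\s*\"type\"\\s*:\\s*\"([^\"]+)\"",
        Pattern.DOTALL);

    static Optional<String> firstFailureType(String responseBody) {
        Matcher m = FAILURE_TYPE.matcher(responseBody);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Sends the same request until a shard failure appears.
    // Returns the attempt number of the first failure, or -1 if none occurred.
    static long runUntilFailure(URI msearchUri, String ndjsonBody, long maxAttempts)
            throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(msearchUri)
            .header("Content-Type", "application/x-ndjson")
            .POST(HttpRequest.BodyPublishers.ofString(ndjsonBody))
            .build();
        for (long attempt = 1; attempt <= maxAttempts; attempt++) {
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            Optional<String> failure = firstFailureType(response.body());
            if (failure.isPresent()) {
                System.out.println("attempt " + attempt + ": " + failure.get());
                return attempt;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: MsearchReproLoop <msearch-url> <request-file>");
            return;
        }
        String body = Files.readString(Path.of(args[1]));
        long attempt = runUntilFailure(URI.create(args[0]), body, 100_000);
        System.out.println(attempt < 0 ? "no failure observed" : "failed on attempt " + attempt);
    }
}
```

Assuming a local node on the default port, it could be invoked with `http://localhost:9200/_msearch` as the first argument. As noted above, the index needs ongoing write traffic while the loop runs.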

Let me know if you need more information.

Related component

Search

To Reproduce

We're working on a setup.

Expected behavior

No ConcurrentModificationException should occur.

janheise commented 1 month ago

@kiranprakash154 Hi, may I ask you to take a look? I think, because it's a backport, the error will also occur in 3.0.0?

kiranprakash154 commented 1 month ago

Hey @janheise, Thanks for reporting, let me take a look.

kiranprakash154 commented 1 month ago

@janheise what were the contents of your index "graylog_0"? Can you provide them? That would make it easier for me to reproduce this.

janheise commented 1 month ago

@kiranprakash154 I attached two files that show the index structure and some data. The data gets randomly generated. Is this what you need?

graylog0_def.json graylog0_query.json