[BUG] Nodes dropping from cluster - possibly due to Java 21

AmiStrn commented 3 months ago

Describe the bug

Shortly (1-2 weeks) after upgrading from 2.8.0 (with Java 11) to version 2.12.0, with Java 21:

About 1-3 data nodes occasionally cannot respond to the cluster manager (90s timeout) and are dropped from the cluster, rejoining and dropping as soon as most shards are finally allocated. When left alone this cycle lasts around 20-30 minutes.

Log before the node is dropped: [2024-07-23T22:44:02,791][WARN ][o.o.c.c.LagDetector ] [PROD-manager-host] node [{PROD-data-opensearch-1234}{rXOYfZXrTDmgk_mU-AM1mw}{Hwm5ETnRREeemoBoZI2Yhg}{IP_REDACTED}{IP_REDACTED:9300}{dr}{box_type=default, zone=us-east-1a, shard_indexing_pressure_enabled=true}] is lagging at cluster state version [9372496], although publication of cluster state version [9372497] completed [1.5m] ago

Observations that correlate with a node going to be dropped -

High indexing pressure
SegRep delays
right before the drop the generic thread pool spikes to the max

Temporarily fixed by downgrading to Java 11, and in some cases to Java 17 (still on OpenSearch 2.12.0).

Related component

Storage:Durability

To Reproduce

Tried so many different ways, could not reproduce this.

Expected behavior

Nodes should remain part of the cluster, and be able to answer the cluster manager in a timely manner if they are not experiencing a truly severe, or abnormal issue.

Additional Details

Plugins discovery-ec2, repository-s3

Screenshots Screenshot 2024-08-02 at 10 55 11

Screenshot 2024-08-02 at 10 51 37

Screenshot 2024-08-02 at 10 54 21 Screenshot 2024-08-02 at 10 52 07 Screenshot 2024-08-02 at 10 51 55

Correlated spike of indexing (filtered out all nodes except the one failing and a random other one not experiencing issues): Screenshot 2024-08-02 at 11 01 22

Host/Environment (please complete the following information):

OS: Ubuntu 20.4
Version OpenSearch 2.12.0 [EDIT:]
Java version: correto 21.0.3.9-1

Additional context

Why post this as a bug?

while we don't know the actual root cause, after discussing in the last triage meeting with @peternied and @reta we thought it could at least help the community if anyone else is seeking to remedy a similar issue and has not yet tried downgrading the Java version.

gbbafna commented 3 months ago

Storage Triage : Attendees 1 2 3 4

@AmiStrn : Do we know if the issue is only present in Seg Rep clusters ? Can you please provide debug logs if any ?

For now I have removed Storage label , as the issue doesn't relate directly to it .

peternied commented 3 months ago

@gbbafna Can you assign it to another label, if not storage who should be investigating?

AmiStrn commented 3 months ago

Storage Triage : Attendees 1 2 3 4

@AmiStrn : Do we know if the issue is only present in Seg Rep clusters ? Can you please provide debug logs if any ?

For now I have removed Storage label , as the issue doesn't relate directly to it .

Sorry, we do not have debug logs for this issue. And we dont have a cluster without segrep to compare this to.

AmiStrn commented 2 months ago

@reta updated the java version in the description we were using (correto 21.0.3.9-1) in case it is related to https://github.com/opensearch-project/OpenSearch/pull/11968

reta commented 2 months ago

@reta updated the java version in the description we were using (correto 21.0.3.9-1) in case it is related to #11968

So far it seems to be mislead (although we clarified the deployment), the tests seems to be stable for 2.12.0 and correto 21.0.3.9-1

Bukhtawar commented 2 months ago

A thread dump would definitely have had been useful

AmiStrn commented 2 months ago

A thread dump would definitely have had been useful

100% agree. this is the kind of thing you want in hindsight but during production incidents are least likely to have time to do (not saying that is correct but it is the way it goes down in most cases).

opensearch-project / OpenSearch