AmiStrn opened this issue 3 months ago
@gbbafna Can you assign it another label? If not Storage, who should be investigating?
Storage Triage:
@AmiStrn: Do we know if the issue is only present in Seg Rep clusters? Can you please provide debug logs, if any?
For now I have removed the Storage label, as the issue doesn't relate directly to it.
Sorry, we do not have debug logs for this issue, and we don't have a cluster without segrep to compare this to.
@reta updated the Java version we were using in the description (Corretto 21.0.3.9-1) in case it is related to https://github.com/opensearch-project/OpenSearch/pull/11968
So far it seems to have been misleading (although we clarified the deployment); the tests seem to be stable on 2.12.0 with Corretto 21.0.3.9-1.
A thread dump would definitely have been useful
100% agree. This is the kind of thing you want in hindsight, but during a production incident you are least likely to have time to collect one (not saying that is correct, but it is the way it goes down in most cases).
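For future reference, a minimal sketch of how a thread dump could be captured during such an incident (assuming shell access to the affected data node, that <pid> is the OpenSearch Java process id, and that the node still answers HTTP on port 9200; the node name is taken from the log below):

# JDK tooling on the affected node (both ship with the JDK)
jcmd <pid> Thread.print > /tmp/opensearch-threaddump.txt
# or: jstack <pid> > /tmp/opensearch-threaddump.txt

# Alternative that can be run remotely as long as the node still serves HTTP
curl -s 'http://localhost:9200/_nodes/PROD-data-opensearch-1234/hot_threads?threads=9999' > /tmp/hot_threads.txt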
Describe the bug
Shortly (1-2 weeks) after upgrading from 2.8.0 (with Java 11) to version 2.12.0, with Java 21:
About 1-3 data nodes occasionally cannot respond to the cluster manager (90s timeout) and are dropped from the cluster, rejoining and then dropping again as soon as most shards are reallocated. Left alone, this cycle lasts around 20-30 minutes.
Log before the node is dropped:
[2024-07-23T22:44:02,791][WARN ][o.o.c.c.LagDetector ] [PROD-manager-host] node [{PROD-data-opensearch-1234}{rXOYfZXrTDmgk_mU-AM1mw}{Hwm5ETnRREeemoBoZI2Yhg}{IP_REDACTED}{IP_REDACTED:9300}{dr}{box_type=default, zone=us-east-1a, shard_indexing_pressure_enabled=true}] is lagging at cluster state version [9372496], although publication of cluster state version [9372497] completed [1.5m] ago
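The 1.5m in that message lines up with the LagDetector's follower lag timeout, which defaults to 90 seconds. A minimal sketch of a possible stop-gap while investigating, assuming the relevant setting is cluster.follower_lag.timeout and that it is applied via opensearch.yml on the cluster-manager-eligible nodes (the value is illustrative; raising it only buys time, it does not fix the underlying stall):

# opensearch.yml on cluster-manager-eligible nodes (default is 90s)
cluster.follower_lag.timeout: 180s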
Observations that correlate with a node about to be dropped: the generic thread pool spikes to the max.

Temporarily fixed by downgrading to Java 11, and in some cases to Java 17 (still on OpenSearch 2.12.0).
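A minimal sketch of how the generic thread pool spike could be watched across nodes while the issue is happening, using the _cat thread pool API (the column selection is illustrative):

# poll this while a node is lagging; sustained queueing or max active threads on the generic pool of the affected node matches the observation above
curl -s 'http://localhost:9200/_cat/thread_pool/generic?v&h=node_name,name,active,queue,rejected,largest,completed'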
Related component
Storage:Durability
To Reproduce
Tried so many different ways, could not reproduce this.
Expected behavior
Nodes should remain part of the cluster and be able to answer the cluster manager in a timely manner if they are not experiencing a truly severe or abnormal issue.
Additional Details
Plugins: discovery-ec2, repository-s3
Screenshots
Correlated spike of indexing (filtered out all nodes except the one failing and a random other one not experiencing issues):
Host/Environment (please complete the following information):
Additional context
Why post this as a bug?