[BUG] Container doesn't start up after ungraceful termination

cpockrandt commented 10 months ago

Describe the bug

OpenSearch 2.11 is running in Kubernetes with 3 pods; it is a pretty vanilla installation using the latest Helm chart, 2.17.0. The pods terminate ungracefully (e.g., through a blackout of the cluster, or a bug in node eviction that does not respect the graceful termination period).

After the pods come back up, one of the pods starts outputting errors:

{"type": "server", "timestamp": "2023-11-24T14:52:02,526Z", "level": "ERROR", "component": "o.o.b.OpenSearchUncaughtExceptionHandler", "cluster.name": "os", "node.name": "os-mngr-1", "message": "uncaught exception in thread [main]", 
"stacktrace": ["org.opensearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/opensearch/data]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?",
...

It seems there are lock files on the PVC that prevent the container from starting:

/usr/share/opensearch/data/nodes/0/node.lock
/usr/share/opensearch/data/nodes/0/_state/write.lock
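
To make the check above repeatable, here is a minimal sketch that lists these lock files under a data directory. It assumes the PVC is mounted at /usr/share/opensearch/data, as in the paths above; the function name `find_lock_files` is only illustrative. Note that the mere presence of these files may not mean a lock is actually held, since the lock itself is normally an OS-level file lock, so treat the output as a starting point for inspection rather than a diagnosis.

```python
from pathlib import Path

# Lock file names taken from the two paths reported above.
LOCK_NAMES = {"node.lock", "write.lock"}


def find_lock_files(data_dir):
    """Return the node.lock / write.lock files found anywhere under data_dir."""
    root = Path(data_dir)
    if not root.is_dir():
        # Nothing to scan (e.g. the volume is not mounted here).
        return []
    return sorted(
        p for p in root.rglob("*.lock")
        if p.is_file() and p.name in LOCK_NAMES
    )


if __name__ == "__main__":
    # Assumed mount point of the PVC inside the container.
    for path in find_lock_files("/usr/share/opensearch/data"):
        print(path)
```

This only reads the directory tree; it does not delete or modify anything on the volume.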

What is the recommended way of dealing with this situation? Deleting the lock files lets the container start, but unassigned shards remain, leaving the cluster in a yellow state. The only solution that has worked so far is to delete the PVC and the pod, let the StatefulSet recreate both, and wait for the missing index replicas to be rebuilt.
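
For anyone trying to reproduce the yellow state, here is a rough sketch of how the unassigned shards could be listed through the `_cat/shards` API. It assumes the cluster is reachable over plain HTTP without authentication (for example through a port-forward), which will not hold for a default install with the security plugin enabled; `fetch_shards`, `unassigned_shards`, and the base URL passed in are illustrative names, not part of any OpenSearch client.

```python
import json
import urllib.request


def unassigned_shards(rows):
    """Keep only the _cat/shards rows whose state is UNASSIGNED."""
    return [row for row in rows if row.get("state") == "UNASSIGNED"]


def fetch_shards(base_url):
    """Fetch shard rows as JSON from the _cat/shards API.

    Assumption: base_url points at an endpoint that needs no TLS client
    setup or credentials, e.g. "http://localhost:9200".
    """
    url = (
        f"{base_url}/_cat/shards"
        "?format=json&h=index,shard,prirep,state,unassigned.reason"
    )
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def describe(rows):
    """Format unassigned shards as 'index/shard (p|r): reason' strings."""
    return [
        f"{r['index']}/{r['shard']} ({r['prirep']}): {r.get('unassigned.reason')}"
        for r in unassigned_shards(rows)
    ]
```

Calling `describe(fetch_shards("http://localhost:9200"))` against a reachable cluster would give one line per unassigned shard, including the reason the cluster reports for it, which is useful to include when asking for help.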

Shouldn't there be a way, or some documentation, describing how to proceed in such a case? Ideally with a less aggressive strategy than deleting the disk.

I also tried asking the community for help, but no luck so far: https://forum.opensearch.org/t/unassigned-shards-after-killed-containers-blackout/16812

prudhvigodithi commented 9 months ago

[Untriage] Hey @cpockrandt, thanks for reporting this bug. May I know if you are using NFS for the PVC?

cpockrandt commented 9 months ago

> [Untriage] Hey @cpockrandt, thanks for reporting this bug. May I know if you are using NFS for the PVC?

Hey @prudhvigodithi, I used PVC.

opensearch-project / helm-charts

[BUG] Container doesn't start up after ungraceful termination #511