pires / kubernetes-elasticsearch-cluster

Elasticsearch cluster on top of Kubernetes made easy.

Data nodes failing to restart #239

Open gcyre opened 5 years ago

gcyre commented 5 years ago

Hello

I've followed the instructions to set up an ES cluster in our test k8s cluster and have run into a couple of issues with the data nodes.

The first issue is with creating the two data nodes: I'm not able to run two data pods on a single k8s worker node. I've updated my es-data.yaml to use a PVC on an external SAN. When the second pod comes up, there's an error in the logs:

org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [[/data/data/myesdb]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?
    at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:140) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:127) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) ~[elasticsearch-cli-6.3.2.jar:6.3.2]
    at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-6.3.2.jar:6.3.2]
    at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:93) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:86) ~[elasticsearch-6.3.2.jar:6.3.2]
Caused by: java.lang.IllegalStateException: failed to obtain node locks, tried [[/data/data/myesdb]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?
    at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:243) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.node.Node.<init>(Node.java:270) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.node.Node.<init>(Node.java:252) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:213) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:213) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:326) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:136) ~[elasticsearch-6.3.2.jar:6.3.2]
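
From the error message, it looks like both pods end up pointing at the same data path, and Elasticsearch only allows one node per path by default. One thing I considered is raising node.max_local_storage_nodes. A minimal sketch of what I mean is below; I'm assuming here that the image maps a MAX_LOCAL_STORAGE_NODES env var onto that setting, which I haven't verified:

# es-data.yaml (excerpt) - sketch only; assumes the image honors
# MAX_LOCAL_STORAGE_NODES and maps it to node.max_local_storage_nodes
spec:
  containers:
  - name: es-data
    image: quay.io/pires/docker-elasticsearch-kubernetes:6.3.2
    env:
    - name: MAX_LOCAL_STORAGE_NODES
      value: "2"   # allow two ES nodes to share the same data path on one worker

That said, two ES nodes sharing one volume seems fragile, so a separate volume per pod is probably the real fix.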

The second issue is when a pod dies and gets restarted by Kubernetes. When the pod comes back up, there are errors relating to lock files:

[2018-11-05T21:01:32,426][WARN ][o.e.i.e.Engine           ] [es-data-6c487c6b77-nfqv9] [logstash-2018.10.28][2] could not lock IndexWriter
org.apache.lucene.store.LockObtainFailedException: Lock held by another program: /data/data/nodes/0/indices/3V6MF-EXTiOJ-yO93SVX8g/2/index/write.lock
    at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:130) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
    at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
    at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
    at org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:104) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:948) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
    at org.elasticsearch.index.engine.InternalEngine.createWriter(InternalEngine.java:1939) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.index.engine.InternalEngine.createWriter(InternalEngine.java:1930) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:191) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:157) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:2160) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:2142) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1349) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1304) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:420) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:301) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1575) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$5(IndexShard.java:2028) ~[elasticsearch-6.3.2.jar:6.3.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:626) [elasticsearch-6.3.2.jar:6.3.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
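
My guess is that the replacement pod starts while the old pod (or at least its lock files on the SAN volume) is still around, so Lucene's write.lock is still held. Since this is a Deployment, maybe switching the update strategy to Recreate, so the old pod is fully gone before the new one mounts the volume, would avoid the overlap. A rough sketch of what I mean:

# es-data.yaml (excerpt) - sketch only; make the old pod terminate and
# release /data/.../write.lock before a replacement starts
spec:
  replicas: 2
  strategy:
    type: Recreate   # default RollingUpdate lets old and new pods overlap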

Should I be creating the pods as StatefulSets? Am I missing something in the configuration?
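
On the StatefulSet idea, what I had in mind is using volumeClaimTemplates so that each data pod gets its own PVC instead of sharing one. A rough, untested sketch with my own names (the storage class is a placeholder for our SAN-backed class):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: es-data
spec:
  serviceName: elasticsearch-data   # headless service; name is mine
  replicas: 2
  selector:
    matchLabels:
      component: elasticsearch
      role: data
  template:
    metadata:
      labels:
        component: elasticsearch
        role: data
    spec:
      containers:
      - name: es-data
        image: quay.io/pires/docker-elasticsearch-kubernetes:6.3.2
        volumeMounts:
        - name: storage
          mountPath: /data
  volumeClaimTemplates:   # one PVC per pod: storage-es-data-0, storage-es-data-1
  - metadata:
      name: storage
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: san-storage   # placeholder for the SAN-backed class
      resources:
        requests:
          storage: 10Gi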

Any help would be appreciated.

Thanks, Garry