magda-io / magda

A federated, open-source data catalog for all your big data and small data
https://magda.io
Apache License 2.0
509 stars 93 forks source link

Dev ES cluster occasionally loses all its datasets #1276

Open AlexGilleran opened 6 years ago

AlexGilleran commented 6 years ago

Problem description

This is an annoying investigative one.

Occasionally the ES cluster in dev will suddenly end up with 0 datasets. I still haven't found any logs that describe it. It hasn't (yet?) happened in prod, presumably because ES is configured with 2 data nodes and can recover itself.

AlexGilleran commented 6 years ago

Log from this happening (es-data-0):

[2018-06-28T07:39:18,362][INFO ][o.e.c.s.ClusterService   ] [u68-X59] new_master {u68-X59}{u68-X59QTky8cGhy_46Q6A}{KFSbHsT4TXmqc1Xiq0l1MQ}{10.0.1.238}{10.0.1.238:9300}, reason: zen-disco-elected-as-master ([0] nodes joined)
[2018-06-28T07:39:18,394][INFO ][o.e.h.n.Netty4HttpServerTransport] [u68-X59] publish_address {10.0.1.238:9200}, bound_addresses {[::]:9200}
[2018-06-28T07:39:18,394][INFO ][o.e.n.Node               ] [u68-X59] started
[2018-06-28T07:39:19,460][INFO ][o.e.g.GatewayService     ] [u68-X59] recovered [6] indices into cluster_state
[2018-06-28T07:39:19,827][INFO ][o.e.c.m.MetaDataDeleteIndexService] [u68-X59] [datasets32/A3j2ntJ0S1i5A8vADYzFpw] deleting index
[2018-06-28T07:39:27,470][INFO ][o.e.c.m.MetaDataDeleteIndexService] [u68-X59] [regions21/kikpyi1nQiCH1AYMX8TbOQ] deleting index
[2018-06-28T07:39:28,591][INFO ][o.e.c.m.MetaDataCreateIndexService] [u68-X59] [datasets32] creating index, cause [api], templates [], shards [1]/[0], mappings [Format, datasets, Publisher]
[2018-06-28T07:39:28,680][INFO ][o.e.c.m.MetaDataCreateIndexService] [u68-X59] [regions21] creating index, cause [api], templates [], shards [1]/[0], mappings [regions]
[2018-06-28T07:39:28,722][INFO ][o.e.c.r.a.AllocationService] [u68-X59] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[.kibana][0], [datasets32][0]] ...]).

Logs from indexer at the same time:

[INFO] [06/28/2018 07:38:53.590] [main] [IndexerApp$(akka://indexer)] Starting Indexer
[INFO] [06/28/2018 07:38:53.591] [main] [IndexerApp$(akka://indexer)] Log level is INFO
[INFO] [06/28/2018 07:38:53.653] [indexer-akka.actor.default-dispatcher-5] [akka.actor.ActorSystemImpl(indexer)] No password specified, starting without XPack
[INFO] [06/28/2018 07:38:55.056] [main] [IndexerApp$(akka://indexer)] Listening on 0.0.0.0:80
[INFO] [06/28/2018 07:38:56.705] [main] [RegisterWebhook$(akka://indexer)] Looking up existing webhook with id indexer
[INFO] [06/28/2018 07:38:57.851] [indexer-akka.actor.default-dispatcher-2] [RegisterWebhook$(akka://indexer)] Hook already exists, updating...
[INFO] [06/28/2018 07:38:58.271] [indexer-akka.actor.default-dispatcher-3] [akka.actor.ActorSystemImpl(indexer)] Successfully added webhook
[INFO] [06/28/2018 07:38:58.271] [indexer-akka.actor.default-dispatcher-3] [RegisterWebhook$(akka://indexer)] Updated, attempting to resume...
[INFO] [06/28/2018 07:38:58.282] [indexer-akka.actor.default-dispatcher-9] [RegisterWebhook$(akka://indexer)] Successfully resumed webhook
[INFO] [06/28/2018 07:38:58.651] [indexer-akka.actor.default-dispatcher-5] [akka.actor.ActorSystemImpl(indexer)] Successfully made initial contact with the ES client (this doesn't mean we're fully connected yet!)
[ERROR] [06/28/2018 07:39:08.752] [indexer-akka.actor.default-dispatcher-5] [akka.actor.ActorSystemImpl(indexer)] Failed to get indexes, 9 retries left WARNING arguments left: 1
[ERROR] [06/28/2018 07:39:18.771] [indexer-akka.actor.default-dispatcher-8] [akka.actor.ActorSystemImpl(indexer)] Failed to get indexes, 8 retries left WARNING arguments left: 1
[WARN] [06/28/2018 07:39:18.896] [indexer-akka.actor.default-dispatcher-9] [akka.actor.ActorSystemImpl(indexer)] regions index was not present, if this is the first boot with a new index version this is fine: no such index
[WARN] [06/28/2018 07:39:18.920] [indexer-akka.actor.default-dispatcher-9] [akka.actor.ActorSystemImpl(indexer)] datasets index was not present, if this is the first boot with a new index version this is fine: no such index
[INFO] [06/28/2018 07:39:18.922] [indexer-akka.actor.default-dispatcher-9] [akka.actor.ActorSystemImpl(indexer)] datasets index version 32 does not exist, creating
[INFO] [06/28/2018 07:39:18.927] [indexer-akka.actor.default-dispatcher-9] [akka.actor.ActorSystemImpl(indexer)] Attempting to restore snapshot for datasets version 32
[INFO] [06/28/2018 07:39:18.996] [indexer-akka.actor.default-dispatcher-9] [akka.actor.ActorSystemImpl(indexer)] regions index version 21 does not exist, creating
[INFO] [06/28/2018 07:39:19.001] [indexer-akka.actor.default-dispatcher-9] [akka.actor.ActorSystemImpl(indexer)] Attempting to restore snapshot for regions version 21
[INFO] [06/28/2018 07:39:19.503] [indexer-akka.actor.default-dispatcher-9] [akka.actor.ActorSystemImpl(indexer)] Could not find a snapshot for datasets version 32
[INFO] [06/28/2018 07:39:19.504] [indexer-akka.actor.default-dispatcher-5] [akka.actor.ActorSystemImpl(indexer)] Could not find a snapshot for regions version 21
[INFO] [06/28/2018 07:39:28.779] [indexer-akka.actor.default-dispatcher-9] [akka.actor.ActorSystemImpl(indexer)] Index datasets version 32 created
[INFO] [06/28/2018 07:39:28.788] [indexer-akka.actor.default-dispatcher-5] [akka.actor.ActorSystemImpl(indexer)] Index regions version 21 created
[INFO] [06/28/2018 07:39:28.938] [indexer-akka.actor.default-dispatcher-3] [akka.actor.ActorSystemImpl(indexer)] Found shapes for region LGA at /usr/regions/LGA.json, loading from there
[INFO] [06/28/2018 07:39:28.956] [indexer-akka.actor.default-dispatcher-3] [akka.actor.ActorSystemImpl(indexer)] Found shapes for region SA2 at /usr/regions/SA2.json, loading from there
[INFO] [06/28/2018 07:39:28.957] [indexer-akka.actor.default-dispatcher-3] [akka.actor.ActorSystemImpl(indexer)] Found shapes for region SA4 at /usr/regions/SA4.json, loading from there
[INFO] [06/28/2018 07:39:28.962] [indexer-akka.actor.default-dispatcher-3] [akka.actor.ActorSystemImpl(indexer)] Found shapes for region SA3 at /usr/regions/SA3.json, loading from there
[INFO] [06/28/2018 07:39:28.963] [indexer-akka.actor.default-dispatcher-3] [akka.actor.ActorSystemImpl(indexer)] Found shapes for region SA1 at /usr/regions/SA1.json, loading from there
[INFO] [06/28/2018 07:39:56.420] [indexer-akka.actor.default-dispatcher-3] [akka.actor.ActorSystemImpl(indexer)] Found shapes for region POA at /usr/regions/POA.json, loading from there
[INFO] [06/28/2018 07:40:26.460] [indexer-akka.actor.default-dispatcher-7] [akka.actor.ActorSystemImpl(indexer)] Found shapes for region COM_ELB_ID_2016 at /usr/regions/COM_ELB_ID_2016.json, loading from there
[INFO] [06/28/2018 07:40:46.417] [indexer-akka.actor.default-dispatcher-5] [akka.actor.ActorSystemImpl(indexer)] Found shapes for region STE at /usr/regions/STE.json, loading from there
[INFO] [06/28/2018 09:24:31.057] [indexer-akka.actor.default-dispatcher-104] [akka.actor.ActorSystemImpl(indexer)] Successfully indexed 60630 regions
[INFO] [06/28/2018 09:24:31.058] [indexer-akka.actor.default-dispatcher-104] [akka.actor.ActorSystemImpl(indexer)] Creating snapshot for regions at version 21
[INFO] [06/28/2018 09:28:21.561] [indexer-akka.actor.default-dispatcher-110] [akka.actor.ActorSystemImpl(indexer)] Snapshotted 1 shards of 1 for regions21
[INFO] [06/28/2018 09:28:22.458] [indexer-akka.actor.default-dispatcher-108] [akka.actor.ActorSystemImpl(indexer)] Successfully indexed 12 datasets
[INFO] [06/28/2018 09:28:22.688] [indexer-akka.actor.default-dispatcher-111] [akka.actor.ActorSystemImpl(indexer)] Successfully indexed 12 datasets