sonian / elasticsearch-zookeeper


Immediate reaction to nodes with disconnected sessions triggers reshuffling while processes are running and healthy #15

Open mahdeto opened 11 years ago

mahdeto commented 11 years ago

Hi,

I am using the ZK plugin with both publishing options. It has been nice so far, but for some unknown reason the session disconnects every now and then. This causes the ephemeral node corresponding to the node that lost its session to be lost immediately, which in turn causes the cluster to reshuffle, even though the process that lost its session is alive and well and would immediately recreate the node after re-establishing its session to the server.

Is it possible to change the master so that it does not immediately count this as a failure? Instead, it could wait for a certain time and then check again, and only start recovery after the check has failed n times.
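To make the request concrete, here is a minimal sketch of such a grace period in Java. It is not the plugin's code: `NodeLossDebouncer` and its methods are hypothetical names, and it assumes the master can key nodes by a stable id and poll with its own clock. A node whose ephemeral node vanishes is only reported as lost if it has not re-registered within `graceMillis`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: delays reporting a lost node until a grace period has passed. */
final class NodeLossDebouncer {
    private final long graceMillis;
    // nodeId -> time (ms) at which its ephemeral node was first seen missing
    private final Map<String, Long> pendingLosses = new HashMap<>();

    NodeLossDebouncer(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    /** Called when the ephemeral node for nodeId disappears. */
    void nodeDisappeared(String nodeId, long nowMillis) {
        // Keep the earliest disappearance time if the node is already pending.
        pendingLosses.putIfAbsent(nodeId, nowMillis);
    }

    /** Called when the node re-registers; cancels any pending loss. */
    void nodeReappeared(String nodeId) {
        pendingLosses.remove(nodeId);
    }

    /** Returns nodes missing for at least the grace period and stops tracking them. */
    List<String> confirmLosses(long nowMillis) {
        List<String> confirmed = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = pendingLosses.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= graceMillis) {
                confirmed.add(entry.getKey());
                it.remove();
            }
        }
        return confirmed;
    }
}
```

Recovery would then start only for the ids returned by `confirmLosses`, so a node that reconnects inside the grace period never triggers a reshuffle. The trade-off is that a node that is really dead is detected `graceMillis` later.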

imotov commented 11 years ago

Could you post log files corresponding to such event?

mahdeto commented 11 years ago

I think this is the relevant part of the log:

[2013-04-30 04:00:14,616][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 04:00:14,784][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 04:00:14,846][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 04:00:14,911][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 04:00:15,071][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 04:00:15,199][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 08:10:05,926][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 13:23:45,309][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:12:28,013][INFO ][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Master is gone
[2013-04-30 17:12:28,013][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Electing master
[2013-04-30 17:12:28,110][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Found master: PtKWjrq4QymCGPKy2LrWFQ
[2013-04-30 17:12:28,110][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:12:28,151][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:12:28,154][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:12:28,194][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:12:28,195][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:14:49,571][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Session Disconnected
[2013-04-30 17:17:10,455][TRACE][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Checking if ZooKeeper session should be restarted
[2013-04-30 17:17:10,456][INFO ][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Restarting ZooKeeper discovery
[2013-04-30 17:17:10,456][TRACE][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Stopping ZooKeeper
[2013-04-30 17:17:10,456][DEBUG][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Closing zooKeeper
[2013-04-30 17:17:10,456][TRACE][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Starting ZooKeeper
[2013-04-30 17:17:10,456][INFO ][org.apache.zookeeper.ZooKeeper] Initiating client connection, connectString=node-01:2181,node-02:2181,node-03:2181 sessionTimeout=60000 watcher=com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService$1@5257932b
[2013-04-30 17:17:11,137][TRACE][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Started ZooKeeper
[2013-04-30 17:17:11,138][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Restarting ZK Discovery
[2013-04-30 17:17:11,138][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Creating root nodes in ZooKeeper
[2013-04-30 17:17:11,141][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Registering in ZooKeeper
[2013-04-30 17:17:11,148][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Electing master
[2013-04-30 17:17:11,148][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Found master: PtKWjrq4QymCGPKy2LrWFQ
[2013-04-30 17:17:11,148][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:17:12,604][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Session Connected
[2013-04-30 17:17:12,604][INFO ][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Master is gone
[2013-04-30 17:17:12,604][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Electing master
[2013-04-30 17:17:12,606][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Elected as master (N0Xv9y6ZSy663481l6NjGw)
[2013-04-30 17:17:12,606][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:17:12,748][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:19:38,669][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [922]
[2013-04-30 17:19:39,814][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Session Disconnected
[2013-04-30 17:19:39,814][TRACE][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Checking if ZooKeeper session should be restarted
[2013-04-30 17:19:39,814][INFO ][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Restarting ZooKeeper discovery
[2013-04-30 17:19:39,814][TRACE][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Stopping ZooKeeper
[2013-04-30 17:19:39,814][DEBUG][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Closing zooKeeper
[2013-04-30 17:19:39,814][TRACE][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Starting ZooKeeper
[2013-04-30 17:19:39,814][INFO ][org.apache.zookeeper.ZooKeeper] Initiating client connection, connectString=node-01:2181,node-02:2181,node-03:2181 sessionTimeout=60000 watcher=com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService$1@4ea4d7a6
[2013-04-30 17:19:40,145][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Session Connected
[2013-04-30 17:19:40,145][TRACE][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Started ZooKeeper
[2013-04-30 17:19:40,145][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Restarting ZK Discovery
[2013-04-30 17:19:40,145][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Creating root nodes in ZooKeeper
[2013-04-30 17:19:40,147][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Registering in ZooKeeper
[2013-04-30 17:19:40,155][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Electing master
[2013-04-30 17:19:40,156][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Found master: kMgo52H1SLy_fvhOdUdQhA
[2013-04-30 17:21:47,028][WARN ][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Session Expired Exception
[2013-04-30 17:21:47,028][WARN ][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Session Expired Exception
[2013-04-30 17:21:47,156][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Session Disconnected
[2013-04-30 17:21:47,210][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Restarting ZK Discovery
[2013-04-30 17:21:47,210][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:21:47,210][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Creating root nodes in ZooKeeper
[2013-04-30 17:22:45,011][INFO ][org.apache.zookeeper.ZooKeeper] Initiating client connection, connectString=node-01:2181,node-02:2181,node-03:2181 sessionTimeout=60000 watcher=com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService$1@6de1dadb
[2013-04-30 17:22:45,158][DEBUG][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Closing zooKeeper
[2013-04-30 17:22:51,978][INFO ][org.apache.zookeeper.ZooKeeper] Initiating client connection, connectString=node-01:2181,node-02:2181,node-03:2181 sessionTimeout=60000 watcher=com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService$1@639d0e0b
[2013-04-30 17:22:51,983][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Session Connected
[2013-04-30 17:22:51,984][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Creating root nodes in ZooKeeper
[2013-04-30 17:22:51,989][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Registering in ZooKeeper
[2013-04-30 17:22:52,002][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Electing master
[2013-04-30 17:22:52,015][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Elected as master (a37moRArQBOHy56j92xhsw)
[2013-04-30 17:22:52,015][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Retrieving new cluster state
[2013-04-30 17:22:52,215][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:22:52,272][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[a37moRArQBOHy56j92xhsw]], new nodes: [[a37moRArQBOHy56j92xhsw]], deleted: [[]], added[[]]
[2013-04-30 17:23:12,872][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:23:12,873][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[a37moRArQBOHy56j92xhsw]], new nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw]], deleted: [[]], added[[l2E0_dQzThOumbUA7ipjKA]]
[2013-04-30 17:23:12,881][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [2]
[2013-04-30 17:23:55,949][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:23:55,953][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw]], new nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[]], added[[kMgo52H1SLy_fvhOdUdQhA]]
[2013-04-30 17:23:55,959][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [3]
[2013-04-30 17:24:16,011][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:24:16,013][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[l2E0_dQzThOumbUA7ipjKA]], added[[]]
[2013-04-30 17:24:16,014][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [4]
[2013-04-30 17:25:24,182][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:25:24,183][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[]], added[[l2E0_dQzThOumbUA7ipjKA]]
[2013-04-30 17:25:24,190][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [5]
[2013-04-30 17:26:26,011][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:26:26,013][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[l2E0_dQzThOumbUA7ipjKA]], added[[]]
[2013-04-30 17:26:26,014][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [6]
[2013-04-30 17:27:30,806][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:27:30,808][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[]], added[[l2E0_dQzThOumbUA7ipjKA]]
[2013-04-30 17:27:30,813][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [7]
[2013-04-30 17:28:04,290][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:28:04,291][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[l2E0_dQzThOumbUA7ipjKA, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[l2E0_dQzThOumbUA7ipjKA, esVkbqnZSNiIOyWIrJ6Vfg, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[]], added[[esVkbqnZSNiIOyWIrJ6Vfg]]
[2013-04-30 17:28:04,296][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [8]
[2013-04-30 17:28:32,012][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:28:32,014][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[l2E0_dQzThOumbUA7ipjKA, esVkbqnZSNiIOyWIrJ6Vfg, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[esVkbqnZSNiIOyWIrJ6Vfg, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[l2E0_dQzThOumbUA7ipjKA]], added[[]]
[2013-04-30 17:28:32,015][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [9]
[2013-04-30 17:29:06,005][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:29:06,006][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[esVkbqnZSNiIOyWIrJ6Vfg, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[esVkbqnZSNiIOyWIrJ6Vfg]], added[[]]
[2013-04-30 17:29:06,008][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [10]
[2013-04-30 17:32:05,826][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:32:05,827][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[NeSmgy0TSKO3pttCkn8Qlg, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[]], added[[NeSmgy0TSKO3pttCkn8Qlg]]
[2013-04-30 17:32:05,832][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [11]
[2013-04-30 17:32:52,443][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:32:52,444][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[NeSmgy0TSKO3pttCkn8Qlg, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], new nodes: [[NeSmgy0TSKO3pttCkn8Qlg, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA, h-LYwWQHSbidqqa6X-2XYQ]], deleted: [[]], added[[h-LYwWQHSbidqqa6X-2XYQ]]
[2013-04-30 17:33:08,012][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:33:22,456][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [12]
[2013-04-30 17:33:22,464][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[NeSmgy0TSKO3pttCkn8Qlg, a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA, h-LYwWQHSbidqqa6X-2XYQ]], new nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA, h-LYwWQHSbidqqa6X-2XYQ]], deleted: [[NeSmgy0TSKO3pttCkn8Qlg]], added[[]]
[2013-04-30 17:33:22,466][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [13]
[2013-04-30 17:33:54,005][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Updating node list
[2013-04-30 17:33:54,006][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Current nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA, h-LYwWQHSbidqqa6X-2XYQ]], new nodes: [[a37moRArQBOHy56j92xhsw, kMgo52H1SLy_fvhOdUdQhA]], deleted: [[h-LYwWQHSbidqqa6X-2XYQ]], added[[]]
[2013-04-30 17:33:54,007][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperClusterState] [node-05_1] Publishing new cluster state version [14]
[2013-04-30 17:34:04,662][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Stopping zooKeeper client
[2013-04-30 17:34:04,662][DEBUG][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [node-05_1] Closing zooKeeper
[2013-04-30 17:34:04,664][TRACE][com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscovery] [node-05_1] Stopped zooKeeper client

Sorry, the full log has been lost since then. I hope this is of some relevance.

imotov commented 11 years ago

Strange. It looks like some nodes were appearing and disappearing intermittently from the cluster for quite some time. Did you monitor CPU and java heap size on the nodes while these issues were happening? What was going on there? Could it be the case that the cluster was simply overloaded?
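One way to quantify that flapping is to tally how often each node id shows up in the `deleted`/`added` lists of the discovery trace lines. The following is a rough Java sketch under stated assumptions: `NodeFlapCounter` is a hypothetical name, the pattern is based only on the trace format visible in the log above, and each log entry is assumed to be intact on a single line.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: counts membership changes per node id in discovery trace lines. */
final class NodeFlapCounter {
    // Matches the "deleted: [[...]], added[[...]]" tail of a "Current nodes" trace entry.
    private static final Pattern CHANGE =
            Pattern.compile("deleted: \\[\\[(.*?)\\]\\], added\\[\\[(.*?)\\]\\]");

    /** Returns nodeId -> number of times it was added or deleted across the given lines. */
    static Map<String, Integer> countChanges(Iterable<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = CHANGE.matcher(line);
            // A line may hold several entries, so scan for every match.
            while (m.find()) {
                tally(counts, m.group(1)); // deleted ids
                tally(counts, m.group(2)); // added ids
            }
        }
        return counts;
    }

    private static void tally(Map<String, Integer> counts, String idList) {
        for (String id : idList.split(",")) {
            String trimmed = id.trim();
            if (!trimmed.isEmpty()) {
                counts.merge(trimmed, 1, Integer::sum);
            }
        }
    }
}
```

A node with a high count over a short window is flapping rather than failing once. In the excerpt above, for instance, l2E0_dQzThOumbUA7ipjKA is added and deleted three times each between 17:23 and 17:28.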

mahdeto commented 11 years ago

True, it was overloaded, but not to the extent of the process itself dying or OOMing. This might be the reason why the disconnections are happening. But the issue still remains: a disconnected ZKClient means no ephemeral node and an immediate reshuffle (which makes the load problem worse).

I think waiting before initiating recovery, or retrying the check multiple times, would be an awesome feature nevertheless. What do you think?

imotov commented 11 years ago

I think the real problem here is cluster overload. Disappearing nodes are just a symptom, and the zookeeper discovery service is just the messenger. This is how it works: zookeeper detects that a node has been unresponsive for 60 seconds and kills its session; zookeeper tells the discovery service that this node disappeared; and the discovery service passes the message upstream, telling the rest of the system that the node disappeared, which in turn causes rebalancing, etc. You can increase the zookeeper session timeout from the current 60-second default to something longer using the sonian.elasticsearch.zookeeper.client.session.timeout setting, but I would suggest fixing the real issue: the overloaded cluster.
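One thing that may be worth checking before raising the timeout: a ZooKeeper server bounds the session timeout a client can negotiate, by default to between 2 and 20 times the server's tickTime (configurable through minSessionTimeout and maxSessionTimeout). The Java sketch below only illustrates that clamping. The class and method names are made up, and the tickTime of 2000 ms used in the note after it is an assumption about the server configuration, which this thread does not show.

```java
/** Illustrative only: mirrors the default bounds a ZooKeeper server applies to session timeouts. */
final class SessionTimeoutNegotiation {

    /**
     * Returns the timeout a server with default bounds would grant.
     *
     * @param requestedMs timeout requested by the client, in milliseconds
     * @param tickTimeMs  server tickTime, in milliseconds (assumed, not taken from the thread)
     */
    static int negotiated(int requestedMs, int tickTimeMs) {
        int minSessionTimeout = 2 * tickTimeMs;   // default lower bound
        int maxSessionTimeout = 20 * tickTimeMs;  // default upper bound
        return Math.max(minSessionTimeout, Math.min(maxSessionTimeout, requestedMs));
    }
}
```

Under that assumption, `negotiated(60000, 2000)` yields 40000, so a requested 60-second timeout would effectively be 40 seconds, and raising the client setting further would have no effect unless maxSessionTimeout is also raised on the servers.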