solariumphp / solarium-cloud

solarium extension to connect to SolrCloud via Zookeeper
BSD 2-Clause "Simplified" License

Failover when zookeeper instance is down #3

Open dumityty opened 6 years ago

dumityty commented 6 years ago

This is more of a general question about failover using solarium cloud connecting to a few zookeeper instances.

I have configured 3 Zookeeper servers with 3 shards and 3 replicas. Everything is working ok and I am able to connect to them.

My solarium cloud config is the following:

[
  'zkhosts' => 'HOST1:2181,HOST2:2181,HOST3:2181',
  'defaultcollection' => 'COLLECTION_NAME',
]

I am able to use solarium cloud and connect ok, perform queries, etc.

But after finally finishing configuring everything and connecting, I decided to test what would happen if one of my instances were to actually go down, which is the reason for using SolrCloud in the first place.

I have tried the following scenarios: stop the server altogether, stop Zookeeper on the server, stop Solr on the server but keep Zookeeper running.

And I got to the following conclusions:

  1. If the server itself is completely down and solarium cloud happens to choose that host to direct the query to, then I get an "operation timeout" exception. I assume this is because port 2181 is not reachable at all, so the timeout limit kicks in.

  2. If I stop zookeeper on the server and solarium cloud sends the request to that host, then I get "connection loss". I assume this is because port 2181 is reachable but the service is not running at all, so the connection is not established?

  3. If zookeeper is running but I stop Solr on the server, then everything works fine - if solarium cloud sends a request to that host then zookeeper figures out that solr is down and directs the query to another instance which is up - so everything works fine in this case.

My question is whether it's actually possible to get it to fail over correctly to the live instances in the first two scenarios? Or am I approaching this the wrong way? Or is it meant to behave that way, and I should have failover at a different step?

Would a correct/possible solution be to stick a load balancer in front of the 3 zookeeper instances, and have the health check on port 2181 and if one of the zookeepers is not answering then don't direct any requests to it? In that case my "zkhosts" would be "load_balancer_host:2181"
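For illustration, the config with a load balancer in front would then look roughly like this. `load_balancer_host` is only a placeholder for whatever address the balancer ends up with:

```php
[
  // Placeholder hostname: the balancer forwards to HOST1, HOST2 and HOST3 on port 2181.
  'zkhosts' => 'load_balancer_host:2181',
  'defaultcollection' => 'COLLECTION_NAME',
]
```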

Not quite sure whether this question is suitable for this issue queue, or whether I should post it on Stack Overflow instead?

Thanks!

jsteggink commented 6 years ago

Hi, Thanks for your feedback!

So how it should work is as follows:

  1. If you have multiple Zookeepers, it should round-robin the servers. If one fails, it should be taken out of the list. However, when it is live again it should be added back to the list. This is something that I haven't built yet.
  2. I don't really get what you mean with this one. You stop Zookeeper service? Okay, so that means nothing is listening on port 2181. This is exactly the same as turning off the server. Or do you mean something else?
  3. I'm glad this works.
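The behaviour described in point 1 is not in the library yet. Purely as a sketch of the idea (round-robin, drop a failed host, let it back in after a cooldown), something like the following could work. `ZkHostPool`, `markFailed`, `next` and the 30-second cooldown are hypothetical names and values, not part of solarium-cloud, and the typed properties assume PHP 7.4 or later:

```php
<?php
declare(strict_types=1);

// Hypothetical helper, NOT part of solarium-cloud: a round-robin pool of
// ZooKeeper hosts that skips failed hosts and retries them after a cooldown.
final class ZkHostPool
{
    /** @var string[] */
    private array $hosts;
    /** @var array<string,int> host => Unix timestamp of its last failure */
    private array $failedAt = [];
    private int $cursor = 0;
    private int $retryAfter;

    public function __construct(string $zkhosts, int $retryAfter = 30)
    {
        // Accepts the same comma-separated format as the 'zkhosts' option.
        $this->hosts = array_values(array_filter(array_map('trim', explode(',', $zkhosts))));
        $this->retryAfter = $retryAfter;
    }

    // Take a host out of rotation; $now is injectable to make testing easier.
    public function markFailed(string $host, ?int $now = null): void
    {
        $this->failedAt[$host] = $now ?? time();
    }

    // Return the next usable host in round-robin order, or null if none is usable.
    public function next(?int $now = null): ?string
    {
        $now = $now ?? time();
        $count = count($this->hosts);
        for ($i = 0; $i < $count; $i++) {
            $host = $this->hosts[$this->cursor];
            $this->cursor = ($this->cursor + 1) % $count;
            $failed = $this->failedAt[$host] ?? null;
            if ($failed === null || $now - $failed >= $this->retryAfter) {
                // Cooldown is over (or the host never failed): back in the list.
                unset($this->failedAt[$host]);
                return $host;
            }
        }
        return null; // every host is still inside its cooldown window
    }
}

$pool = new ZkHostPool('HOST1:2181,HOST2:2181,HOST3:2181');
$pool->markFailed('HOST2:2181'); // e.g. after an "operation timeout" or "connection loss"
$first = $pool->next();          // HOST1:2181
$second = $pool->next();         // HOST3:2181, HOST2 is skipped during its cooldown
```

The caller would still have to catch the connection error and call `markFailed` itself. A host that is retried after the cooldown and fails again simply gets marked once more.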

Your solution of putting a load balancer in front of Zookeeper is what most people do. Usually it's good enough to just have a health check as you describe. Here's an example using HAProxy as a load balancer: https://community.hortonworks.com/articles/139439/load-balance-zookeeper-using-haproxy.html

I'm very busy at the moment with lots of different projects, so it might take a while before I get to making a software load balancer for ZK. But I'll keep you updated if there's any progress. I wouldn't expect it before the end of this month. Just so you know.

dumityty commented 6 years ago

Thanks for the quick reply! It was more a question to make sure I understand what is happening and that what I am seeing is correct and as expected. Yes you are right, 1 and 2 are technically the same (1 the server itself is down, 2 the zookeeper service is stopped which technically is the same thing as 1) - but I was just surprised to see two different responses.

Good to know that 1&2 work as expected since the load balancer is not implemented yet as you say. I will have a look at the link you posted and probably end up load balancing the ZK hosts myself.

No worries if you don't have time and thanks for keeping me updated. (I noticed that the Solarium package itself has a LoadBalancer plugin so that could be a starting point :) )