One node down search fails

JimKerwood commented 13 years ago

It seems that if one node of the Solandra cluster is down or being bounced, the queries fail. Not sure whether a put will fail or not.

tjake commented 13 years ago

You mean reads? If you want it to work, you would need to increase the replication factor of Cassandra for the L keyspace.

JimKerwood commented 13 years ago

We don't even get that far. Either it times out after the 1024 tries, or, if I bring the node back up while it is still trying, it throws an exception with connection refused (I'm guessing because the node isn't initialized yet, even though the port is there).

tjake commented 13 years ago

You are saying you can't even start Solandra?

To change the replication factor use the supplied cassandra-cli tool:

cassandra-tool/cassandra-cli --host localhost

update keyspace L with replication_factor=2;

JimKerwood commented 13 years ago

No, here is the use case:

1) All 6 boxes running. Queries all work.
2) Bring 1 box down for maintenance. Queries now start timing out. I assume queries should continue on the running boxes.
3) Bring the box back up. Any query that is trying gets a socket timeout.
4) When all boxes are back up and running, all queries work.

I think the HTTP request is trying to hit all 6 boxes. It is failing there, not down at the Cassandra level.

Some of the stacktrace:

HTTP ERROR 500
Problem accessing /solandra/checks/select. Reason:
    org.apache.solr.client.solrj.SolrServerException: java.net.ConnectException: Connection refused

org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: java.net.ConnectException: Connection refused
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:282)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    ...
Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.ConnectException: Connection refused
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:483)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
    at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:422)

tjake commented 13 years ago

Why should queries continue when there is missing data? If you have a replication factor of 1 and you take down a box, then it should error, IMO.

I think once you turn it back on, the cluster should start working again. If that's not the case, then that's a bug and I need to fix it.

JimKerwood commented 13 years ago

Agree with the replication factor of 1.
So you are saying if I have a replication factor of 2 and I have one machine down this will not error anymore? If so I am satisfied.
Though if I set the replication factor to 2 and it still errors with one machine down I would say this should be fixed.

tjake commented 13 years ago

Correct. If you change the replication factor and repair the nodes using cassandra-tools/nodetool -h localhost repair L on each node, then it will work.

JimKerwood commented 13 years ago

Even after changing the replication factor and repairing, the problem still exists. If a node is down, all the other nodes wait (and time out if left long enough).

davidstrauss commented 13 years ago

@JimKerwood Can you reproduce this issue with a fresh cluster set to RF=2 before you set any schemas or index any data?

tjake commented 13 years ago

I misspoke. RF=2 is tricky because a quorum is 2, so every replica has to be up. Quorum is used internally for document ID and shard tracking.

RF=3 should work.
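For anyone following the arithmetic behind this, here is a minimal sketch. It assumes the usual Cassandra quorum rule of floor(RF / 2) + 1 replicas; the function names are purely illustrative and are not part of Solandra or Cassandra.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a QUORUM operation: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


if __name__ == "__main__":
    # Compare the replication factors discussed in this thread.
    for rf in (1, 2, 3):
        print(f"RF={rf}: quorum={quorum(rf)}, "
              f"tolerated failures={tolerated_failures(rf)}")
```

Under that rule, RF=2 tolerates no more replica failures than RF=1 for quorum operations, while RF=3 tolerates one. It says nothing about whether the cluster actually has enough live nodes holding replicas.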

topoqdm commented 12 years ago

Hi

I have two nodes running, set replication_factor:3 and ran the repair tool on the L keyspace. When one of the nodes goes down, search fails on the remaining node.

I get this exception:

read command failed after 1024attempts

java.io.IOException: Read command failed after 1024attempts
    at lucandra.CassandraUtils.robustRead(CassandraUtils.java:625)
    at lucandra.CassandraUtils.robustRead(CassandraUtils.java:634)
    at solandra.SolandraComponent.flushCache(SolandraComponent.java:67)
    at solandra.SolandraComponent.prepare(SolandraComponent.java:115)
    at solandra.SolandraQueryComponent.prepare(SolandraQueryComponent.java:45)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at solandra.SolandraDispatchFilter.execute(SolandraDispatchFilter.java:171)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at solandra.SolandraDispatchFilter.doFilter(SolandraDispatchFilter.java:137)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

request: http://192.168.1.99:8983/solandra/reuters~0/select
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)

I also tried changing solandra.consistency from QUORUM to ONE in solandra.properties, but this didn't help.

Any ideas how to fix this, or am I doing something wrong?

ghost commented 11 years ago

Hi Jake, I tried replication factors 2 and 3 and the problem persists: once a node is down, you cannot make any request to any of the other live nodes.

Thanks
