Open JimKerwood opened 13 years ago
You mean reads? If you want it to work you would need to increase the replication factor of Cassandra for the L keyspace.
We don't even get that far. Either it times out after the 1024 tries or, if I bring the node back up while it is still trying, it throws an exception with connection refused (I'm guessing because the node isn't initialized yet, even though the port is there).
You are saying you can't even start solandra?
To change the replication factor use the supplied cassandra-cli tool:
cassandra-tool/cassandra-cli --host localhost
update keyspace L with replication_factor=2;
No, here is the use case:
1) All 6 boxes running. Queries all work.
2) Bring 1 box down for maintenance. Queries now start timing out. I assume queries should continue on the running boxes.
3) Bring the box back up. Any query attempted at this point gets a socket timeout.
4) When all boxes are back up and running, all queries work.
I think the HTTP request is trying to hit all 6 boxes. It is failing there not even down at the Cassandra level.
Some of the stacktrace:
HTTP ERROR 500
Problem accessing /solandra/checks/select. Reason:
    org.apache.solr.client.solrj.SolrServerException: java.net.ConnectException: Connection refused
org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: java.net.ConnectException: Connection refused
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:282)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    ...
Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.ConnectException: Connection refused
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:483)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
    at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:422)
Why should queries continue when there is missing data? If you have a replication factor of 1 and you take down a box then it should error IMO.
I think once you turn it back on the cluster should start working again. If that's not the case then that's a bug and I need to fix it.
Agree with the replication factor of 1.
So you are saying if I have a replication factor of 2 and I have one machine down this will not error anymore? If so I am satisfied.
Though if I set the replication factor to 2 and it still errors with one machine down I would say this should be fixed.
Correct. If you change the replication factor and repair the nodes using cassandra-tools/nodetool -h localhost repair L on each node then it will work.
Even after changing the replication factor and repairing, the problem still exists. If a node is down, all other nodes wait (and time out if left long enough).
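For anyone following along, the "repair on each node" step can be sketched as a small Python helper that only builds the command lines and prints them. This is a sketch under assumptions: the host names are hypothetical placeholders, the tool path is assumed to be cassandra-tools/nodetool, and it assumes nodetool can reach each node with -h rather than being run locally on each box as described above.

```python
import shlex

def repair_commands(hosts, keyspace="L", nodetool="cassandra-tools/nodetool"):
    """Build one 'nodetool repair' command line per host (nothing is executed)."""
    return [[nodetool, "-h", host, "repair", keyspace] for host in hosts]

# Hypothetical host names; replace with the real nodes of the cluster.
hosts = ["node1", "node2", "node3"]

for cmd in repair_commands(hosts):
    # Print the command so it can be reviewed before running it by hand.
    print(shlex.join(cmd))
```

Printing instead of executing keeps the sketch safe to run anywhere; each printed line can be pasted into a shell on a machine that has the tool installed.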
@JimKerwood Can you reproduce this issue with a fresh cluster set to RF=2 before you set any schemas or index any data?
I misspoke. Rf=2 is tricky because a quorum is 2. Quorum is used internally for document Id and shard tracking.
Rf=3 should work
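The quorum arithmetic behind this can be sketched in a few lines of Python. This assumes the usual Cassandra definition of a quorum, floor(RF / 2) + 1; the function names are made up for illustration and are not part of Solandra.

```python
def quorum(replication_factor):
    """Replicas that must respond for a QUORUM operation: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def tolerated_failures(replication_factor):
    """Replicas that can be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)

for rf in (1, 2, 3):
    print(f"RF={rf}: quorum={quorum(rf)}, replicas that may be down={tolerated_failures(rf)}")
```

With RF=2 the quorum is 2, so no replica may be down; RF=3 is the smallest setting where a quorum survives the loss of one replica.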
Hi
I have two nodes running. I set replication_factor:3 and ran the repair tool on the L keyspace. When one of the nodes goes down, search fails on the remaining node.
I get this exception:
read command failed after 1024attempts
java.io.IOException: Read command failed after 1024attempts
    at lucandra.CassandraUtils.robustRead(CassandraUtils.java:625)
    at lucandra.CassandraUtils.robustRead(CassandraUtils.java:634)
    at solandra.SolandraComponent.flushCache(SolandraComponent.java:67)
    at solandra.SolandraComponent.prepare(SolandraComponent.java:115)
    at solandra.SolandraQueryComponent.prepare(SolandraQueryComponent.java:45)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at solandra.SolandraDispatchFilter.execute(SolandraDispatchFilter.java:171)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at solandra.SolandraDispatchFilter.doFilter(SolandraDispatchFilter.java:137)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
request: http://192.168.1.99:8983/solandra/reuters~0/select
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
I also tried changing solandra.consistency from QUORUM to ONE in solandra.properties, but this didn't help.
Any ideas how to fix this, or am I doing something wrong?
Hi Jake, I tried replication factors 2 and 3 and the problem persists: once a node is down, you cannot make any request to any of the other live nodes.
Thanks
It seems that if one node of the Solandra cluster is down or being bounced, queries fail. Not sure whether a put will fail or not.