iconara / cql-rb

Cassandra CQL 3 binary protocol driver for Ruby

cql-rb discovers a removed node and fails to connect to cluster! #88

Closed thedebugger closed 10 years ago

thedebugger commented 10 years ago

I recently removed a node from our Cassandra cluster, and now when I try to connect my app, which uses cql-rb, to the cluster it fails to connect. It discovers the removed node (I'm not sure how) and throws an unhandled "connection refused" exception for that node, even though it is able to connect to the other nodes.

I've checked on every Cassandra node in the cluster that the removed node is not present by running "nodetool status". I've also checked that I'm not passing this host in the hosts array when connecting with cql-rb. Another app that uses the DataStax Java driver is able to connect and seems to be working fine.

Environment:

I'm planning on upgrading to 2.0.6, as 2.0.1 has quite a lot of bugs. Our Cassandra cluster isn't in a happy state either - a couple of nodes see each other as DOWN. I'm hoping that will get fixed once I upgrade.

So I've a couple of questions.

Let me know if you need more details.

thedebugger commented 10 years ago

Sorry, I was incorrect; cql-rb is able to connect to the cluster just fine. The error is logged as a WARN and the driver still connects to the cluster. The app was failing for a different reason. Anyway, if you have spare time I'd like to know how discovery works.

iconara commented 10 years ago

Sounds like you're running into CASSANDRA-6053. Upgrade to 2.0.6 and it will go away. You can also scrub the system.peers table manually if you're feeling daring.
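For reference, here is a minimal sketch of what a manual scrub would have to determine. The helper name is mine, not part of cql-rb or Cassandra; the real comparison would be between rows in system.peers and the output of "nodetool status":

```ruby
# Hypothetical helper: given the peer addresses recorded in system.peers
# and the addresses that "nodetool status" actually reports as cluster
# members, return the stale entries a manual scrub would delete.
def stale_peers(system_peers, live_nodes)
  system_peers - live_nodes
end

# Example: one ghost entry left behind (the CASSANDRA-6053 symptom).
peers = ['10.0.0.1', '10.0.0.2', '10.0.0.9']
live  = ['10.0.0.1', '10.0.0.2']
stale_peers(peers, live)  # => ["10.0.0.9"]
```

Each stale address would then correspond to a DELETE from system.peers on every remaining node, which is why upgrading to 2.0.6 is the safer fix.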

iconara commented 10 years ago

The connection and peer discovery flow looks like this:

It's by far the most complicated part of the whole driver. There are lots of things going on, all in parallel and asynchronously.
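The flow described above can be sketched roughly like this. The names and structure are my own, not cql-rb internals; the `fetch_peers` callable stands in for querying `SELECT peer FROM system.peers` over an open connection, so the sketch runs without a cluster:

```ruby
# Hedged sketch of peer discovery: connect to the contact points, ask
# each reachable node for its view of system.peers, and merge the
# results into one known-host set.
def discover_hosts(contact_points, fetch_peers)
  known = contact_points.dup
  contact_points.each do |host|
    peers = fetch_peers.call(host) rescue next  # skip unreachable nodes
    known |= peers                              # set-union, no duplicates
  end
  known
end

# Simulated cluster: both seeds know about a third node not in the
# contact-point list, so it gets discovered.
peer_table = {
  'a' => ['b', 'c'],
  'b' => ['a', 'c'],
}
discover_hosts(['a', 'b'], ->(h) { peer_table.fetch(h) })
# => ["a", "b", "c"]
```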

In addition, this is how the driver manages to stay up when nodes go down:

This mechanism makes it possible (with some application error handling) to have the application stay up during a rolling cluster restart. I've upgraded a four node C* cluster while a distributed application was sending tens of thousands of operations per second to it. It's like changing the engines on a plane, in flight.
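The "stay up through a rolling restart" behaviour described above boils down to retrying lost connections with backoff instead of failing. A toy sketch, assuming an injected `connect` callable so it runs without a cluster; the backoff numbers are illustrative, not the driver's actual settings:

```ruby
# Toy reconnect loop: keep retrying with exponential backoff until the
# node accepts connections again, then hand back the connection.
def reconnect(connect, max_attempts: 5, base_delay: 0.01)
  delay = base_delay
  max_attempts.times do
    begin
      return connect.call
    rescue StandardError
      sleep(delay)
      delay *= 2  # back off so a restarting node isn't hammered
    end
  end
  raise 'node did not come back within the retry budget'
end

# Simulate a node that is down for the first two attempts, as during a
# rolling restart.
attempts = 0
flaky = -> do
  attempts += 1
  raise 'connection refused' if attempts < 3
  :connected
end
reconnect(flaky)  # => :connected, after two failed attempts
```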

thedebugger commented 10 years ago

Awesome, thanks for writing it down. I think I'll do the upgrade rather than changing that table.