iconara / cql-rb

Cassandra CQL 3 binary protocol driver for Ruby

Getting inconsistent results #95

Closed: gsiglet closed this issue 10 years ago

gsiglet commented 10 years ago

Good afternoon,

I am receiving inconsistent results when I execute the following piece of code multiple times with the same input:

require 'cql'

myid = ARGV[0]
client = Cql::Client.connect(hosts: ['host1', 'host2'], default_consistency: :all)
client.use('mykeyspace')
rows = client.execute("SELECT id FROM mytable WHERE id='#{myid}'")
rows.each do |row|
  puts "Found #{row['id']}"
end

Sometimes the id is found, other times not, although I use consistency "all". Querying from cqlsh I always get consistent results. I also tried the Perl driver and got consistent results. Any idea what might be wrong?

Using cql-rb 1.2.1, Cassandra 2.0.7

Thank you, George

iconara commented 10 years ago

The consistency level shouldn't really matter unless you read immediately after you write, or you have a cluster that has experienced a long partition where hints were dropped. Once the value has been replicated (which takes milliseconds at the most under normal circumstances) any consistency level should give you the value.

There is no way for me to reproduce your problem. If there were a general problem with the driver where values were lost, the tests would have caught it, and other people would have reported it long ago, so it must be something specific to your setup.

The only thing I can think of right now is that your CQL string is something other than what you expect. Either there is something subtly wrong with the query, or sometimes you get garbage in the myid variable (a space, a non-printable character, a wrong encoding, something).

Print out the CQL before each request:

query = "SELECT id FROM mytable WHERE id='#{myid}'"
p query
rows = client.execute(query)

Maybe even print out the result to rule out something there:

rows = client.execute("SELECT id FROM mytable WHERE id='#{myid}'")
p rows.to_a

You could also try a prepared statement, just to rule out that your query string is getting mangled somehow:

statement = client.prepare('SELECT id FROM mytable WHERE id = ?')
rows = statement.execute(myid)
rows.each do |row|
  puts "Found #{row['id']}"
end
gsiglet commented 10 years ago

I tried the prepared statement, but still the same inconsistency. Could it be some timeout issue? I understand that it is difficult to reproduce this. I was wondering if anybody else had the same problem. Not sure what might be wrong with my setup. I am running a cluster repair now and will see if the problem persists afterwards. Thank you

iconara commented 10 years ago

You could also try to trace the request and see what Cassandra is doing:

result = client.execute(..., trace: true)
puts result.trace_id

Then use the trace ID to load the relevant lines from the system_traces keyspace.
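Something along these lines should pull the trace out (a rough sketch, reusing the client and result from above; session_id is the partition key of both tables):

trace_id = result.trace_id
# sessions has one row per traced request
session = client.execute("SELECT * FROM system_traces.sessions WHERE session_id = #{trace_id}")
p session.to_a
# events has one row per step the coordinator and replicas performed
events = client.execute("SELECT activity, source, source_elapsed FROM system_traces.events WHERE session_id = #{trace_id}")
events.each do |row|
  puts "#{row['source']} +#{row['source_elapsed']}us: #{row['activity']}"
end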

gsiglet commented 10 years ago

I did the tracing and then queried the events and sessions tables in the system_traces keyspace.

Whenever my id is found there are entries in these tables.

Whenever not, there is nothing logged.

iconara commented 10 years ago

But even when you don't get a value back there is a trace ID in result.trace_id? If there is then there's something very strange going on, and if there isn't, that's also very strange.

Another wild question: are you sure that the two hosts you pass into Client.connect are part of the same cluster?

gsiglet commented 10 years ago

Yes, I always get a value for result.trace_id, but there is nothing in system_traces unless my id is found.

Yes, the two hosts are part of the same cluster. I have also tried the Perl and Python drivers and I get consistent results with the same input. This is indeed a bit weird.

iconara commented 10 years ago

I really don't understand how you can get a trace ID without the trace being in the database. The only next step I can think of is starting to inspect the raw network traffic between your client and your cluster.

gsiglet commented 10 years ago

I think I start to understand what happened. I tried recently to add a third node to my production cluster. Due to a hard disk problem I had to stop the process and remove the node, while it was in JOIN mode. Later on I added this node to another (staging) cluster.

The traces for the missing result.trace_id values are logged there now, although that node is now part of another cluster (!). How did this happen? This node is not in my production cluster and it does not appear anywhere (nodetool status/ring/gossipinfo).

I am currently repairing the cluster, not sure if this fixes the problem.

Thanks for your help

iconara commented 10 years ago

The old node is most likely still in the system.peers table of the remaining nodes. This makes cql-rb connect to it when you give it the other two as seeds.
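You can check it directly from the driver, something like this (a sketch, reusing the client from your script; note that system.peers is node-local, so each node has its own copy and running the query from cqlsh against each node individually is the most reliable way to inspect it):

# each node lists every other node it knows about in system.peers
rows = client.execute('SELECT peer, rpc_address, host_id FROM system.peers')
rows.each do |row|
  puts "peer: #{row['peer']}, rpc_address: #{row['rpc_address']}, host_id: #{row['host_id']}"
end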

If you run nodetool gossipinfo (I think that’s the command, I’m writing on my phone so I can’t look it up) you should see the old node.

The reason why the other tools and drivers did not give you inconsistencies is probably because they don’t do peer discovery by default.

Make sure you properly decommission the node (since you've already torn it down you will need to run nodetool removenode; again, I'm not certain of the exact command name).

Good that you found the root cause and that we have an explanation.


gsiglet commented 10 years ago

The node didn't appear in gossipinfo, but I removed it from the peers table anyway and now everything works fine. Thanks a lot!