Closed: adstage-david closed this issue 10 years ago.
Thanks for testing the compression feature, it really helps shake down issues.
It sounds odd that you get empty results back when compression is enabled. Maybe there's an error that gets swallowed somewhere? Could you try adding a p data to the #receive_data method of the Cql::Protocol::CqlProtocolHandler class? (I might have gotten the exact name of the method wrong, I'm on my phone.) Then post a frame that should have contained cells but didn't. That would help figure out whether the right data is coming from Cassandra or not.
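Something like this is what I mean (a rough sketch; the class and method names are my best guess from memory, so adjust as needed):

```ruby
require 'cql'

module Cql
  module Protocol
    class CqlProtocolHandler
      alias_method :original_receive_data, :receive_data

      # Dump every chunk of raw bytes from the socket before the normal
      # frame decoding runs (this assumes #receive_data is the entry
      # point for incoming data, which I'm not 100% sure about).
      def receive_data(data)
        p data
        original_receive_data(data)
      end
    end
  end
end
```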
You could also try to debug print the data that goes through the compressor, maybe there are clues there. If you've got time you could also try to trace the call all the way through and see if there's an error being raised somewhere (come to think of it, there might be a shortcut: JRuby has an option that makes it print all generated stack traces, which could at least rule out swallowed errors).
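For the compressor side, a wrapper like this could log everything that goes through it (again a sketch; I'm assuming the compressor protocol is #algorithm, #compress?, #compress and #decompress):

```ruby
require 'cql'
require 'cql/compression/snappy_compressor'

# Hypothetical logging wrapper around the real compressor, so every
# decompressed frame can be inspected before the driver decodes it.
class LoggingCompressor
  def initialize(compressor)
    @compressor = compressor
  end

  def algorithm
    @compressor.algorithm
  end

  def compress?(frame)
    @compressor.compress?(frame)
  end

  def compress(frame)
    @compressor.compress(frame)
  end

  def decompress(frame)
    decompressed = @compressor.decompress(frame)
    # Log sizes and the raw bytes of each decompressed frame
    $stderr.puts("decompress: #{frame.bytesize} -> #{decompressed.bytesize}")
    $stderr.puts(decompressed.inspect)
    decompressed
  end
end

client = Cql::Client.connect(
  hosts: %w[127.0.0.1],
  compressor: LoggingCompressor.new(Cql::Compression::SnappyCompressor.new)
)
```

For the JRuby option, I believe it's something like jruby -Xlog.exceptions=true (there's also -Xlog.backtraces=true); jruby --properties should list the exact flag names for your version.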
I will need something more to go on, I’m in the dark here.
On 8 Jan 2014, at 22:26, David Haslem (notifications@github.com) wrote:
I'm using the latest prerelease (1.2.0.pre0) to try out compression and I'm running into an interesting problem - sometimes cql-rb returns empty results for rows that should exist. I have a column family (folders) configured with replication factor 3, and I'm doing a select on the row key for that column family (folder_id), then looking in another column family (folder_buckets - also replication factor of 3) for a number of rows (row key: folder_id + shard_id). My cluster has 4 nodes running on EC2.
I'm finding that when I have the compressor enabled, most of my requests to a single folder fail (50 repeated tests to load the same folder + buckets), either failing to fetch the initial folder, or dropping some of the bucket shards. This folder and the associated bucket rows were all created within the last hour or so.
Removing the compressor allows 100% of the requests to succeed. Restarting with the compressor enabled, they start mostly failing again.
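For reference, my setup looks roughly like this (hosts, keyspace and query are placeholders; exact option forms may differ by cql-rb version):

```ruby
require 'cql'
require 'cql/compression/snappy_compressor'

# Connect with Snappy frame compression, as described in the cql-rb
# README. The host names stand in for the four EC2 nodes.
compressor = Cql::Compression::SnappyCompressor.new
client = Cql::Client.connect(
  hosts: %w[node1 node2 node3 node4],
  compressor: compressor
)
client.use('my_keyspace')

# Reads (and the earlier writes) all run at consistency :one
folder = client.execute("SELECT * FROM folders WHERE folder_id = 'abc123'", consistency: :one)
```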
I'm thinking there might be some consistency issue here; all of my requests are going out with consistency of :one (as did all of the populating writes). I was going to report this bug last night when I first encountered it, but then it mysteriously went away while I was running a few more tests to confirm.
Ruby: jruby-1.7.5
C*: 1.2.12.2
cql-rb: 1.2.0.pre0
Trying to find an isolated test case, but I'm not sure if I'll be able to, given the transient nature of the bug. As I wrote this issue, I decided to try running repairs on the affected column families to see if it'd help. That appears to have fixed the issue for me; hopefully that tells us something useful?
Attempted to reproduce a few times, but couldn't get it to happen anymore. I'm going to close this since I can't even find a way to reliably reproduce it.
Ok, it was a bit weird that repairs fixed it. Next time it happens, try to trap a few frames. You could also try enabling tracing and see if there's a difference between compressed and non-compressed queries, for example in how many cells are returned from each node.
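Trapping a trace could look something like this, if your version supports the trace option (a sketch; I'm assuming QueryResult#trace_id here, so check against your cql-rb version):

```ruby
# Request tracing for a query, then look the trace up in the
# system_traces keyspace that Cassandra writes trace data to.
result = client.execute("SELECT * FROM folders WHERE folder_id = 'abc123'", trace: true)

if result.trace_id
  events = client.execute(
    "SELECT activity, source FROM system_traces.events WHERE session_id = #{result.trace_id}"
  )
  # Shows which node performed each step, useful for comparing
  # compressed vs non-compressed runs of the same query
  events.each { |event| puts "#{event['source']}: #{event['activity']}" }
end
```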
If you're still using v1.2.0.pre0 you should update. I found and fixed a bug in the frame decoding (it shouldn't have anything to do with this issue; it's just that you're the only one I know of who's trying v1.2.0).
Awesome. Thanks for the update, and for your hard work building out features for this driver!
I think I may have found the actual cause of this: there might be some phantom nodes that the initial connection gossip is adding to the host list. I have all nodes listed up front, so no extra nodes should be discovered via gossip, but sometimes I get some random ones without IP addresses showing up as also connected. I think this might be a config issue on our side? We tried scaling up our cluster a bit and then removed some nodes.
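One way I've been checking for those phantom entries is querying system.peers directly on each node, since I gather that's where peer discovery reads from; entries with a null rpc_address look like the culprits. Roughly:

```ruby
# Inspect the peers table the driver uses for host discovery.
# Peers with a missing rpc_address are likely stale entries left
# over from decommissioned nodes.
peers = client.execute('SELECT peer, rpc_address, host_id FROM system.peers')
peers.each do |row|
  flag = row['rpc_address'].nil? ? ' <- phantom?' : ''
  puts "#{row['peer']} rpc_address=#{row['rpc_address'].inspect}#{flag}"
end
```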
I don't think the phantom nodes explains it. The repair could explain it, but even that is odd. Even if the phantom nodes turn up in a peer discovery the driver won't be able to connect to them, and they shouldn't count towards consistency levels. Unless the nodes aren't actually down.
On the other hand the combination of adding nodes and then removing nodes without doing repairs could definitely explain how you'd get different results for the same query at different times (although you should have gotten it both with and without compression, not just with compression).
Sometimes it shows that it successfully connects, and my read consistency is set to :one on a lot of queries, so if a phantom node somehow ended up receiving a portion of the requests and returning empty data, I think that could explain it, right?
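An easy way to test that theory might be bumping reads to :quorum and seeing whether the empty results disappear (same placeholder query as above):

```ruby
# A :one read can be served by a single replica that has no data;
# a :quorum read needs 2 of 3 replicas to agree, which should mask
# a phantom or empty replica if that's really the cause.
folder = client.execute("SELECT * FROM folders WHERE folder_id = 'abc123'", consistency: :quorum)
```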
The correlation with compression was more likely just that I have to reboot every time I turn it on or off, and as near as I can tell there's only about a 25% chance it finds these phantom nodes at boot.
I didn't realize I could turn on logging until I went digging around in the client class. You might want to add that to the readme as a troubleshooting option.
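For anyone else who lands here, the option looks roughly like this (my best guess from reading the client class; check your version for the exact option name):

```ruby
require 'logger'
require 'cql'

# Pass a standard Logger to get connection and request logging on
# stderr while troubleshooting.
client = Cql::Client.connect(
  hosts: %w[node1 node2 node3 node4],
  logger: Logger.new($stderr)
)
```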
Are your phantom nodes still running? I assumed that they'd been decommissioned. If it's the case that the nodes are running but are not part of the main cluster, then things could get really, really weird.
Good point about the logging, I should add that to the readme.