kallaspriit / Cassandra-PHP-Client-Library

Cassandra PHP-based client library for managing and querying your Cassandra cluster
http://cassandra-php-client-library.com

Multiple strange issues with long-running scripts #18

Open d0ct0rvenkman opened 12 years ago

d0ct0rvenkman commented 12 years ago

Greetings,

The discussion of my issues would likely be better suited to a mailing list or forum environment, but I can't seem to find any that are associated with the project. If they exist and I've just missed them somehow, please point me in their direction.

I've been converting a server monitoring application to utilize Cassandra, and I've been using the CPCL to bridge the gap between PHP and Cassandra. For the most part, it's been pretty painless, but I've run into some really strange issues that I haven't been able to isolate or correct. I'm still fairly new to Cassandra, so I'm not really certain whether I'm finding bugs somewhere, or whether I'm just "Doing It Wrong (tm)".

The entire application is running in a development environment on a collection of CentOS 5.8 64-bit Xen instances with the php53 (specifically: php53-5.3.3-7.el5_8) RPMs provided by CentOS. The Cassandra cluster started as 8 VMs running on a single hardware node with 2 VCPUs and 2GB RAM each, and has grown to 12 VMs running on three hardware nodes, with 2 VCPUs and 4GB of RAM each. Cassandra is installed via the DataStax RPMs (apache-cassandra1-1.0.9-1, to be specific). I've done no tweaking to the configuration other than setting the initial tokens and cluster names on each server and configuring the listen/RPC addresses. I'm using a self-compiled version of the thrift binary library built from the code provided in the CPCL. The CPCL code I'm using seems to correlate to commit 766dc14efe3731965a7da4c48faded317e8097a4 from the git repo (retrieved via kallaspriit-Cassandra-PHP-Client-Library-766dc14.zip on 2012/06/04).

My app accesses Cassandra both through short-lived HTTP-based PHP scripts and through long-lived PHP scripts that run via the command line. The problems exist only in the latter set of scripts, seemingly after a script has done a good number of large-ish operations against Cassandra. By "large-ish", I mean a get or set of a single key with 10,000-30,000 columns or so. These issues all occur within scripts that repeatedly retrieve a bunch of data from Cassandra, process it in some way, then store the processed data back into Cassandra.
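For reference, the loop in those scripts looks roughly like this (a minimal sketch: the hostnames, keyspace, and column family names are made up, and the calls follow the Cassandra::createInstance() / useKeyspace() / get() / set() interface from the CPCL README):

```php
<?php
require_once 'Cassandra.php'; // CPCL entry point

// Hostnames below are hypothetical; the real cluster has 12 nodes
$cassandra = Cassandra::createInstance(array(
    array('host' => 'cass-node-1', 'port' => 9160),
    array('host' => 'cass-node-2', 'port' => 9160),
));
$cassandra->useKeyspace('monitoring');

// Stand-in for the app's real aggregation step
function processMetrics(array $row) {
    return $row;
}

$keysToProcess = array('Key12345', 'Key23456'); // example keys

// Repeatedly: fetch a large row (10,000-30,000 columns), process it,
// then write the results back to a second column family.
foreach ($keysToProcess as $key) {
    // 'metrics.<key>' addressing: column family 'metrics', row key $key
    $row = $cassandra->get('metrics.' . $key);

    $processed = processMetrics($row);

    $cassandra->set('metrics-processed.' . $key, $processed);
}
```

So far, I'm seeing multiple distinct issues.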

1) At times, the data retrieved from Cassandra is wrong. This scenario seems to occur after the script has been running for a while (an hour or more) and many get/set operations have taken place. The "wrong" data appears in the form of data that was retrieved previously, via a get request for a different key. This problem occurs infrequently and seemingly at random. I've taken good care to make sure the variables I'm fetching data into are sanitized, and I'm catching exceptions and acting accordingly. I've been able to lessen the frequency of this problem by periodically closing the connections to Cassandra, destroying the Cassandra object, and re-initializing it (see the reconnect sketch after this list), but that doesn't prevent it completely. It makes me wonder if there is some sort of buffering issue at some level, where a request that goes to a particular Cassandra node doesn't completely clear out of a buffer and is then inadvertently re-used later on.

2) Set operations seem to hang for some unknown reason. This happens after the script has retrieved data, processed it, and is trying to put the results back into Cassandra. The set operation just hangs until I intervene in one of two ways: Ctrl-Z the running script to interrupt its execution and then issue the "fg" command to start it back up again, or attach to the hung process with strace, which unblocks the script somehow. Without one of these interventions, the script will hang indefinitely. It's not something that happens with every set operation (probably < 5% of set operations), but it happens enough to be very annoying (see the timeout sketch after this list).

3) Lots of seemingly random communication failures that manifest as exceptions like 'Failed calling "describe_keyspace" the maximum of X times' or 'Failed calling "get_slice" the maximum of X times'. I think I was able to correct this by reducing I/O contention in the Cassandra cluster, but I'm not positive yet. In my initial testing setup, I had 8 Cassandra VMs all on the same hardware node, so I/O contention was pretty high at times. Moving the VMs to different hardware nodes has seemingly fixed this, but I can't say for sure: a lot of the scripts that suffer from the two issues above are the same scripts that hit this one, and they've been shut down for the time being while I try to figure things out.
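The reconnect workaround for issue 1 looks roughly like this (a sketch: the reconnect interval is arbitrary, and closeConnections() is a hypothetical teardown call, so substitute whatever the CPCL actually exposes for closing its Thrift connections):

```php
<?php
// Periodically tear down and rebuild the client to flush any
// per-connection state that might be getting re-used across requests.
// $servers, $keysToProcess, and processMetrics() are as in the loop
// sketch above.
$opsSinceReconnect = 0;

foreach ($keysToProcess as $key) {
    $row = $cassandra->get('metrics.' . $key);
    $cassandra->set('metrics-processed.' . $key, processMetrics($row));

    if (++$opsSinceReconnect >= 500) { // interval picked arbitrarily
        $cassandra->closeConnections(); // hypothetical teardown call
        unset($cassandra);
        $cassandra = Cassandra::createInstance($servers);
        $cassandra->useKeyspace('monitoring');
        $opsSinceReconnect = 0;
    }
}
```

As for the timeout angle on issue 2: the hang behaves like a socket read blocked in a syscall with no receive timeout, which would also explain why strace or Ctrl-Z/fg unblocks it (the signal interrupts the stuck call). The Thrift PHP classes bundled with the CPCL support explicit socket timeouts, so a stuck call would fail with an exception instead of blocking forever. A sketch only, assuming the bundled thrift/ directory layout; the CPCL constructs these objects internally, so actually using this would mean patching its connection code:

```php
<?php
require_once 'thrift/transport/TSocket.php';
require_once 'thrift/transport/TFramedTransport.php';
require_once 'thrift/protocol/TBinaryProtocol.php';

$socket = new TSocket('cass-node-1', 9160); // hostname is made up
$socket->setSendTimeout(5000);  // ms before a blocked write throws
$socket->setRecvTimeout(10000); // ms before a blocked read throws

$transport = new TFramedTransport($socket);
$protocol = new TBinaryProtocolAccelerated($transport);
$transport->open();
// ...hand $protocol to a CassandraClient, or apply the same two
// timeout calls where the CPCL constructs its TSocket.
```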

Since I'm not seeing any issues whatsoever with the short-lived scripts that run via HTTP requests, my guess is that data objects are becoming "cluttered" over time as the long-running scripts do their thing, and that clutter is somehow causing the issues I'm seeing above. I've tried digging into the various classes involved, but being completely unfamiliar with the Cassandra/thrift binary protocols, I can only dig so far.

I realize I'm laying out some rather ambiguous and ill-defined issues here, so if you need specifics, please let me know what you need. Being fairly new to Cassandra, I wouldn't be at all surprised if I'm just missing something. Any insight is welcome.

drewbroadley commented 11 years ago

I can confirm I'm seeing this behavior as well.

Ubuntu 12.04 LTS, Cassandra 1.2.0, PHP 5.4.9-3~precise+1

Cassandra in a three-node cluster environment.

kallaspriit commented 11 years ago

I'm not sure how to debug this sort of thing. The first issue can come from the nature of how a distributed Cassandra database works: depending on the consistency level, a read may hit a replica to which some updates have not propagated yet.
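If it is a propagation effect, forcing quorum consistency on both reads and writes should rule it out; something along these lines (adjust the constants and arguments to the version you are running):

```php
<?php
// Rule out replica lag by pinning reads and writes to QUORUM.
// Column family and key names are just examples.
$cassandra->set(
    'metrics.' . $key,
    $columns,
    Cassandra::CONSISTENCY_QUORUM
);

$row = $cassandra->get('metrics.' . $key, Cassandra::CONSISTENCY_QUORUM);
```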

Do the Cassandra instances produce any logs? Do the second and third issues manifest themselves under low-load conditions?

d0ct0rvenkman commented 11 years ago

Unfortunately, I gave up on the CPCL a while back because of the issues above, so I can't provide much more than my recollections on the issues at this point. :\

For the first issue, I don't believe this is an issue of nodes being out of sync, or some other distributed writing behavior. The data coming back from the get request was valid and correct data, but for the wrong key. Example: In one iteration through a loop, the script gets a range of data from "Key12345". In the next iteration, it grabs data from "Key23456". In the second iteration, valid and correct data from "Key12345" was showing up in the results for "Key23456". I did every possible thing I could think of to sanitize the variable those results were being stored into, and nothing eliminated the problem. It was not consistently happening with every get (only a small percentage), and only seemed to happen when the script had been running for some time (many minutes to hours), but happened enough that the data I was processing (time series monitoring data) was visibly and obviously wrong. That's why I suspected some sort of buffer corruption somewhere in the call stack.
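In hindsight, one way to catch that kind of cross-key contamination automatically would have been to stamp each row with its own key in a sentinel column and assert it on every read. A minimal sketch, extending the loop from my original report; the '_row_key' column name is made up:

```php
<?php
// Write path: stamp each row with its own key.
$data['_row_key'] = $key;
$cassandra->set('metrics.' . $key, $data);

// Read path: assert the sentinel matches the key we asked for.
$row = $cassandra->get('metrics.' . $key);
if (isset($row['_row_key']) && $row['_row_key'] !== $key) {
    // Foreign data came back: log it, and perhaps retry on a fresh connection.
    error_log(sprintf(
        'Cross-key read: asked for "%s", got row stamped "%s"',
        $key,
        $row['_row_key']
    ));
}
```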

I have no earthly idea what would cause issue #2. If I remember correctly, I tested a number of different concurrency scenarios to see if scripts were somehow interfering with each other. It seemed that even when the overall Cassandra cluster load was low and there was only one copy of my processing script running, issues #1 and #2 would still occur.

I wouldn't be at all surprised if #3 was caused by my testing environment. Perhaps drew can elaborate on which issues he's seeing to shed some more light on the situation?

drewbroadley commented 11 years ago

@kallaspriit Unfortunately not; all I have is a stack trace (rather large) to give you insight. Let me know if you'd like that?

@d0ct0rvenkman What other library are you using? I'm unfortunately going to have to head in that direction.

d0ct0rvenkman commented 11 years ago

@drewbroadley: I started using phpcassa. It offers a very similar interface in terms of function calls, and didn't exhibit the errors described above.
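The equivalent of the get/process/set loop from my original report looks roughly like this in phpcassa (from memory, so the require paths and exact signatures may be off for the version you end up with):

```php
<?php
require_once 'phpcassa/connection.php';
require_once 'phpcassa/columnfamily.php';

// Keyspace, hostname, and column family names are the same
// hypothetical ones as in my earlier sketches.
$pool = new ConnectionPool('monitoring', array('cass-node-1:9160'));
$cf = new ColumnFamily($pool, 'metrics');

$row = $cf->get($key);                   // fetch a row by key
$cf->insert($key, processMetrics($row)); // write columns back
```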

LordStephen commented 11 years ago

I can confirm seeing issues 1 and 3 on a single-node cluster with a similar configuration.