mjpearson / Pandra

Cassandra abstraction layer and keyspace scaffolder for PHP developers --- ABANDONED.
GNU Lesser General Public License v3.0
93 stars 11 forks source link

Suggestion - use built in thrift protocol load balancer for connectivity rather than tracking your own #13

Closed redsolar closed 14 years ago

redsolar commented 14 years ago

I noticed you have a pretty sophisticated tracker of up/down hosts for multi-host setups, with active/round robin/random support.

In reality, most people will likely want a true random approach (seldom r/r and almost never active) given the eventually consistent nature of Cassandra. It's up to you of course, but dropping the custom tracker simplifies code maintenance a lot and improves readability.

I am in the middle of writing a highly optimized, performance-oriented read/write Cassandra CRUD for our needs, and noticed that Thrift supports internal randomized (or r/r, but not "active") load balancing without much ado. In addition, it does a very good job of downed host detection using APC as an intermediary.

All you need to do is use a TSocketPool object instead of TSocket during cassandra object initialization.

So in this case, $transport = new TBufferedTransport(new TSocket($host, $port), 1024, 1024); will need to be replaced with $transport = new TBufferedTransport(new TSocketPool($hosts, $port), 1024, 1024); where $hosts is an array of hostnames/IPs. $port can also be an array, but Thrift expects either a 1:1 host->port relationship or a single unified port. So if you have 5 hosts, pass either a single port (such as the default 9160) or an array of 5 ports; otherwise things may not work as expected.
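The host/port rule above can be sketched in plain PHP. Note that pairHostsWithPorts is a hypothetical helper written for illustration, not part of Thrift; it only shows the two accepted shapes (one shared port, or one port per host) and rejects anything else.

```php
<?php
/**
 * Hypothetical helper (not part of Thrift): expands a single port to every
 * host, or checks that an array of ports matches the hosts one-to-one.
 */
function pairHostsWithPorts(array $hosts, $ports)
{
    $hosts = array_values($hosts);

    if (!is_array($ports)) {
        // Single unified port (e.g. 9160): reuse it for every host.
        $ports = array_fill(0, count($hosts), $ports);
    } else {
        $ports = array_values($ports);
    }

    if (count($ports) !== count($hosts)) {
        throw new InvalidArgumentException('Expected one port per host');
    }

    $servers = array();
    foreach ($hosts as $i => $host) {
        $servers[] = array('host' => $host, 'port' => $ports[$i]);
    }
    return $servers;
}
```

Calling pairHostsWithPorts(array('10.0.0.1', '10.0.0.2'), 9160) yields two host/port pairs that share port 9160, while passing five hosts with three ports throws instead of silently misbehaving.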

TSocket/TSocketPool also seems to track open()/isOpen() internally, so you probably don't need to track that yourself either.

If you want a round-robin approach, you can get it by calling the setRandomize(false) method of TSocketPool.
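To make the difference between the two modes concrete, here is a minimal sketch of random versus round-robin selection with down-host skipping. HostPicker, markDown() and pick() are hypothetical names, and this is not TSocketPool's actual implementation; in particular, down hosts live in a plain in-memory array here rather than in APC.

```php
<?php
/**
 * Illustrative only: a hypothetical picker, not Thrift's TSocketPool code.
 * Down hosts are kept in a plain array here instead of APC.
 */
class HostPicker
{
    private $hosts;
    private $randomize;
    private $cursor = 0;
    private $down = array();

    public function __construct(array $hosts, $randomize = true)
    {
        $this->hosts = array_values($hosts);
        $this->randomize = $randomize;
    }

    /** Record a host as unavailable so pick() skips it. */
    public function markDown($host)
    {
        $this->down[$host] = true;
    }

    /** Return the next host to try, ignoring hosts marked down. */
    public function pick()
    {
        $candidates = array();
        foreach ($this->hosts as $host) {
            if (!isset($this->down[$host])) {
                $candidates[] = $host;
            }
        }
        if (count($candidates) === 0) {
            throw new RuntimeException('All hosts are marked down');
        }
        if ($this->randomize) {
            // Random mode: any live host, chosen uniformly.
            return $candidates[mt_rand(0, count($candidates) - 1)];
        }
        // Round robin: advance a cursor over the live hosts.
        $host = $candidates[$this->cursor % count($candidates)];
        $this->cursor++;
        return $host;
    }
}
```

With new HostPicker(array('a', 'b', 'c'), false), successive pick() calls cycle through a, b, c and back to a; with the default randomize flag each call returns any live host.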

See TSocket.php and TSocketPool.php for more options too.

mjpearson commented 14 years ago

Thanks for the well-thought-out feedback - it's really appreciated :) In terms of design decisions, the Pandra::getClient() function is a code stub which is to be developed for 0.2, as I see not being able to select a specific node for read/write (ie: an active node) in TSocketPool as a fairly significant limitation.

The major issue I had with TSocketPool as it stands is from an app development perspective: in a random access arrangement, it can't guarantee read consistency against a key immediately after it has been written (for consistency ONE or QUORUM). I love Cassandra's consistency model but can't help thinking that it's somewhat of a liability where user experience or data dependence between processes is an application requirement.

I marked this as 'APC round robin, named clusters, node auto discovery' in the roadmap for a 0.2 tag - it's not a verbose description of the (pretty big) issue but basically consists of :

Additionally, after your suggestion, I'll move the pool logic out of core and into a child of TSocket.php, with some heavy borrowing from the socket pool's open() code (TSocketPool itself isn't extensible enough for this). This will come with the addition of named pools and named hosts for more fine-grained control.

-Michael

redsolar commented 14 years ago

No problem.

For our needs, I am writing a simple(r) CRUD class, which is a lot more oriented towards performance than extensibility. It's more of a storage bridge, so that developers can pass things back and forth without needing to understand Cassandra, the consistency model, and a data model that is somewhat confusing for a newcomer who has worked mostly with relational data.

When I discovered your class, I liked the idea, but it's not what I am after. Having broken a few nails on Thrift/Cassandra interaction so far, I figured it's good to give back to someone taking on the task of writing a more general, rapid-development-oriented class.

There will be more to come :)