hector-client / hector

a high level client for cassandra
http://prettyprint.me/2010/02/23/hector-a-java-cassandra-client/
MIT License

ISSUE:601 - MultigetSliceIterator Implementation for "MultigetSliceQuery" #604

Closed vinaykumarchella closed 11 years ago

vinaykumarchella commented 11 years ago

This implementation addresses issue https://github.com/hector-client/hector/issues/601. MultigetSliceIterator iterates over a MultigetSliceQuery result set, refreshing until all qualifying rows for the input keys have been retrieved. The iterator can issue its queries in parallel, which works around the Thrift message size limit and the poor performance of lookups over very large sets of row keys.

INTRODUCTION
Currently there is no iterator for MultigetSliceQuery comparable to ColumnSliceIterator (CSI) and KeyIterator. MultigetSliceQuery has many users and use cases, and an iterator for it addresses the following concerns:
(1) When there are many rows to retrieve and the response is larger than thrift_max_message_length_in_mb in cassandra.yaml (default 16 MB), MultigetSliceQuery fails with a "socket" or "timeout" exception. A MultigetSliceQuery with iterator capabilities resolves this issue.
(2) Querying Cassandra with many row keys in a single call is slow. An iterator with parallelization capabilities resolves this performance issue.

IMPLEMENTATION
MultigetSliceIterator iterates over the MultigetSliceQuery result set, refreshing until all qualifying rows for the input keys have been retrieved. Parallelism is controlled by the maxThreadCount option. If maxThreadCount is not provided, the iterator calls Cassandra sequentially, with a set of maxRowCountPerQuery row keys at a time, until all keys have been processed. For example, with maxRowCountPerQuery set to 100 and maxThreadCount set to 5, a total of 500 keys is retrieved with 5 calls to Cassandra issued in parallel on 5 threads. To make those 5 calls sequentially, without threads, leave maxThreadCount unset or set it to 0.

maxRowCountPerQuery - caps the number of row keys sent per query, which helps avoid the "socket" and "timeout" exceptions.
maxThreadCount - improves the performance of MultigetSliceQuery by enabling parallel queries.
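
The batching behaviour described above can be sketched roughly as follows. This is not the code from this pull request and it does not use the Hector API: the class name BatchedMultigetSketch, the helper methods, and the fetchBatch function (which stands in for one MultigetSliceQuery call against Cassandra) are all hypothetical. It also fetches everything eagerly, whereas the real iterator presumably refreshes as it is consumed. It only illustrates how maxRowCountPerQuery and maxThreadCount might interact.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch; names are illustrative and not part of Hector.
class BatchedMultigetSketch {

    // Split the keys into consecutive batches of at most maxRowCountPerQuery keys.
    static <K> List<List<K>> partition(List<K> keys, int maxRowCountPerQuery) {
        if (maxRowCountPerQuery <= 0) {
            throw new IllegalArgumentException("maxRowCountPerQuery must be positive");
        }
        List<List<K>> batches = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += maxRowCountPerQuery) {
            int end = Math.min(i + maxRowCountPerQuery, keys.size());
            batches.add(new ArrayList<>(keys.subList(i, end)));
        }
        return batches;
    }

    // fetchBatch is an assumed stand-in for a single multiget call.
    static <K, V> Map<K, V> fetchAll(List<K> keys,
                                     int maxRowCountPerQuery,
                                     int maxThreadCount,
                                     Function<List<K>, Map<K, V>> fetchBatch)
            throws InterruptedException, ExecutionException {
        List<List<K>> batches = partition(keys, maxRowCountPerQuery);
        Map<K, V> rows = new LinkedHashMap<>();

        // maxThreadCount unset (modelled here as 0): one batch after another.
        if (maxThreadCount <= 0) {
            for (List<K> batch : batches) {
                rows.putAll(fetchBatch.apply(batch));
            }
            return rows;
        }

        // Otherwise run up to maxThreadCount batches concurrently.
        ExecutorService pool = Executors.newFixedThreadPool(maxThreadCount);
        try {
            List<Future<Map<K, V>>> pending = new ArrayList<>();
            for (List<K> batch : batches) {
                Callable<Map<K, V>> task = () -> fetchBatch.apply(batch);
                pending.add(pool.submit(task));
            }
            // Collect results in submission order so the output follows key order.
            for (Future<Map<K, V>> future : pending) {
                rows.putAll(future.get());
            }
        } finally {
            pool.shutdown();
        }
        return rows;
    }
}
```

With 500 keys, maxRowCountPerQuery = 100 and maxThreadCount = 5, this sketch makes 5 calls to fetchBatch, each with 100 keys, on a pool of 5 threads. Passing 0 for maxThreadCount makes the same 5 calls one after another.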