pingles / clj-hector

Simple Cassandra client for Clojure

Multiple-row put functionality #20

Closed rplevy-draker closed 12 years ago

rplevy-draker commented 12 years ago

Hi,

This change adds a function called batch-put, which accepts a vector of maps with :pk and :col-map keys. Compare this to put, which takes pk and col-map as separate arguments.

The key advantage is that all of the rows are accumulated and executed in a single batch operation, instead of initiating a separate put for each row.
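For illustration, a call might look like the following sketch. The keyspace setup, column family name, and exact argument order are assumptions for the example, not taken from the patch itself:

```clojure
;; Hypothetical usage sketch; assumes a keyspace `ks` already created
;; via clj-hector and a column family "users".

;; Single-row put: row key and column map are separate arguments.
(put ks "users" "user-1" {"name" "Alice" "city" "Boston"})

;; Proposed batch-put: one vector of {:pk ... :col-map ...} maps,
;; all written to Cassandra in a single batch mutation.
(batch-put ks "users"
           [{:pk "user-1" :col-map {"name" "Alice" "city" "Boston"}}
            {:pk "user-2" :col-map {"name" "Bob"   "city" "Austin"}}])
```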

Let me know what you think,

Rob

nickmbailey commented 12 years ago

Sorry it took me a while to get to this. I have a couple of thoughts.

It seems like the 'rows' argument to batch-put could just be a single map rather than a vector of maps with predetermined keys. For example:

{<row key 1> {<column map>}
 <row key 2> {<column map>}}

My only other suggestion would be that we add a note to the batch-put documentation mentioning that it is a good idea to keep batch sizes fairly small. It is pretty easy for someone new to Cassandra to get overzealous with the size of batch operations.
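Concretely, with the suggested map-based argument the call might look like this (again a sketch; the surrounding keyspace `ks` and column family name are assumed):

```clojure
;; Sketch of the suggested shape: row keys map directly to their
;; column maps, with no :pk/:col-map wrapper keys needed.
(batch-put ks "users"
           {"user-1" {"name" "Alice" "city" "Boston"}
            "user-2" {"name" "Bob"   "city" "Austin"}})
```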

rplevy-draker commented 12 years ago

Thanks, I very much agree that a map is more appropriate. I added a couple more commits based on this feedback...

alanpeabody commented 12 years ago

For what it is worth, @rplevy-draker and I are seeing significant gains using large batch sizes in our use case on our setup.

nickmbailey commented 12 years ago

Looks good. Thanks for the pull request!

For what it's worth, the main problem with large batches is handling the failure case. On the Cassandra side, every row in a batch mutate effectively becomes its own mutation. If any one of those mutations fails, the entire operation returns an exception to the client. At that point it is impossible to know which parts of the large mutation succeeded, so the entire mutation must be retried. While there is a slight performance boost from batching, you can usually get the same boost by parallelizing your code / using a connection pool.
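The parallelizing alternative could be sketched as below. This is hypothetical (the helper name is mine, and it assumes put and the underlying pooled client are safe to call from multiple threads); the point is that each row fails or succeeds independently, so only the failed rows need retrying:

```clojure
;; Hypothetical alternative to one large batch: issue independent
;; single-row puts in parallel. A failure only requires retrying the
;; rows whose put threw, not the whole batch.
(defn parallel-put
  [ks cf rows]                       ; rows: map of row-key -> column map
  (doall
    (pmap (fn [[pk col-map]]
            (try
              (put ks cf pk col-map)
              [pk :ok]
              (catch Exception e
                [pk e])))            ; record per-row failures for retry
          rows)))
```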

There are a few other concerns with large batches, since the entire request is held in memory at once, but usually you will hit the maximum Thrift message size limit first.