pyr / cyanite

cyanite stores your metrics
http://cyanite.io

Misuse of BATCH statements #232

Closed fromanator closed 8 years ago

fromanator commented 8 years ago

Looks like there was a recent change that batches a group of writes to Cassandra. Generally this is an anti-pattern: it results in slower write times and can potentially destabilize a cluster, depending on the severity of the abuse (here, the batch size and how many partitions the writes span).

Instead I would encourage only allowing UNLOGGED BATCH when the writes are all going to a single partition in Cassandra. See this great article for the details.
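
For reference, a minimal sketch of a single-partition UNLOGGED batch with the Datastax Java driver; the `metrics` keyspace and the `metric (path, time, point)` table are made-up stand-ins for whatever schema cyanite actually uses:

```java
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class SinglePartitionBatch {
    public static void main(String[] args) {
        // Hypothetical table:
        // CREATE TABLE metric (path text, time timestamp, point double, PRIMARY KEY (path, time))
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("metrics");

        PreparedStatement insert =
            session.prepare("INSERT INTO metric (path, time, point) VALUES (?, ?, ?)");

        // All rows in this batch share the partition key "web01.cpu.load",
        // so the UNLOGGED batch stays on a single partition / replica set.
        BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
        long now = System.currentTimeMillis();
        for (int i = 0; i < 10; i++) {
            batch.add(insert.bind("web01.cpu.load",
                                  new java.util.Date(now + i * 1000L),
                                  Math.random()));
        }
        session.execute(batch);

        cluster.close();
    }
}
```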

ifesdjeen commented 8 years ago

We've discussed a related issue with @pyr and decided to take a slightly different route: batching up / compacting the data differently. I'll roll back the batches in a couple of days.

UNLOGGED BATCH won't work in our case, since in the majority of cases we're writing to different partitions.

fromanator commented 8 years ago

@ifesdjeen yeah, if you are not using any load balancing at the client (like token aware with round robin) you would see some performance increase for relatively small batch sizes. However, by doing so you are essentially offloading the client's work onto your coordinator node. If the batches are relatively large and span several partitions, you can put significant load on the coordinator node and could bring it down.

Usually the best way to get write performance in Cassandra is to have your client do the load balancing (with the aforementioned token aware + round robin) and group writes by partition via unlogged batches. In your case, if you can't effectively group the writes by partition, then just do each write separately (preferably async).
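
A minimal sketch of that last suggestion (token aware + round robin, individual async writes) with the Datastax Java driver; the keyspace, table, and metric names are hypothetical:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

public class AsyncWrites {
    public static void main(String[] args) {
        // Token-aware load balancing wrapped around round-robin, so each write is
        // routed directly to a replica that owns its partition key.
        Cluster cluster = Cluster.builder()
            .addContactPoint("127.0.0.1")
            .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
            .build();
        Session session = cluster.connect("metrics");

        PreparedStatement insert =
            session.prepare("INSERT INTO metric (path, time, point) VALUES (?, ?, ?)");

        // Fire each insert asynchronously instead of batching across partitions.
        List<ResultSetFuture> futures = new ArrayList<>();
        String[] paths = {"web01.cpu.load", "web02.cpu.load", "db01.disk.used"};
        for (String path : paths) {
            futures.add(session.executeAsync(insert.bind(path, new Date(), Math.random())));
        }
        // Wait for completion; in production you would bound the number of in-flight requests.
        for (ResultSetFuture f : futures) {
            f.getUninterruptibly();
        }

        cluster.close();
    }
}
```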

ifesdjeen commented 8 years ago

Async writes are arguably not the best way to do it. Queuing things is the most straightforward but not always the most performant approach, because of context switches (the same data will be processed on a different thread), cache off-loading, etc. Unfortunately, we're currently not optimised for thread-per-core anyway.

As I mentioned, we'll just reduce the number of writes we need to do in the first place (configurably) instead of trying to batch things up. Thanks for pointing that out!

fromanator commented 8 years ago

@ifesdjeen there are a couple of ways the writes could be optimized, but both require some work on the client side. The general idea is to group writes by partition key, or, as a more involved method, to group them by their token ranges.

Option 1 would involve sorting the data you need to write by partition key (e.g. the metric name), then grouping each key's rows together into an UNLOGGED BATCH insert.
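
A rough sketch of option 1, assuming the same hypothetical `metric (path, time, point)` table and a prepared insert: group the pending points by partition key, then execute one UNLOGGED batch per key:

```java
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupByPartitionKey {
    // Hypothetical in-memory representation of a point to write.
    static class Point {
        final String path;   // partition key (metric name)
        final Date time;
        final double value;
        Point(String path, Date time, double value) {
            this.path = path; this.time = time; this.value = value;
        }
    }

    // Group pending points by partition key, then issue one UNLOGGED batch per key.
    static void writeGrouped(Session session, PreparedStatement insert, List<Point> points) {
        Map<String, BatchStatement> batches = new HashMap<>();
        for (Point p : points) {
            batches
                .computeIfAbsent(p.path, k -> new BatchStatement(BatchStatement.Type.UNLOGGED))
                .add(insert.bind(p.path, p.time, p.value));
        }
        // Every batch now targets a single partition, so it stays on one replica set.
        for (BatchStatement batch : batches.values()) {
            session.execute(batch);
        }
    }
}
```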

Option 2 would require you to get the token ranges of the cluster. This can be done with the Datastax driver via Cluster.getMetadata(); the resulting metadata gives you the token ranges. Once you have the token ranges, you then need to calculate the token for each piece of data you want to insert. You could use com.datastax.driver.core.Token.getFactory("Murmur3Partitioner") (assuming the cluster is configured with the Murmur3 partitioner) and feed it the string of your partition key to generate these. Then, once you know the token ranges and the token of each row, group together the data that falls within the same token range (and will hence be written to a single server) and do an UNLOGGED BATCH for each group.
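
Depending on the driver version, that Token factory may not be exposed as public API, so here is a rough sketch of the same grouping idea using `Metadata.getReplicas` instead: rows whose partition keys map to the same replica set go into the same UNLOGGED batch. It assumes the hypothetical table above with a single text partition-key column (so the routing key is just the UTF-8 bytes of the metric name):

```java
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Metadata;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GroupByReplicaSet {

    // Hypothetical in-memory point, same shape as in the option 1 sketch.
    static class Point {
        final String path;
        final Date time;
        final double value;
        Point(String path, Date time, double value) {
            this.path = path; this.time = time; this.value = value;
        }
    }

    // For a single text partition-key column, the routing key is simply the UTF-8 bytes.
    static ByteBuffer routingKey(String path) {
        return ByteBuffer.wrap(path.getBytes(StandardCharsets.UTF_8));
    }

    // Group points by the replica set that owns their partition key, then issue one
    // UNLOGGED batch per group, so every batch only touches nodes that own all of its rows.
    static void writeGroupedByReplicas(Cluster cluster, Session session,
                                       PreparedStatement insert, String keyspace,
                                       List<Point> points) {
        // The driver keeps this metadata up to date as the ring changes,
        // so the grouping follows token range movements automatically.
        Metadata metadata = cluster.getMetadata();
        Map<Set<Host>, BatchStatement> batches = new HashMap<>();
        for (Point p : points) {
            Set<Host> replicas = metadata.getReplicas(keyspace, routingKey(p.path));
            batches
                .computeIfAbsent(replicas, k -> new BatchStatement(BatchStatement.Type.UNLOGGED))
                .add(insert.bind(p.path, p.time, p.value));
        }
        for (BatchStatement batch : batches.values()) {
            session.execute(batch);
        }
    }
}
```

Because the driver refreshes its metadata as the cluster topology changes, regrouping on each flush should pick up token range movements without extra bookkeeping.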

Option 1 would be by far the easier to implement, but as you mentioned the inserted data usually does not share the same partition key. That's where option 2 might suit your needs better, though there are more nuances to it. For option 2 I would keep the batch-size tuning you added, but make sure you refresh the state of the token ranges periodically (since they can change).

ifesdjeen commented 8 years ago

Great, thanks for the details!

We decided to take a more application/domain-specific approach, which should actually yield the best performance in our case, as far as we can tell.

Also, #236 is merged and the batches patch has been reverted.