scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

Change default LZ4 chunk size to 4k #1377

Closed dorlaor closed 8 years ago

dorlaor commented 8 years ago

We got a report indicating that this boosted read performance over a GCE persistent (non-local) drive by a huge factor.

nyh commented 8 years ago

This is an interesting suggestion. Cassandra's default LZ4 chunk size is 64 KB, and we took the same default, but it's indeed worth rethinking the tradeoffs involved:

In the past, having a small chunk size meant we went to the disk to read those small chunks one by one. But this was fixed in commit 2f565777945801f99622b3155cdbbae702d52bbd, and now when we intend to read large swathes of data, we read large buffers (128 KB by default) even if the compressed chunks are much smaller. So we'll probably not lose much sequential-read performance by using small compressed chunks, while we'll obviously gain performance in random-access reads.
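
To make that decoupling concrete, here is a minimal, hypothetical C++ sketch (not Scylla's actual code; all names below are invented for illustration) of the idea behind that commit: issue one large read covering a run of small compressed chunks, then carve the individual chunks out of the buffer in memory, instead of doing one disk read per chunk.

#include <unistd.h>     // pread()
#include <cstdint>
#include <stdexcept>
#include <vector>

struct chunk_extent {
    uint64_t offset;    // position of the compressed chunk in the sstable file
    uint32_t length;    // compressed length on disk
};

// Read all of the given compressed chunks (assumed sorted by offset) using as
// few large I/Os as possible, instead of one small read per chunk.
std::vector<std::vector<char>>
read_chunks_coalesced(int fd, const std::vector<chunk_extent>& chunks,
                      size_t read_buffer_size = 128 * 1024) {
    std::vector<std::vector<char>> result;
    size_t i = 0;
    while (i < chunks.size()) {
        uint64_t start = chunks[i].offset;
        // Extend the window while the next chunk still fits in one buffer.
        size_t j = i + 1;
        while (j < chunks.size() &&
               chunks[j].offset + chunks[j].length - start <= read_buffer_size) {
            ++j;
        }
        uint64_t end = chunks[j - 1].offset + chunks[j - 1].length;
        std::vector<char> buf(end - start);
        if (pread(fd, buf.data(), buf.size(), start) != (ssize_t)buf.size()) {
            throw std::runtime_error("short read");
        }
        // Slice the individual compressed chunks out of the big buffer;
        // per-chunk LZ4 decompression would happen from here.
        for (; i < j; ++i) {
            auto first = buf.begin() + (chunks[i].offset - start);
            result.emplace_back(first, first + chunks[i].length);
        }
    }
    return result;
}

With 4 KB chunks, a single 128 KB buffer covers roughly 30 compressed chunks, so sequential scans still issue large disk reads.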

Another tradeoff involved in the chunk size is compression ratio: Presumably, better compression can be attained when compressing bigger chunks. We need to see in practice how much compression ratio we lose by switching from 64 KB chunks to 4 KB chunks.
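
One quick way to check this in practice is to compress the same data chunk-by-chunk at both sizes using the plain LZ4 C API and compare the totals. This is only a standalone sketch with synthetic input (build with -llz4); ratios on real sstable data will of course differ.

#include <lz4.h>
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Compress `data` in fixed-size chunks and return the total compressed size.
static size_t compressed_size(const std::string& data, size_t chunk_size) {
    std::vector<char> out(LZ4_compressBound((int)chunk_size));
    size_t total = 0;
    for (size_t pos = 0; pos < data.size(); pos += chunk_size) {
        int n = (int)std::min(chunk_size, data.size() - pos);
        int c = LZ4_compress_default(data.data() + pos, out.data(), n, (int)out.size());
        total += (c > 0 && c < n) ? (size_t)c : (size_t)n;  // keep incompressible chunks as-is
    }
    return total;
}

int main() {
    // Synthetic, mildly repetitive payload standing in for real sstable data.
    std::string data;
    for (int i = 0; i < 20000; ++i) {
        data += "{\"user\":" + std::to_string(i % 500) + ",\"status\":\"active\"}";
    }
    for (size_t chunk_size : {64 * 1024, 4 * 1024}) {
        double ratio = (double)data.size() / compressed_size(data, chunk_size);
        std::printf("%2zu KB chunks: compression ratio %.2f\n", chunk_size / 1024, ratio);
    }
}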

So I think before deciding the default chunk size, we should consider the different tradeoffs: small reads vs. sequential reads (and writes), compression ratio, etc., in some relevant test case. Of course, whatever we choose will only be the default - people can still control the chunk size separately for each CF, like they have been doing for Cassandra. So the new default we choose does not have to fit every use case.

dorlaor commented 8 years ago

You're right. However, the problem is the lack of a disk cache. So if we need to access a 100-byte object, we'll read a 64 KB chunk and throw most of it away. Changing the chunk size to 4 KB gave a 60x improvement. If we solve the caching problem, or add an LRU/MRU cache and avoid deserializing the unneeded entries, it will be better (but that requires much more work).
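
As a rough back-of-the-envelope illustration (not a measurement), the read amplification for a ~100-byte object is:

64 KB / 100 B ≈ 655x with the current default, vs. 4 KB / 100 B ≈ 41x with 4 KB chunks,

i.e. each random read pulls in and decompresses 16x less data; the reported 60x speedup presumably also includes effects beyond this raw amplification.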

nyh commented 8 years ago

Just remember that another option is to disable compression completely, so we need to confirm that at 4 KB chunks, the compression is still good enough to bother with. The results in https://www.percona.com/blog/2016/03/09/evaluating-database-compression-methods/ suggest that this is indeed the case: they report a compression ratio of 2 for 64 KB chunks and 1.75 for 4 KB chunks - which is probably indeed a good compromise.
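
For reference, the on-disk size increase implied by those two ratios is modest:

(1 / 1.75) / (1 / 2) = 2 / 1.75 ≈ 1.14,

i.e. files roughly 14% larger with 4 KB chunks, in the same ballpark as the ~10% figure mentioned later in this thread.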

dorlaor commented 8 years ago

Without compression the performance was better, but then the compression ratio is just 1.

avikivity commented 8 years ago

A cache only helps if the data fits in it. Random reads from a 1 TB dataset on a 64 GB machine will have a ~0% hit rate even if all of that memory is dedicated to cache. So reducing the compression granularity is a much better option.
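
As a rough upper bound, assuming uniformly random reads over the whole dataset: the hit rate cannot exceed cache size / data size = 64 GB / 1 TB ≈ 6%, and in practice the cache gets only a fraction of memory, so effectively almost every read goes to disk and pays the full chunk-decompression cost.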

slivne commented 8 years ago

The question is whether this change makes sense not only for cassandra-stress but also for other users. Reading 100 bytes may be an extreme case that does not fit the usual use case.

We have a different case in which we are targeting inserting rows of 60 KB; this load may suffer from the change to a smaller block size.

avikivity commented 8 years ago

The compression ratio will be reduced, but even in that case the I/O bandwidth will be halved.

For really large rows you could increase the block size and gain an extra 10%-20% of disk space. 4 KB seems a lot more robust: you lose a little compression ratio, but gain sane behavior for partitions that are smaller than 100 KB.

nyh commented 8 years ago

Even with large reads (and 60 KB isn't that large - it's just the size of one old default chunk, so one read may still overlap more than one chunk and read much too much), I believe this load won't suffer a significant performance degradation, because of the improvement I mentioned earlier: the reads from disk no longer happen at these chunk sizes. There will be slightly more overhead in handling more chunks, but I don't think it should be very significant.

There will also be a reduction in compression ratio, unfortunately, but the link I provided earlier claims the file size will grow by around 10% if we switch from 64 KB to 4 KB compression blocks. There's obviously a tradeoff here, but a person who doesn't like our choice in this tradeoff can always set the chunk_length_kb compression parameter manually.

The following trivial patch can be tested:

diff --git a/compress.hh b/compress.hh
index 46772a1..701c7e2 100644
--- a/compress.hh
+++ b/compress.hh
@@ -32,7 +32,7 @@ enum class compressor {

 class compression_parameters {
 public:
-    static constexpr int32_t DEFAULT_CHUNK_LENGTH = 64 * 1024;
+    static constexpr int32_t DEFAULT_CHUNK_LENGTH = 4 * 1024;
     static constexpr double DEFAULT_CRC_CHECK_CHANCE = 1.0;

     static constexpr auto SSTABLE_COMPRESSION = "sstable_compression";
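
For reference, the per-CF override mentioned above would look something like this in CQL, using the Cassandra 2.1-era option names (ks.cf is a placeholder table, and option spellings vary across versions, so treat this as illustrative):

ALTER TABLE ks.cf WITH compression = {'sstable_compression': 'LZ4Compressor', 'chunk_length_kb': 64};
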
avikivity commented 8 years ago

Note, this still leaves some performance on the table if the disk supports 512-byte reads (and many do). But I think that 4k is the right tradeoff.

belliottsmith commented 8 years ago

Something else you might want to consider, once the default size is dropped, is flipping around the meaning of the compression block size. Assuming you honour Cassandra's definition, i.e. the size of the uncompressed data, then tearing means you're likely performing one wasted disk operation per read. If instead 4 KB were the compressed size, it would be a guaranteed one (or 8), reducing the cost by between 12.5% and 50%.

This would require some upstream changes to the compressors, but not overly onerous ones.

avikivity commented 8 years ago

Right now I want to keep 2.1 SSTables compatible. We're considering an alternative where only blobs are compressed (assuming the metadata is already encoded efficiently), and the entire issue goes away.

belliottsmith commented 8 years ago

The value diminishes, but I'm not sure it entirely goes away. Some data sets likely have a lot of numerical and textual data that is highly repetitious, in small (non-blob) values. Once the metadata becomes efficiently encoded, even modestly sized values and blobs can become a significant chunk of the disk space.

Unless by blobs you mean to have a separate arena for the cell values of a collection of rows, associated with some kind of data page, that can be compressed together.

There are probably some more advanced schemes, like having distinct compressors for each column, but they still ideally need some contextual spatial unit over which to accumulate statistical knowledge (I guess that could be the row, though).

avikivity commented 8 years ago

No, I meant each blob individually. Certainly it's less efficient than having a full dictionary.

I believe most of the value of compression in C* is due to the intensive duplication in 2.x SSTables, and due to people wrapping their data in JSON. The former can be fixed with a better format, the latter with a server that doesn't penalize proper modeling and by using UDTs. There's still some on the table, but you'll never get all of it with a small block size, and the random access penalty for a large block size is just horrible. So I think we can leave that out.

belliottsmith commented 8 years ago

Certainly the 2.x line is largely using compression as a crutch, and the large block size is horrific for performance - although I think it is a hangover from spinning rust, where it actually makes a great deal of sense to amortise the seek costs in case the larger quantity is useful.

But that doesn't mean users won't be clamouring for compression for their data when all of those things are fixed, or that compression wouldn't give some of them real value. There will always be users and use cases who value space over bandwidth.

avikivity commented 8 years ago

True. I guess we can offer large-block compression for them, and with compressed blocks occupying disk blocks perhaps. We'll probably integrate the compression offsets into the index to avoid an O(n) in-memory data structure.

belliottsmith commented 8 years ago

Absolutely. There's a whole world of interesting things to do structurally with sstables; their current form is pretty basic. The offsets can also themselves be compressed.

avikivity commented 8 years ago

Fixed by 164c76032416d0f826a2ef14dfb6f4c97634458a.

nyh commented 8 years ago

We forgot to close this issue.