markcox / snappy

Automatically exported from code.google.com/p/snappy
Other
0 stars 0 forks source link

Does not handle data larger than 4GB #76

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
On 64 bit linux size_t is 64 bits and the compress/decompress api happily 
accepts buffers greater than 2^32:

size_t SNAPPY_FUNC_EXPORT Compress(const char* input, size_t input_length, 
string* output);

bool SNAPPY_FUNC_EXPORT Uncompress(const char* compressed, size_t 
compressed_length, string* uncompressed);

but then the code calls GetUncompressedLength() which tries to read a uint32 
from the compressed buffer and fails, so snappy::Uncompress() always fails for 
data sizes > 4GB.

Original issue reported on code.google.com by danadler...@gmail.com on 14 May 2013 at 9:17

GoogleCodeExporter commented 9 years ago
This is really inherent in the format; you couldn't change this without also 
breaking existing decompressors.

In any case, it's sort of meaningless compressing such large chunks, giving 
that the format internally breaks all data up in 64 kB parts that are processed 
individually anyway. In other words, it will buy you nothing in increased 
compressibility.

Original comment by se...@google.com on 14 May 2013 at 9:21

GoogleCodeExporter commented 9 years ago
The point is that this bug is undocumented and hidden from the user. One of two 
resolutions seem reasonable to me:
1. Change all the apis from size_t to uint32 to avoid false advertising
2. Add automatic 4GB chunking to your wrapper layer

Thanks,
-Dan

Original comment by danadler...@gmail.com on 15 May 2013 at 1:38

GoogleCodeExporter commented 9 years ago
#2 is out of the question, really, since that would break the format (the 
length is stored only once). We could probably change the types in the APIs, 
but that would mean ABI breakage, which is also bad.

Original comment by se...@google.com on 15 May 2013 at 1:42

GoogleCodeExporter commented 9 years ago
Thanks. For a future version I would suggest expanding the format as follows: 
check the leading uint32: if it is 0 then assume it's the new format followed 
by a 64 bit size and otherwise fall back to the existing 32 bit algo. That 
would salvage all existing stored data and allow you to extend the format. I 
would also add an extra version number into the header so that you can change 
the format in the future without breaking backwards compatibility.

Original comment by danadler...@gmail.com on 15 May 2013 at 1:54

GoogleCodeExporter commented 9 years ago
We're really not going to break the format for something people shouldn't 
really do anyway, sorry -- if you have that much data, stream it somehow 
instead of doing one Compress() call. The format has been stable for over eight 
years now, and forwards- and backwards-compatibility is an important feature.

I could probably add a comment saying that there's a hard limit of 4GB, though?

Original comment by se...@google.com on 15 May 2013 at 1:57

GoogleCodeExporter commented 9 years ago
I would argue that 8 years ago there was not such a proliferation of server 
hardware with tens or hundreds of GB of RAM. Compressing an array greater than 
4GB in memory today is commonplace, which is why I was so stunned by such a 
"year 2000" bug. But anyway, we get the message. No change is sight. Will 
pursue other alternatives. Thanks for your quick responses.

Original comment by danadler...@gmail.com on 15 May 2013 at 2:20

GoogleCodeExporter commented 9 years ago
I suppose that having a 32-bit limitation is because it would be a waste of 
space to use 64-bit pointers internally for the relatively small number of 
cases that are going to use it.  For a compression library this is generally 
unacceptable.

If you want to compress large datasets exceeding 4 GB the normal thing is to 
split them in chunks, compress them separately, and then decompress them and do 
the join by hand.  It is more work, but the compression ratios will benefit 
quite a lot. 

Original comment by fal...@gmail.com on 20 May 2013 at 5:22

GoogleCodeExporter commented 9 years ago
faltet: It's not actually about that; we already use 16-bit pointers and not 
32-bit for the compression itself (we split into 64 kB chunks that are 
compressed individually), so the encoder could handle it with minimal extra 
work, but since the decoder happened not to support it at some point, we don't 
want to introduce backwards incompatibility for a very small gain.

In any case, it's pretty clear that this is not going to change, so I'm closing 
the bug.

Original comment by se...@google.com on 21 May 2013 at 9:34