This is really inherent in the format; you couldn't change this without also
breaking existing decompressors.
In any case, it's sort of meaningless to compress such large chunks in one
call, given that the format internally breaks all data up into 64 kB parts
that are processed individually anyway. In other words, it will buy you
nothing in increased compressibility.
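A minimal sketch illustrating this point, using snappy's public C++ API (the test data here is made up and exact sizes will vary, but the two totals come out nearly identical because matches never cross a 64 kB block boundary either way):

    #include <snappy.h>
    #include <algorithm>
    #include <cstdio>
    #include <string>

    int main() {
      // Moderately compressible test data, about 8 MiB.
      std::string input;
      for (int i = 0; i < (1 << 20); ++i) input += "abcdefgh";

      // Compress everything in one call.
      std::string whole;
      snappy::Compress(input.data(), input.size(), &whole);

      // Compress the same data in 64 kB pieces, matching the
      // format's internal block size, and sum the sizes.
      const size_t kBlock = 64 * 1024;
      size_t pieces_total = 0;
      for (size_t off = 0; off < input.size(); off += kBlock) {
        std::string piece;
        size_t n = std::min(kBlock, input.size() - off);
        snappy::Compress(input.data() + off, n, &piece);
        pieces_total += piece.size();
      }

      std::printf("one call: %zu bytes, 64 kB pieces: %zu bytes\n",
                  whole.size(), pieces_total);
    }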
Original comment by se...@google.com
on 14 May 2013 at 9:21
The point is that this bug is undocumented and hidden from the user. One of two
resolutions seems reasonable to me:
1. Change all the APIs from size_t to uint32 to avoid false advertising
2. Add automatic 4 GB chunking to your wrapper layer (a sketch of what this
could look like follows below)
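A minimal sketch of resolution 2 as a wrapper (hypothetical code; CompressChunked, ChunkedBlob, and the 1 GiB chunk size are all made up here, not part of snappy):

    #include <snappy.h>
    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical container: each chunk is a complete snappy stream,
    // and the per-chunk compressed sizes let us find the pieces again.
    struct ChunkedBlob {
      std::vector<uint64_t> compressed_sizes;
      std::string data;  // concatenated snappy streams
    };

    ChunkedBlob CompressChunked(const char* input, uint64_t length) {
      const uint64_t kChunk = 1ULL << 30;  // 1 GiB, safely under 4 GB
      ChunkedBlob out;
      for (uint64_t off = 0; off < length; off += kChunk) {
        size_t n = static_cast<size_t>(std::min(kChunk, length - off));
        std::string piece;
        snappy::Compress(input + off, n, &piece);
        out.compressed_sizes.push_back(piece.size());
        out.data += piece;
      }
      return out;
    }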
Thanks,
-Dan
Original comment by danadler...@gmail.com
on 15 May 2013 at 1:38
#2 is out of the question, really, since that would break the format (the
length is stored only once). We could probably change the types in the APIs,
but that would mean ABI breakage, which is also bad.
Original comment by se...@google.com
on 15 May 2013 at 1:42
Thanks. For a future version I would suggest extending the format as follows:
check the leading uint32; if it is 0, assume it's the new format followed by a
64-bit size, and otherwise fall back to the existing 32-bit algorithm. That
would salvage all existing stored data and still allow you to extend the
format. I would also add an extra version number to the header so that you can
change the format in the future without breaking backwards compatibility.
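A toy sketch of that header scheme (hypothetical; snappy's actual preamble stores the uncompressed length as a varint rather than a fixed uint32, so this only illustrates the idea):

    #include <cstdint>
    #include <cstring>
    #include <string>

    // Zero in the leading 32-bit word is the sentinel for "new format,
    // 64-bit length follows"; any nonzero word is an old 32-bit length.
    std::string EncodeLength(uint64_t n) {
      std::string out;
      if (n != 0 && n < UINT32_MAX) {
        uint32_t n32 = static_cast<uint32_t>(n);  // old format
        out.assign(reinterpret_cast<const char*>(&n32), sizeof n32);
      } else {
        uint32_t zero = 0;  // sentinel
        out.assign(reinterpret_cast<const char*>(&zero), sizeof zero);
        out.append(reinterpret_cast<const char*>(&n), sizeof n);
      }
      return out;
    }

    uint64_t DecodeLength(const char* p) {
      uint32_t head;
      std::memcpy(&head, p, sizeof head);
      if (head != 0) return head;  // old 32-bit format
      uint64_t n;
      std::memcpy(&n, p + sizeof head, sizeof n);  // new 64-bit format
      return n;
    }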
Original comment by danadler...@gmail.com
on 15 May 2013 at 1:54
We're really not going to break the format for something people shouldn't be
doing anyway, sorry -- if you have that much data, stream it somehow instead
of doing one Compress() call. The format has been stable for over eight years
now, and forwards- and backwards-compatibility is an important feature.
I could probably add a comment saying that there's a hard limit of 4 GB, though?
Original comment by se...@google.com
on 15 May 2013 at 1:57
I would argue that 8 years ago there was not such a proliferation of server
hardware with tens or hundreds of GB of RAM. Compressing an array greater than
4 GB in memory today is commonplace, which is why I was so stunned by such a
"year 2000" bug. But anyway, we get the message. No change in sight. Will
pursue other alternatives. Thanks for your quick responses.
Original comment by danadler...@gmail.com
on 15 May 2013 at 2:20
I suppose the 32-bit limitation exists because it would be a waste of space to
use 64-bit pointers internally for the relatively small number of cases that
would need them, and for a compression library that kind of overhead is
generally unacceptable.
If you want to compress large datasets exceeding 4 GB, the normal thing is to
split them into chunks, compress each chunk separately, and then decompress
the chunks and do the join by hand. It is more work, but the compression
ratios will benefit quite a lot.
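For the decompression side, a matching sketch of the "join by hand" step (again hypothetical, reusing the made-up ChunkedBlob layout from the compression sketch above):

    #include <snappy.h>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Decompress each chunk (a complete snappy stream) and append
    // the results to rebuild the original buffer.
    std::string DecompressChunked(const std::string& data,
                                  const std::vector<uint64_t>& sizes) {
      std::string out;
      size_t off = 0;
      for (uint64_t sz : sizes) {
        std::string piece;
        if (!snappy::Uncompress(data.data() + off,
                                static_cast<size_t>(sz), &piece)) {
          return std::string();  // corrupt chunk
        }
        out += piece;
        off += static_cast<size_t>(sz);
      }
      return out;
    }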
Original comment by fal...@gmail.com
on 20 May 2013 at 5:22
faltet: It's not actually about that; we already use 16-bit pointers rather
than 32-bit for the compression itself (we split the input into 64 kB chunks
that are compressed individually), so the encoder could handle larger inputs
with minimal extra work. But since the decoder happened not to support them at
some point, we don't want to introduce a backwards incompatibility for a very
small gain.
In any case, it's pretty clear that this is not going to change, so I'm closing
the bug.
Original comment by se...@google.com
on 21 May 2013 at 9:34
Original issue reported on code.google.com by danadler...@gmail.com
on 14 May 2013 at 9:17