You can use snappy_unittest for this; it's crude, but it works.
What's the intended use case? For disk-to-disk compression, usually you have
CPU time for something like gzip -1 instead.
Original comment by se...@google.com
on 18 Apr 2011 at 7:12
It would just be useful in the same way that lzop is useful, as a general
pipeline tool. E.g. disk-to-different-disk, process-to-ssh, etc.
Original comment by yaa...@gmail.com
on 22 Apr 2011 at 2:40
Attached patch:
1) Adds streaming support, at least for streams created with the current
compressor.
2) Creates command-line tools 'snzip' and 'snunzip'. Both work solely with
standard input and output, making them most useful in pipes.
The resulting tool passes basic sanity checks (compress/decompress) and seems
to have acceptable performance. It has the limitation that for files larger
than 64K, the reported file size will differ from the actual file size (since
the header must be output before the entire stream is received).
My C++ is rusty to nonexistent, so style/culture fixes are welcome.
Original comment by PavPanch...@gmail.com
on 17 Jun 2011 at 9:24
Attachments:
I made another patch, snzip.dif, which builds snzip.
It has options similar to those of gzip and bzip2, as follows.

To compress file.tar:
snzip file.tar
The compressed file is named 'file.tar.snz' and the original file is deleted.
The timestamp, mode, and permissions are preserved as far as possible.

To compress file.tar and write to standard output:
snzip -c file.tar > file.tar.snz
or
cat file.tar | snzip > file.tar.snz

To uncompress file.tar.snz:
snzip -d file.tar.snz
or
snunzip file.tar.snz
The uncompressed file is named 'file.tar' and the original file is deleted.
The timestamp, mode, and permissions are preserved as far as possible.
If the program name includes 'un', as in snunzip, it acts as if '-d' were set.

To uncompress file.tar.snz and write to standard output:
snzip -dc file.tar.snz > file.tar
snunzip -c file.tar.snz > file.tar
snzcat file.tar.snz > file.tar
cat file.tar.snz | snzcat > file.tar
If the program name includes 'cat', as in snzcat, it acts as if '-dc' were set.

It has been tested on Linux and should work on other Unix-like OSs. On
Windows, it needs a getopt(3)-compatible function, which is available in many
places as a public-domain implementation.
Original comment by kubo.tak...@gmail.com
on 31 Jul 2011 at 12:12
Attachments:
Sorry, I failed to attach the correct file.
I have attached a new one.
Original comment by kubo.tak...@gmail.com
on 31 Jul 2011 at 12:16
Attachments:
Kubo, your patch seems to work well. I did have to make one change for the
missing 'PACKAGE_STRING', and it was not being compiled by default when I ran
'make snzip', but the utility is exactly what I was looking for. I've also
added a -v option to print out the version, 1.0.3.
Original comment by jehiah
on 12 Aug 2011 at 7:52
I made a new patch to support mingw32 and cygwin.

> Kubo, your patch seems to work well. I did have to make one change for the
missing 'PACKAGE_STRING', and it was not being compiled by default when I ran
'make snzip', but the utility is exactly what I was looking for. I've also
added a -v option to print out the version, 1.0.3.

The missing macro 'PACKAGE_STRING' is defined in config.h by autoconf.
What version of autoconf do you use? I'm using autoconf 2.65.
I would also prefer '-v' as an option to print the version, but gzip and
bzip2 use it for verbose output, so I didn't add it.
Original comment by kubo.tak...@gmail.com
on 21 Aug 2011 at 9:37
Attachments:
>> I did have to make one change for the missing 'PACKAGE_STRING' and it was
not being compiled by default when I ran 'make snzip'

Could you provide your changes?
Original comment by and...@inffinity.com
on 22 Sep 2011 at 2:31
Guys, I'm a little confused here. Shouldn't downloading snappy.h allow you to
simply run this call from within your C++ code:
snappy::Compress('/tmp/testfileinput', '/tmp/testfileoutput');
Just two simple string inputs?
Original comment by mina.mou...@hotmail.com
on 24 Sep 2011 at 5:00
@mina.moussa:
snappy::Compress reads the full input, performs compression, and writes the
full output.
Imagine you have 5 TB of data to compress: what do you do? Well, you can buy
lots of RAM and hard disks to swap to while the compression happens.
Or, better, you can write a loop that reads the file in chunks, runs each
chunk through snappy::Compress, and writes each compressed chunk to an output
file in a container format that can later be decompressed by reading the
discrete chunks back and decompressing them; see the sketch below.
Though I haven't played with these command-line tools, if they behave
properly they should let you avoid both inventing a container file format and
writing the chunking loops yourself, by streaming input to the tool, which
streams the compressed/decompressed output.
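
For illustration, here is a minimal C++ sketch of that chunked approach,
assuming a made-up container format (a 64 KB chunk size and a host-endian
uint32 length prefix before each compressed block), not the format of any
tool attached here:

    // Read stdin in fixed-size chunks, compress each independently, and
    // write a length prefix plus the compressed bytes for each chunk.
    #include <snappy.h>
    #include <cstdint>
    #include <cstdio>
    #include <string>

    int main() {
      const size_t kBlockSize = 65536;  // assumed chunk size
      std::string input(kBlockSize, '\0');
      std::string compressed;
      size_t n;
      while ((n = fread(&input[0], 1, kBlockSize, stdin)) > 0) {
        snappy::Compress(input.data(), n, &compressed);
        uint32_t len = static_cast<uint32_t>(compressed.size());
        fwrite(&len, sizeof(len), 1, stdout);  // length prefix
        fwrite(compressed.data(), 1, compressed.size(), stdout);
      }
      return 0;
    }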
Original comment by dwil...@builderadius.com
on 26 Sep 2011 at 2:49
Yes, this is probably what you'd want for a command-line tool supporting
pipes: a simple framing format. For each block, probably the compressed
length (the uncompressed length is already in the format), perhaps some flags
(EOF?), and the CRC32c of the uncompressed data in that block.
Original comment by se...@google.com
on 26 Sep 2011 at 2:54
We have a simple framing format for streaming in the Java port of Snappy:
https://github.com/dain/snappy/blob/master/src/main/java/org/iq80/snappy/SnappyOutputStream.java
Each 32k block is preceded by a 3-byte header: a 1-byte flag indicating
whether the block is compressed, and a 2-byte length of the block.
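
As a rough sketch of writing that 3-byte header (the flag values are my
assumption, not read from the Java code; the length is big-endian, per the
discussion further down):

    // Hypothetical sketch: 1 flag byte (1 = compressed, 0 = not),
    // then the 2-byte block length, high byte first.
    #include <cstdint>
    #include <cstdio>

    void WriteBlockHeader(FILE* out, bool compressed, uint16_t length) {
      unsigned char header[3];
      header[0] = compressed ? 1 : 0;
      header[1] = static_cast<unsigned char>(length >> 8);
      header[2] = static_cast<unsigned char>(length & 0xff);
      fwrite(header, 1, sizeof(header), out);
    }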
Our main requirements were speed and the ability to concatenate compressed
files. The gzip format allows concatenation, but the common Java libraries
don't support this. We avoided writing a checksum for simplicity and speed.
The format doesn't currently have a header (magic number), but using a whole
byte for the compressed flag allows adding one later.
It would be nice to have a standard streaming format and tools. We're going to
try to get the Hadoop project to use this format too (which is our primary use
case).
Original comment by electrum
on 26 Sep 2011 at 5:24
The ability to concatenate is an interesting feature. Something that combined
this with the ability to detect the file format would be best, though, so you
wouldn't need yet another container format for that.
Not doing checksumming sounds a bit suboptimal; you can do it really cheaply
on modern CPUs (gigabytes per second per core), especially since the data is
already going to be in the L1 cache. Especially with multiple implementations
starting to float around (Java vs. C++ vs. Go), it's easy for something
subtle to go wrong.
Original comment by se...@google.com
on 28 Sep 2011 at 10:28
Steinar, you have a good point about checksums.
We updated the stream format to contain the masked CRC32C of the input data,
providing protection against corruption or a buggy implementation. We also
added a file header, "snappy\0", which happens to be the same size (7 bytes)
as the block header. The file header may precede any block header one or more
times, thus supporting concatenation, including "empty" files (that contain
only the file header).
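
For reference, the masking leveldb defines in util/crc32c.h (which the linked
Java code follows) rotates the CRC and adds a constant, so that a CRC stored
inside checksummed data behaves well when a CRC is computed over it:

    // Masked CRC32C as in leveldb's util/crc32c.h: rotate right by
    // 15 bits and add a constant; Unmask reverses it.
    #include <cstdint>

    const uint32_t kMaskDelta = 0xa282ead8u;

    uint32_t Mask(uint32_t crc) {
      return ((crc >> 15) | (crc << 17)) + kMaskDelta;
    }

    uint32_t Unmask(uint32_t masked) {
      uint32_t rot = masked - kMaskDelta;
      return (rot >> 17) | (rot << 15);
    }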
See the SnappyOutputStream link above for the formal description. Does this
format sound reasonable to standardize?
Original comment by electrum
on 30 Sep 2011 at 8:32
OK, this is starting to sound pretty good to me -- I should probably get
somebody else in here to look at it as well, but it's becoming reasonable.
Some questions (mostly nits):
- What do you need the uncompressed/compressed flag for? In what situations would you want to store the data uncompressed?
- Is the length 16-bit signed or unsigned? Why is it 32768 and not 32767 or 65535?
- Should the lengths really be stored big-endian, when all other numbers in Snappy are stored little-endian?
- Can you verify that the CRC32c polynomial you're using is compatible with what the SSE4 CRC32 instruction computes? It sounds reasonable that if we're defining a new format, an implementation in native code should be able to make use of that instruction.
Thanks!
Original comment by se...@google.com
on 3 Oct 2011 at 9:50
Some drive-by comments:
For the uncompressed/compressed flag: leveldb's tables use Snappy, but if the
compression doesn't save more than 12.5% of the bytes, the block is left
uncompressed on disk:
http://code.google.com/p/leveldb/source/browse/table/table_builder.cc#147
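
A minimal sketch of that cutoff (the shape of leveldb's check, not its exact
code; the function name is mine):

    // Keep the compressed block only if it saves more than 12.5%,
    // i.e. the compressed size is under 7/8 of the raw size; otherwise
    // the caller stores the raw bytes with the 'compressed' flag cleared.
    #include <snappy.h>
    #include <string>

    bool MaybeCompress(const std::string& raw, std::string* compressed) {
      snappy::Compress(raw.data(), raw.size(), compressed);
      return compressed->size() < raw.size() - (raw.size() / 8);
    }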
For checksums, it looks like github.com/dain is using the same CRC32c-based
checksum as leveldb:
https://github.com/dain/snappy/blob/master/src/main/java/org/iq80/snappy/Crc32C.java
http://code.google.com/p/leveldb/source/browse/util/crc32c.h#28
Original comment by nigel.ta...@gmail.com
on 3 Oct 2011 at 10:32
Here is a concrete proposal. It is possibly too complicated, but it does let a
.snappy file start with a 7-byte magic header, and also allows concatenating
multiple .snappy files together.
The byte stream is a series of frames, and each frame has a header and a body.
The header is always 7 bytes. The body has variable length, in the range [0,
65535].
The first header byte is flags:
- bit 0 is comment,
- bit 1 is compressed,
- bit 2 is meta,
- bits 3-7 are unused.
The comment bit means that the rest of the header is ignored (including any
other flag bits), and the body has zero length. Thus, "sNaPpY\x00" is a valid
comment header, since 's' is 0x73.
For non-comment headers, the remaining 6 bytes form a uint16 followed by a
uint32, both little-endian. The uint16 is the body length. The uint32 is a
CRC32c checksum, the same as used by leveldb. This differs from the Java code
linked to above in that it's little-endian (like the rest of Snappy), and the
maximum body length is 65535, not 32768.
The compressed bit means that the body is Snappy-compressed, and that the body
length and checksum refer to the compressed bytes. If the bit is off, the body
is uncompressed, and the body length and checksum refer to the uncompressed
bytes. Each frame's compression is independent of any other frame.
The meta bit means that the body is metadata, and not part of the data stream.
This is a file format extension mechanism, but there are no recognized
extensions at this time.
A conforming decoder can simply skip every frame with the comment or meta bits
set.
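
To make the proposal concrete, here is a sketch of parsing the 7-byte header
described above (the struct and function names are mine, not part of the
proposal):

    // One 7-byte frame header: 1 flags byte, then a little-endian
    // uint16 body length and a little-endian uint32 CRC32c.
    #include <cstdint>

    struct FrameHeader {
      bool comment;      // bit 0: rest of header ignored, zero-length body
      bool compressed;   // bit 1: body is Snappy-compressed
      bool meta;         // bit 2: body is metadata, not stream data
      uint16_t body_length;
      uint32_t crc32c;   // checksum of the body bytes as stored
    };

    FrameHeader ParseHeader(const unsigned char h[7]) {
      FrameHeader f;
      f.comment = (h[0] & 1) != 0;
      f.compressed = (h[0] & 2) != 0;
      f.meta = (h[0] & 4) != 0;
      if (f.comment) {
        // Comment frames ignore the rest of the header and carry no body.
        f.body_length = 0;
        f.crc32c = 0;
      } else {
        f.body_length = static_cast<uint16_t>(h[1] | (h[2] << 8));
        f.crc32c = h[3] | (h[4] << 8) | (h[5] << 16) |
                   (static_cast<uint32_t>(h[6]) << 24);
      }
      return f;
    }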
Original comment by nigel.ta...@gmail.com
on 4 Oct 2011 at 11:03
I've written a Go implementation of that proposal at
http://codereview.appspot.com/5167058. It could probably do with a few more
comments, but as it is, it's about 250 lines of code.
I added an additional restriction that both the compressed and uncompressed
lengths of a frame body have to be < 65536, not just the compressed length.
This restriction means that I can allocate all my buffers up front. Thus, once
I've started decoding, I don't need to do any extra mallocs regardless of how
long the stream is, or whether the uncompressed stream data looks like
"AAAAAAAA...".
Original comment by nigel.ta...@gmail.com
on 4 Oct 2011 at 1:15
Answers to Steinar's questions:
Why the uncompressed/compressed flag? As mentioned above by Nigel, for the
same reason that leveldb does it. Because Snappy's goal is speed and it
doesn't compress as well as slower algorithms like zlib, it makes sense to
sacrifice a little more space for speed. (We chose the same cutoff as
leveldb, 12.5%, but the cutoff is independent of the format.)
The 16-bit length is unsigned. Why 32768 and not 65535? Two reasons. First,
it matches Snappy's internal block size. Because Snappy will split larger
blocks, the only potential gain is fewer chunk headers. Second, it is a power
of two. If you use 65535 and compress 64k (65536) bytes of data, then you end
up with two chunks, with the second chunk being only 1 byte.
Should the length be big-endian or little-endian? We chose big-endian because
that's common for file formats and network protocols. Given that Snappy uses
little-endian, I have no objection to changing it.
The CRC32C was chosen specifically to be compatible with the SSE4 instruction.
It's a bug if it's not. The Java implementation uses the CRC32C code from
Hadoop, which we haven't verified extensively, but it matched in cursory checks
against the Python leveldb reader.
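
For anyone who wants to verify compatibility, a byte-at-a-time C++ sketch of
CRC32C via the SSE4.2 instruction (compile with -msse4.2; real
implementations process 8 bytes per step):

    // CRC32C (Castagnoli polynomial) using the SSE4.2 crc32 instruction.
    #include <nmmintrin.h>
    #include <cstdint>
    #include <cstddef>

    uint32_t Crc32c(const unsigned char* data, size_t n) {
      uint32_t crc = 0xFFFFFFFFu;        // standard CRC32C initial value
      for (size_t i = 0; i < n; i++) {
        crc = _mm_crc32_u8(crc, data[i]);
      }
      return ~crc;                       // standard final inversion
    }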
Original comment by electrum
on 5 Oct 2011 at 6:03
Original issue reported on code.google.com by
nathan.o...@gmail.com
on 18 Apr 2011 at 3:41