wang-nima / snappy

Automatically exported from code.google.com/p/snappy

Command line tool #34

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
This library would likely be directly useful to a lot more people if a simple 
command line program to compress/decompress from stdin to stdout was included.

Original issue reported on code.google.com by nathan.o...@gmail.com on 18 Apr 2011 at 3:41

GoogleCodeExporter commented 9 years ago
You can use snappy_unittest for this; it's crude, but it works.

What's the intended use case? For disk-to-disk compression, usually you have 
CPU time for something like gzip -1 instead.

Original comment by se...@google.com on 18 Apr 2011 at 7:12

GoogleCodeExporter commented 9 years ago
It would just be useful in the same way that lzop is useful, as a general 
pipeline tool.  E.g. disk-to-different-disk, process-to-ssh, etc.

Original comment by yaa...@gmail.com on 22 Apr 2011 at 2:40

GoogleCodeExporter commented 9 years ago

Original comment by se...@google.com on 26 Apr 2011 at 12:56

GoogleCodeExporter commented 9 years ago
Attached patch:

1) Adds streaming support, at least for streams created with the current 
compressor
2) Creates command line tools =snzip= and =snunzip=.  Both work solely with 
standard input and output, making them most useful for pipes.

The resulting tool passes basic sanity checks (compress/decompress) and seems to 
have acceptable performance.  It has one limitation: for files larger than 
64K, the reported file size will differ from the actual file size (since the 
header must be output before the entire stream is received).

My C++ is rusty to nonexistent, so style/culture fixes are welcome.

Original comment by PavPanch...@gmail.com on 17 Jun 2011 at 9:24

GoogleCodeExporter commented 9 years ago
I made another patch, snzip.dif, which builds snzip.
Its options are similar to those of gzip and bzip2, as follows.

To compress file.tar:
 snzip file.tar

  The compressed file is named 'file.tar.snz' and the original file is deleted.
  Timestamp, mode and permissions are preserved as far as possible.

To compress file.tar and write to standard output:
 snzip -c file.tar > file.tar.snz
or
 cat file.tar | snzip > file.tar.snz

To uncompress file.tar.snz:

 snzip -d file.tar.snz
or
 snunzip file.tar.snz

  The uncompressed file is named 'file.tar' and the original file is deleted.
  Timestamp, mode and permissions are preserved as far as possible.

  If the program name includes 'un', as in snunzip, it acts as if '-d' were set.

To uncompress file.tar.snz and write to standard output:

 snzip -dc file.tar.snz > file.tar
 snunzip -c file.tar.snz > file.tar
 snzcat file.tar.snz > file.tar
 cat file.tar.snz | snzcat > file.tar

  If the program name includes 'cat', as in snzcat, it acts as if '-dc' were set.

It has been tested on Linux and will work on other Unix-like OSs.
On Windows it needs a getopt(3)-compatible function; public domain 
implementations are available in many places.

Original comment by kubo.tak...@gmail.com on 31 Jul 2011 at 12:12

GoogleCodeExporter commented 9 years ago
Sorry, I failed to attach a correct file.
I attached a new one.

Original comment by kubo.tak...@gmail.com on 31 Jul 2011 at 12:16

GoogleCodeExporter commented 9 years ago
kubo your patch seems to work well; i did have to make one change for missing 
'PACKAGE_STRING' and it was not being compiled correctly by default when i do 
'make snzip', but the utility is exactly what i was looking for. I've also 
added a -v to print out the version 1.0.3

Original comment by jehiah on 12 Aug 2011 at 7:52

GoogleCodeExporter commented 9 years ago
I made a new patch to support mingw32 and cygwin.

> kubo your patch seems to work well; i did have to make one change for missing 
> 'PACKAGE_STRING' and it was not being compiled correctly by default when i do 
> 'make snzip', but the utility is exactly what i was looking for. I've also 
> added a -v to print out the version 1.0.3

The missing macro 'PACKAGE_STRING' is defined in config.h by autoconf.
What version of autoconf do you use? I'm using autoconf 2.65.

I would also prefer the '-v' option to print out the version. But gzip and bzip2
use it for verbose output, so I didn't add it.

Original comment by kubo.tak...@gmail.com on 21 Aug 2011 at 9:37

GoogleCodeExporter commented 9 years ago
>> i did have to make one change for missing 'PACKAGE_STRING' and it was not 
being compiled correctly by default when i do 'make snzip'

Could you provide your changes?

Original comment by and...@inffinity.com on 22 Sep 2011 at 2:31

GoogleCodeExporter commented 9 years ago
Guys, I'm a little confused here. Shouldn't downloading snappy.h allow you to 
simply run this command:

snappy::Compress('/tmp/testfileinput', '/tmp/testfileoutput');

from within your C++ code? Just two simple string inputs?

Original comment by mina.mou...@hotmail.com on 24 Sep 2011 at 5:00

GoogleCodeExporter commented 9 years ago
@mina.moussa : 

snappy::Compress reads the full input, performs compression, writes the full 
output.

imagine you have 5 TB of data to compress ... what do you do? well, you can buy 
lots of ram and harddisks to swap to while the compression happens. 

or better yet you can write a loop that reads in chunks of the file, runs them 
through snappy::Compress and writes each chunk to an output file with a 
container format that can later be decompressed by reading in discrete chunks 
and decompressing them. 
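
A minimal sketch of such a loop, under stated assumptions: the record layout 
(a 4-byte little-endian length before each chunk) is purely illustrative and is 
not one of the formats proposed in this thread, and the names Codec, 
EncodeStream and DecodeStream are hypothetical. The compressor is passed in as 
a function so the sketch builds without Snappy; real code would wrap 
snappy::Compress and snappy::Uncompress there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical hook: maps one chunk of bytes to its encoded (or decoded) form.
using Codec = std::function<std::string(const std::string&)>;

// Reads `in` in chunks of at most `chunk_size` bytes and writes each encoded
// chunk as: 4-byte little-endian length, then the encoded bytes.
inline void EncodeStream(std::istream& in, std::ostream& out,
                         std::size_t chunk_size, const Codec& compress) {
  assert(chunk_size > 0);
  std::vector<char> buf(chunk_size);
  for (;;) {
    in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
    const std::streamsize got = in.gcount();
    if (got <= 0) break;  // end of input
    const std::string body =
        compress(std::string(buf.data(), static_cast<std::size_t>(got)));
    const std::uint32_t n = static_cast<std::uint32_t>(body.size());
    const char len[4] = {static_cast<char>(n & 0xff),
                         static_cast<char>((n >> 8) & 0xff),
                         static_cast<char>((n >> 16) & 0xff),
                         static_cast<char>((n >> 24) & 0xff)};
    out.write(len, 4);
    out.write(body.data(), static_cast<std::streamsize>(body.size()));
  }
}

// Inverse of EncodeStream: only one chunk is held in memory at a time.
inline void DecodeStream(std::istream& in, std::ostream& out,
                         const Codec& uncompress) {
  unsigned char len[4];
  while (in.read(reinterpret_cast<char*>(len), 4)) {
    const std::uint32_t n = static_cast<std::uint32_t>(len[0]) |
                            (static_cast<std::uint32_t>(len[1]) << 8) |
                            (static_cast<std::uint32_t>(len[2]) << 16) |
                            (static_cast<std::uint32_t>(len[3]) << 24);
    std::string body(n, '\0');
    if (n > 0 && !in.read(&body[0], static_cast<std::streamsize>(n)))
      throw std::runtime_error("truncated chunk body");
    const std::string plain = uncompress(body);
    out.write(plain.data(), static_cast<std::streamsize>(plain.size()));
  }
  // A clean end of stream leaves zero bytes read by the failed length read.
  if (in.gcount() != 0) throw std::runtime_error("truncated length prefix");
}
```

Memory use is bounded by the chunk size rather than by the input size, which 
is the point of the 5 TB example above.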

Though I haven't played with these command line tools, if they behave properly 
they should spare you from inventing a container file format and from writing 
the chunking loop yourself: you stream input to the tool, and it streams 
compressed/decompressed output.

Original comment by dwil...@builderadius.com on 26 Sep 2011 at 2:49

GoogleCodeExporter commented 9 years ago
Yes, this is probably what you'd want for a command-line tool supporting pipes: 
A simple framing format. For each block, probably the compressed length (the 
uncompressed length is already in the format), perhaps some flags (EOF?), and 
the CRC32c of the uncompressed data in that block.

Original comment by se...@google.com on 26 Sep 2011 at 2:54

GoogleCodeExporter commented 9 years ago
We have a simple framing format for streaming in the Java port of Snappy:

https://github.com/dain/snappy/blob/master/src/main/java/org/iq80/snappy/SnappyOutputStream.java

Each 32k block is preceded by a 3-byte header, which is a 1-byte flag 
indicating if the block is compressed or not, and a 2-byte length of the block.
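
The 3-byte header can be sketched roughly as below. This is an illustration, 
not the Java port's code: the struct and function names are hypothetical, the 
flag values (1 for compressed, 0 for uncompressed) are an assumption, and the 
big-endian unsigned length follows what is said about this format later in the 
thread.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical in-memory view of the 3-byte block header.
struct BlockHeader {
  bool compressed;       // assumed encoding: 1 = compressed, 0 = stored as-is
  std::uint16_t length;  // length of the block that follows the header
};

// Serializes the header: flag byte, then the length, high byte first.
inline std::array<std::uint8_t, 3> EncodeBlockHeader(const BlockHeader& h) {
  return {{static_cast<std::uint8_t>(h.compressed ? 1 : 0),
           static_cast<std::uint8_t>(h.length >> 8),
           static_cast<std::uint8_t>(h.length & 0xff)}};
}

// Parses the three bytes back into a BlockHeader.
inline BlockHeader DecodeBlockHeader(const std::array<std::uint8_t, 3>& b) {
  return BlockHeader{b[0] != 0,
                     static_cast<std::uint16_t>((b[1] << 8) | b[2])};
}
```

A full 32k block, for example, gets the length bytes 0x80 0x00.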

Our main requirements were speed and the ability to concatenate compressed 
files.  The gzip format allows concatenation, but the common Java libraries 
don't support this.  We avoided writing a checksum for simplicity and speed.  
The format doesn't currently have a header (magic number), but using a whole 
byte for the compressed flag allows adding one later.

It would be nice to have a standard streaming format and tools.  We're going to 
try to get the Hadoop project to use this format too (which is our primary use 
case).

Original comment by electrum on 26 Sep 2011 at 5:24

GoogleCodeExporter commented 9 years ago
The ability to concatenate is an interesting feature. Something that would 
combine this with the ability to detect file format would be the best, though, 
so you won't need yet another container format for that.

Not doing checksumming sounds a bit suboptimal; you can do it really cheaply on 
modern CPUs (gigabytes per second per core), especially since the data is 
already going to be in the L1 cache. Especially with multiple implementations 
starting to float around (Java vs. C++ vs. Go), it's easy to get something 
subtle going wrong.

Original comment by se...@google.com on 28 Sep 2011 at 10:28

GoogleCodeExporter commented 9 years ago
Steinar, you have a good point about checksums.

We updated the stream format to contain the masked CRC32C of the input data, 
providing protection against corruption or a buggy implementation.  We also 
added a file header, "snappy\0", which happens to be the same size (7 bytes) as 
the block header.  The file header may precede any block header one or more 
times, thus supporting concatenation, including of "empty" files (that contain 
only the file header).

See the SnappyOutputStream link above for the formal description.  Does this 
format sound reasonable to standardize?

Original comment by electrum on 30 Sep 2011 at 8:32

GoogleCodeExporter commented 9 years ago
OK, this starts to sound pretty good to me -- I should probably get somebody 
else in here to look at it as well, but it starts to become reasonable.

Some questions (mostly nits):

 - What do you need the uncompressed/compressed flag for? In what situations would you want to store the data uncompressed?
 - Is the length 16-bit signed or unsigned? Why is it 32768 and not 32767 or 65535?
 - Should the lengths really be stored big-endian, when all other numbers in Snappy are stored little-endian?
 - Can you verify that the CRC32c polynomial you're using is compatible with what the SSE4 CRC32 instruction computes? It sounds reasonable that if we're defining a new format, an implementation in native code should be able to make use of that instruction.

Thanks!

Original comment by se...@google.com on 3 Oct 2011 at 9:50

GoogleCodeExporter commented 9 years ago
Some drive-by comments:

For the uncompressed/compressed flag: leveldb's tables use Snappy, but if the 
compression doesn't save more than 12.5% of the bytes, then the block is left 
uncompressed on disk:
http://code.google.com/p/leveldb/source/browse/table/table_builder.cc#147
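
The 12.5% rule can be written as a one-line check. This is a sketch modeled on 
the rule as described here; the function name is hypothetical, and the exact 
comparison (strictly less than seven eighths of the raw size, using integer 
division) is an assumption about how the cutoff is rounded.

```cpp
#include <cstddef>

// Returns true when the compressed block saves more than one eighth (12.5%)
// of the raw size, so it is worth storing in compressed form. Otherwise the
// caller would store the raw bytes and clear the "compressed" flag.
inline bool WorthStoringCompressed(std::size_t raw_size,
                                   std::size_t compressed_size) {
  return compressed_size < raw_size - raw_size / 8;
}
```

For a 4096-byte block the cutoff is 3584 bytes: a 3583-byte result is kept 
compressed, a 3584-byte result is not.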

For checksums, it looks like github.com/dain is using the same CRC32c-based 
checksum as leveldb:
https://github.com/dain/snappy/blob/master/src/main/java/org/iq80/snappy/Crc32C.java
http://code.google.com/p/leveldb/source/browse/util/crc32c.h#28
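
For reference, a slow but portable sketch of that checksum. Crc32c below is a 
plain bitwise CRC-32C (Castagnoli polynomial, reflected form 0x82F63B78), which 
is the polynomial the SSE4.2 CRC32 instruction uses. Mask and Unmask reproduce 
the rotate-and-add masking as I recall it from leveldb's crc32c.h; treat the 
constant and rotation as an assumption to check against that header. The 
function names here are illustrative.

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32C: initial value and final XOR are both 0xFFFFFFFF.
inline std::uint32_t Crc32c(const void* data, std::size_t n) {
  const unsigned char* p = static_cast<const unsigned char*>(data);
  std::uint32_t crc = 0xFFFFFFFFu;
  for (std::size_t i = 0; i < n; ++i) {
    crc ^= p[i];
    for (int k = 0; k < 8; ++k) {
      // XOR in the polynomial only when the low bit is set.
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

// Assumed to match leveldb: rotate right by 15 bits, then add a constant.
const std::uint32_t kMaskDelta = 0xa282ead8u;

inline std::uint32_t Mask(std::uint32_t crc) {
  return ((crc >> 15) | (crc << 17)) + kMaskDelta;
}

inline std::uint32_t Unmask(std::uint32_t masked) {
  const std::uint32_t rot = masked - kMaskDelta;
  return (rot >> 17) | (rot << 15);
}
```

The standard check value for CRC-32C is 0xE3069283 for the ASCII string 
"123456789", which is a quick way to compare implementations across languages.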

Original comment by nigel.ta...@gmail.com on 3 Oct 2011 at 10:32

GoogleCodeExporter commented 9 years ago
Here is a concrete proposal. It is possibly too complicated, but it does let a 
.snappy file start with a 7-byte magic header, and also allows concatenating 
multiple .snappy files together.

The byte stream is a series of frames, and each frame has a header and a body. 
The header is always 7 bytes. The body has variable length, in the range [0, 
65535].

The first header byte is flags:
  - bit 0 is comment,
  - bit 1 is compressed,
  - bit 2 is meta,
  - bits 3-7 are unused.

The comment bit means that the rest of the header is ignored (including any 
other flag bits), and the body has zero length. Thus, "sNaPpY\x00" is a valid 
comment header, since 's' is 0x73.

For non-comment headers, the remaining 6 bytes form a uint16 followed by a 
uint32, both little-endian. The uint16 is the body length. The uint32 is a 
CRC32c checksum, the same as used by leveldb. This differs from the Java code 
linked to above in that it's little-endian (like the rest of Snappy), and the 
maximum body length is 65535, not 32768.

The compressed bit means that the body is Snappy-compressed, and that the body 
length and checksum refer to the compressed bytes. If the bit is off, the body 
is uncompressed, and the body length and checksum refer to the uncompressed 
bytes. Each frame's compression is independent of any other frame.

The meta bit means that the body is metadata, and not part of the data stream. 
This is a file format extension mechanism, but there are no recognized 
extensions at this time.

A conforming decoder can simply skip every frame with the comment or meta bits 
set.
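
A decoder's view of the 7-byte header described above might look like the 
following sketch. The type and function names are hypothetical; the bit 
positions and little-endian fields are taken from the description, with bit 0 
read as the least significant bit (consistent with 's' = 0x73 having the 
comment bit set).

```cpp
#include <cstdint>

const std::uint8_t kCommentBit = 1u << 0;
const std::uint8_t kCompressedBit = 1u << 1;
const std::uint8_t kMetaBit = 1u << 2;

struct FrameHeader {
  bool comment;
  bool compressed;
  bool meta;
  std::uint16_t body_length;  // 0 for comment frames
  std::uint32_t checksum;     // CRC32c of the body as stored
};

// Parses a header; `b` must point at exactly 7 readable bytes.
inline FrameHeader ParseFrameHeader(const std::uint8_t* b) {
  FrameHeader h{};
  h.comment = (b[0] & kCommentBit) != 0;
  if (h.comment) return h;  // rest of the header is ignored, body is empty
  h.compressed = (b[0] & kCompressedBit) != 0;
  h.meta = (b[0] & kMetaBit) != 0;
  h.body_length = static_cast<std::uint16_t>(b[1] | (b[2] << 8));
  h.checksum = static_cast<std::uint32_t>(b[3]) |
               (static_cast<std::uint32_t>(b[4]) << 8) |
               (static_cast<std::uint32_t>(b[5]) << 16) |
               (static_cast<std::uint32_t>(b[6]) << 24);
  return h;
}

// A conforming decoder can skip any frame for which this returns true.
inline bool IsSkippable(const FrameHeader& h) { return h.comment || h.meta; }
```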

Original comment by nigel.ta...@gmail.com on 4 Oct 2011 at 11:03

GoogleCodeExporter commented 9 years ago
I've written a Go implementation of that proposal at 
http://codereview.appspot.com/5167058. It could probably do with a few more 
comments, but as it is, it's about 250 lines of code.

I added an additional restriction that both the compressed and uncompressed 
lengths of a frame body have to be < 65536, not just the compressed length. 
This restriction means that I can allocate all my buffers up front. Thus, once 
I've started decoding, I don't need to do any extra mallocs regardless of how 
long the stream is, or whether the uncompressed stream data looks like 
"AAAAAAAA...".

Original comment by nigel.ta...@gmail.com on 4 Oct 2011 at 1:15

GoogleCodeExporter commented 9 years ago
Answers to Steinar's questions:

Why the uncompressed/compressed flag?  As mentioned above by Nigel, for the 
same reason that leveldb does it.  Because Snappy's goal is speed, and it 
doesn't compress as well as slower algorithms like zlib, it makes sense to 
sacrifice a little more space for speed.  (We chose the same cutoff as leveldb, 
12.5%, but the cutoff is independent of the format.)

The 16-bit length is unsigned.  Why 32768 and not 65535?  Two reasons.  First, 
it matches Snappy's internal block size.  Because Snappy will split larger 
blocks, the only potential gain is fewer chunk headers.  Second, it is a power 
of two.  If you use 65535 and compress 64k (65536) bytes of data, then you end 
up with two chunks, with the second chunk being only 1 byte.
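
The arithmetic behind the second reason, as a small sketch (the names Split 
and SplitInto are illustrative):

```cpp
#include <cassert>
#include <cstddef>

struct Split {
  std::size_t chunks;      // number of chunks needed
  std::size_t last_chunk;  // size in bytes of the final chunk
};

// Splits `total` bytes into chunks of at most `max_chunk` bytes each.
inline Split SplitInto(std::size_t total, std::size_t max_chunk) {
  assert(max_chunk > 0);
  if (total == 0) return Split{0, 0};
  const std::size_t chunks = (total + max_chunk - 1) / max_chunk;  // ceiling
  const std::size_t last = total - (chunks - 1) * max_chunk;
  return Split{chunks, last};
}
```

With a 65535-byte limit, SplitInto(65536, 65535) yields two chunks with a 
1-byte tail, while SplitInto(65536, 32768) yields two full 32768-byte chunks.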

Should the length be big endian or little endian?  We chose big endian because 
that's common for file formats and network protocols.  Given that Snappy uses 
little endian, I have no objections to changing it.

The CRC32C was chosen specifically to be compatible with the SSE4 instruction.  
It's a bug if it's not.  The Java implementation uses the CRC32C code from 
Hadoop, which we haven't verified extensively, but it matched in cursory checks 
against the Python leveldb reader.

Original comment by electrum on 5 Oct 2011 at 6:03