kubo / snzip

Snzip, a compression/decompression tool based on snappy

hadoop snappy format #12

Closed kubo closed 8 years ago

kubo commented 8 years ago

Requested by @aeroevan in https://github.com/kubo/snzip/issues/11#issuecomment-169473389 and https://github.com/kubo/snzip/issues/11#issuecomment-170593741.

From reading BlockCompressorStream.java and BlockDecompressorStream.java, I infer that the hadoop-snappy format is as follows.
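As I read it, the stream is a sequence of blocks, each starting with a big-endian 4-byte uncompressed length, followed by one or more chunks that each carry a big-endian 4-byte compressed length and the snappy-compressed bytes. A minimal framing parser sketch, based on my reading of those two Java files; it assumes one chunk per block (which holds when a block fits in one compression buffer) and returns chunks still compressed rather than pulling in a snappy library:

```python
import struct

def parse_hadoop_snappy_frames(data):
    """Split a hadoop-snappy stream into (uncompressed_len, chunk) pairs.

    Framing only -- chunks are returned still snappy-compressed. Sketch based
    on BlockCompressorStream.java; a real reader would decompress each chunk
    and count uncompressed bytes to know when a block ends. Assumes one chunk
    per block.
    """
    pos, blocks = 0, []
    while pos < len(data):
        # 4-byte big-endian uncompressed length of the block
        (uncompressed_len,) = struct.unpack(">I", data[pos:pos + 4])
        pos += 4
        # 4-byte big-endian compressed length of the chunk, then the chunk
        (chunk_len,) = struct.unpack(">I", data[pos:pos + 4])
        pos += 4
        blocks.append((uncompressed_len, data[pos:pos + chunk_len]))
        pos += chunk_len
    return blocks
```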

kubo commented 8 years ago

@aeroevan I committed 04a132d1aed63b061d7b9a0d1e312f20a91c5bd8 and c3e5926acf452c55d759781da1d24681e5faab35 to support hadoop-snappy format. Could you try it?

To compile from source at GitHub:

# install autoconf and automake in advance
git clone --depth=1 git://github.com/kubo/snzip.git
cd snzip
./autogen.sh
./configure
make

To compress a file:

snzip -t hadoop-snappy file_name_to_be_compressed

Note: The default block size used by snzip for the hadoop-snappy format is 256k, which matches the default value of the io.compression.codec.snappy.buffersize parameter. If the block size is larger than that parameter, hadoop fails with an InternalError "Could not decompress data. Buffer length is too small" while reading a file compressed by snzip. If you get that error, change the block size with the -b option as follows.

snzip -t hadoop-snappy -b 32768 file_name_to_be_compressed  # 32768 = 32 * 1024
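To check whether an already-compressed file will trip that buffer error, you can look at its first block header: the first 4 bytes of a hadoop-snappy stream hold the uncompressed length of the first block, big-endian. A small hypothetical helper (not part of snzip) sketching that check:

```python
import struct

def first_block_uncompressed_len(path):
    # The first 4 bytes of a hadoop-snappy stream are the big-endian
    # uncompressed length of the first block. If this exceeds
    # io.compression.codec.snappy.buffersize on the Hadoop side, the
    # decompressor's buffer is too small and reading fails.
    with open(path, "rb") as f:
        return struct.unpack(">I", f.read(4))[0]
```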

To uncompress a file:

snzip compressed_file.snappy

Note: snzip may fail to detect the file format if the block size (io.compression.codec.snappy.buffersize) is larger than 256k. In that case, pass -t hadoop-snappy to specify the file format explicitly.
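A plausible reason for the detection limit: a reader can only guess the format by checking whether the leading big-endian block length looks sane, and 256k is a natural cutoff given the default buffer size. A rough heuristic sketch of such a check; this is my assumption of the idea, not snzip's actual detection code:

```python
import struct

# Assumed upper bound for auto-detection, matching the 256k default
# block size; not a value taken from snzip's source.
MAX_AUTODETECT_BLOCK = 256 * 1024

def looks_like_hadoop_snappy(header8):
    # Heuristic: the stream is plausibly hadoop-snappy if the big-endian
    # uncompressed length in the first 4 bytes is nonzero and no larger
    # than the assumed maximum block size.
    (ulen,) = struct.unpack(">I", header8[:4])
    return 0 < ulen <= MAX_AUTODETECT_BLOCK
```

A file written with a larger -b value would fail this kind of check even though it is valid, which is why -t hadoop-snappy is needed then.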

aeroevan commented 8 years ago

I've tested decompressing a few files and it looks good.

I'll try to compress a few files and make sure the usual hadoop tools can read the data some time next week.

Thanks!

kubo commented 8 years ago

Thanks a lot for your testing and thanks in advance!

zavyrylin commented 8 years ago

@kubo I've done some "smoke" testing today. My hadoop installation could read and process several files compressed by snzip without any errors. The sizes of files compressed by hadoop and by snzip differ, but uncompressing them yields data with identical checksums.

kubo commented 8 years ago

@zavyrylin Thanks a lot! I'll release snzip 1.0.3 next weekend.

zavyrylin commented 8 years ago

Great!