kubo / snzip

Snzip, a compression/decompression tool based on snappy

raw format #11

Closed brackxm closed 8 years ago

brackxm commented 9 years ago

what about a raw format option? maybe with a maximum size limit

stevevaughan commented 9 years ago

+1 - treating data as unframed/raw if the frame headers aren't found is preferable to failing out with "Unknown file header" (or forcing people to prepend something static)
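The fallback suggested here could hinge on probing for the stream identifier chunk that the snappy framing format mandates at the start of every framed stream (chunk type 0xff, 3-byte length 6, payload "sNaPpY"). A minimal Python sketch of that detection, illustrative only and not snzip's actual code:

```python
# The snappy framing format begins every stream with a fixed
# stream identifier chunk: type 0xff, little-endian length 6,
# then the literal bytes "sNaPpY".
SNAPPY_FRAMING_MAGIC = b"\xff\x06\x00\x00sNaPpY"

def looks_framed(data: bytes) -> bool:
    """Return True if the buffer starts with the framing stream identifier.

    Anything else would be treated as raw snappy data (optionally
    subject to a maximum size limit, as proposed above)."""
    return data.startswith(SNAPPY_FRAMING_MAGIC)
```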

aeroevan commented 8 years ago

Is this what hadoop-snappy uses? If so, +1.

kubo commented 8 years ago

@aeroevan

Is this what hadoop-snappy uses?

No. Hadoop-snappy uses its own framing format. I inferred its format by reading BlockCompressorStream.java and BlockDecompressorStream.java, but I'm not sure it is correct because I don't have real data. A data file on disk may have additional leading and trailing parts beyond what those two Java files produce.

If you send me sample data compressed by hadoop-snappy, I may add a hadoop-snappy decompressor and compressor.

As for the raw format, I did about half of the work half a year ago but have not completed it...

aeroevan commented 8 years ago

See: iris.zip

The .snappy (generated on a HDP 2.3 cluster) and plain text .csv are both in the zip.

kubo commented 8 years ago

@aeroevan Thank you.

I suppose the format of hadoop-snappy is as follows: a file is a sequence of blocks, where each block consists of a 4-byte uncompressed length followed by one or more subblocks, and each subblock consists of a 4-byte compressed length followed by raw snappy-compressed data.

The contents of iris.zip look like what I suppose. iris.snappy consists of one block. The block consists of an uncompressed length (4550 in decimal) and one subblock; the subblock consists of a compressed length (1476 in decimal) and raw compressed data. 4550 is the file size of iris.csv, and 4 + 4 + 1476 = 1484 is the file size of iris.snappy.
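The layout described above could be walked with a small parser. A Python sketch, under two assumptions not confirmed in this thread: the 4-byte lengths are big-endian (matching Java's DataOutputStream.writeInt convention), and each block holds exactly one subblock, as in the iris sample. The function name is mine, not part of snzip:

```python
import struct

def parse_hadoop_snappy(data: bytes):
    """Walk the inferred hadoop-snappy framing without decompressing.

    Each block: 4-byte uncompressed length, then one subblock of
    (4-byte compressed length, raw snappy data). Lengths are assumed
    big-endian. Knowing where one block's subblocks end and the next
    block begins would in general require decompressing, so this
    sketch assumes one subblock per block, as in iris.snappy.
    Returns a list of (uncompressed_len, compressed_chunk) pairs."""
    blocks = []
    pos = 0
    while pos < len(data):
        (uncompressed_len,) = struct.unpack_from(">I", data, pos)
        pos += 4
        (compressed_len,) = struct.unpack_from(">I", data, pos)
        pos += 4
        blocks.append((uncompressed_len, data[pos:pos + compressed_len]))
        pos += compressed_len
    return blocks
```

For iris.snappy this yields a single pair whose uncompressed length is 4550 and whose chunk is 1476 bytes, consistent with the file sizes above.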