+1 - treating data as unframed/raw if the frame headers aren't found is preferable to failing out with "Unknown file header" (or forcing people to prepend something static)
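For context, the snappy framing format begins with a fixed stream identifier chunk (chunk type 0xff, a 3-byte length of 6, then the ASCII text "sNaPpY"), so the fallback suggested above could be a cheap prefix check. A minimal sketch in Java, purely for illustration; the class and method names are made up:

```java
import java.util.Arrays;

// Hypothetical sketch of the fallback suggested above: treat input as raw
// snappy when the framing-format stream identifier is not present.
public class FormatDetect {
    // Stream identifier chunk of the snappy framing format:
    // chunk type 0xff, 3-byte little-endian length 6, then "sNaPpY".
    private static final byte[] STREAM_IDENTIFIER = {
        (byte) 0xff, 0x06, 0x00, 0x00, 's', 'N', 'a', 'P', 'p', 'Y'
    };

    static boolean isFramingFormat(byte[] head) {
        return head.length >= STREAM_IDENTIFIER.length
                && Arrays.equals(Arrays.copyOf(head, STREAM_IDENTIFIER.length),
                                 STREAM_IDENTIFIER);
    }
}
```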
Is this what hadoop-snappy uses? If so, +1.
@aeroevan
Is this what hadoop-snappy uses?
No. Hadoop-snappy uses its own framing format. I inferred its format by reading BlockCompressorStream.java and BlockDecompressorStream.java, but I'm not sure it is correct because I don't have real data. A data file on disk may have additional leading and trailing parts not covered by those two Java files.
If you send me sample data compressed by hadoop-snappy, I may add a hadoop-snappy uncompressor and compressor.
As for the raw format, I did about half of the work half a year ago but have not completed it...
See: iris.zip
The .snappy file (generated on an HDP 2.3 cluster) and the plain-text .csv are both in the zip.
@aeroevan Thank you.
I suppose the format of hadoop-snappy is as follows, and the contents of iris.zip look like what I suppose:
iris.snappy consists of one block. The block consists of an uncompressed length (4550 in decimal) and one subblock. The subblock consists of a compressed length (1476 in decimal) followed by the raw compressed data. 4550 is the file size of iris.csv, and 4 + 4 + 1476 is the file size of iris.snappy.
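Based on that layout, a decoder would read a 4-byte uncompressed length for each block and then keep reading (4-byte compressed length, compressed bytes) subblocks until the block's uncompressed length has been produced. A minimal sketch, assuming the lengths are big-endian as Hadoop's DataOutputStream writes them and using snappy-java for the raw subblock payloads; the class name and command-line handling are hypothetical:

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.xerial.snappy.Snappy;

// Hypothetical decoder for the block layout inferred above.
public class HadoopSnappyDecode {
    public static void main(String[] args) throws Exception {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
             FileOutputStream out = new FileOutputStream(args[1])) {
            while (true) {
                int blockUncompressedLen;
                try {
                    blockUncompressedLen = in.readInt(); // 4-byte big-endian block length
                } catch (EOFException e) {
                    break; // no more blocks
                }
                int produced = 0;
                while (produced < blockUncompressedLen) {
                    int compressedLen = in.readInt();   // 4-byte big-endian subblock length
                    byte[] compressed = new byte[compressedLen];
                    in.readFully(compressed);
                    byte[] plain = Snappy.uncompress(compressed); // raw snappy, no framing
                    out.write(plain);
                    produced += plain.length;
                }
            }
        }
    }
}
```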
What about a raw format option, maybe with a maximum size limit?
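A size limit would be straightforward for raw data, because a raw snappy payload starts with a varint of its uncompressed length, so a reader can check the claimed size before allocating anything. A small sketch along those lines, again with snappy-java; the cap value and class name are just examples:

```java
import java.io.IOException;
import org.xerial.snappy.Snappy;

// Hypothetical raw-format reader guarded by a maximum output size.
public class RawSnappyWithLimit {
    static final int MAX_UNCOMPRESSED = 64 * 1024 * 1024; // arbitrary example cap

    static byte[] uncompressRaw(byte[] data) throws IOException {
        // The varint header lets us reject oversized input before decompressing.
        int uncompressedLen = Snappy.uncompressedLength(data);
        if (uncompressedLen > MAX_UNCOMPRESSED) {
            throw new IOException("uncompressed size " + uncompressedLen
                    + " exceeds limit " + MAX_UNCOMPRESSED);
        }
        return Snappy.uncompress(data);
    }
}
```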