kubo / snzip

Snzip, a compression/decompression tool based on snappy

Got "Invalid data: snappy::Uncompress failed" when decompressing raw file #24

Closed: zxybazh closed this issue 2 months ago

zxybazh commented 6 years ago

I compressed a raw file with snzip -t raw file, and when I ran snzip -t raw -d file.raw I got the error message "Invalid data: snappy::Uncompress failed".

kubo commented 6 years ago

Could you post more information?

It works for me.

$ ./snzip -t raw INSTALL
$ ./snzip -t raw -d INSTALL.raw 

My environment is:
OS: Linux (Ubuntu 16.04 x86_64)
Test data: INSTALL

zxybazh commented 6 years ago

Hi, I ran the test on Ubuntu 16.04 with an Intel(R) Core(TM) i7-7700 CPU. The test data is right here, part of a TPC-H dataset. Please check, thanks!

kubo commented 6 years ago

Thanks. The file was compressed incorrectly because the data is too big: the maximum size of raw uncompressed data is 4G, according to this information.

There are two choices.

  1. Make snzip -t raw fail when the file size is over 4G (a sketch of this check follows the list).
  2. Split the file data into 4G chunks and create a compressed file containing the concatenated compressed chunks.
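
A minimal sketch of the check in option 1, with a hypothetical helper name (not snzip's actual code; C++ is used here to match the snappy API discussed below): reject raw compression when the input exceeds 2^32 - 1 bytes, the largest value the raw format's uncompressed-length varint can hold.

    #include <cstdint>
    #include <cstdio>
    #include <sys/stat.h>

    // Hypothetical pre-check for option 1: snappy's raw format stores the
    // uncompressed length as a varint capped at 2^32 - 1, so larger inputs
    // cannot be represented and should be rejected up front.
    static bool raw_size_ok(const char *path) {
      struct stat st;
      if (stat(path, &st) != 0) {
        perror(path);
        return false;
      }
      if ((uint64_t)st.st_size > UINT32_MAX) {
        fprintf(stderr, "%s: too large for -t raw (max 4G)\n", path);
        return false;
      }
      return true;
    }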
zxybazh commented 6 years ago

Got it, thanks.

kubo commented 6 years ago
  1. Make snzip -t raw fail when the file size is over 4G.
  2. Split the file data into 4G chunks and create a compressed file containing the concatenated compressed chunks.

The latter is impossible. I can create a file containing concatenated raw compressed data, but I cannot decompress it, because snappy checks whether all input data has been consumed via decompressor->eof(). When two raw compressed streams are concatenated, there is no way to know the boundary between them.
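
A small repro of this point, assuming the C++ snappy library (build with -lsnappy): concatenating two raw streams and passing the result to snappy::Uncompress fails, because after the first stream is decoded the leftover input means decompressor->eof() is false.

    #include <snappy.h>
    #include <cstdio>
    #include <string>

    int main() {
      std::string a, b;
      snappy::Compress("hello, ", 7, &a);
      snappy::Compress("world", 5, &b);

      // Two raw snappy streams back to back; nothing in the data records
      // where the first one ends.
      std::string joined = a + b;

      std::string out;
      // Fails: the first stream decodes fine, but the trailing bytes of the
      // second stream are never consumed, so snappy reports invalid data.
      bool ok = snappy::Uncompress(joined.data(), joined.size(), &out);
      printf("%s\n", ok ? "ok" : "Uncompress failed");
      return 0;
    }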

zxybazh commented 6 years ago

I believe we would have to create a new file format that stores the length information for each split of raw compressed data, so that files over 4G can be split back apart when decompressing.
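
A sketch of the kind of container described here (hypothetical layout, not an existing snzip format): each chunk is prefixed with its compressed length, which is exactly the boundary information that bare raw streams lack.

    #include <snappy.h>
    #include <cstdint>
    #include <fstream>
    #include <string>

    // Hypothetical container: [8-byte compressed length][compressed chunk]...
    // A reader can step from one boundary to the next using the prefixes.
    static void write_chunk(std::ofstream &out, const char *data, size_t len) {
      std::string compressed;
      snappy::Compress(data, len, &compressed);
      uint64_t clen = compressed.size();
      out.write(reinterpret_cast<const char *>(&clen), sizeof(clen));
      out.write(compressed.data(), (std::streamsize)compressed.size());
    }

Length-prefixed chunking like this is essentially what snappy's framing format (snzip's default when -t raw is not given) already does, which is presumably why the reply below calls it reinventing the wheel.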

kubo commented 6 years ago

What merit does the new file format have? I won't reinvent the wheel unless it has explicit merit.

zxybazh commented 6 years ago

Well, you're right. Let's not reinvent the wheel. I just want to make sure that we can find the boundary of every split when decompressing the file. If something already exists for that, even better. For now, you can just make it fail when the file size is over 4G.