kubo / snzip

Snzip, a compression/decompression tool based on snappy

Hadoop is unable to decompress #19

Closed · naoko closed this issue 7 years ago

naoko commented 7 years ago

Hello! With snzip -t hadoop-snappy <file_to_compress> I can compress, and with snzip -d <snappy_file> I can decompress just fine. But after I moved the file to the hadoop cluster and ran hadoop fs -text <snappy_file>, I got the following error. I am not sure where to go from here and would appreciate your advice.

17/01/18 14:17:47 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:123)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:98)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
    at java.io.InputStream.read(InputStream.java:101)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
    at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:106)
    at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:101)
    at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
    at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
    at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
    at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
    at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
    at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
    at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)

I was able to run hadoop fs -text <much-bigger-snappy> on a much bigger file with no problem, so the memory error seems misleading... Please let me know if there is anything else I can provide.

kubo commented 7 years ago

IMO, the contents of the snappy file are not in the format I expect. The OutOfMemoryError was raised at this line: Java tried to allocate a buffer whose size was len, the compressed length of one chunk of raw compressed data. That length is usually 256k at most, but here it was so large that the allocation failed.

I can guess two possibilities: (1) the snappy format used in Hadoop is different from what I expected, or (2) the file in Hadoop is not actually a snappy file but merely has a .snappy suffix.
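
For what it's worth, a quick way to check (2) from outside Hadoop is to dump the first eight bytes of the file. This is only a rough sanity check and assumes the hadoop-snappy framing described above: a 4-byte big-endian uncompressed block length followed by a 4-byte big-endian compressed chunk length.

    # rough sanity check only; <snappy_file> is the file that fails in hadoop fs -text
    $ xxd -l 8 <snappy_file>

If the second 32-bit value is huge (far beyond the ~256k mentioned above), the file is either corrupted or not in hadoop-snappy format.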

naoko commented 7 years ago

Thank you for your response, kubo. I double-checked by downloading the .snappy file and was able to uncompress it with the snzip command, so the remaining possibility is the length. Does that mean I should find the value of io.compression.codec.snappy.buffersize and then run the command with the -b flag? Did I understand correctly?

kubo commented 7 years ago

-b won't fix this case. If -b is too large, hadoop prints Could not decompress data. Buffer length is too small. I had not used hadoop before, so I set up hadoop hdfs today and checked whether files compressed by snzip could be retrieved via hadoop fs -text. As far as I checked, it worked.
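
For reference, you can check the Hadoop-side buffer size (the io.compression.codec.snappy.buffersize you mentioned) with hdfs getconf. This is just a sketch assuming a reasonably recent Hadoop 2.x; the 262144 (256 KB) shown is the usual default, not a value checked on your cluster.

    # assumes Hadoop 2.x; prints the effective snappy codec buffer size
    $ hdfs getconf -confKey io.compression.codec.snappy.buffersize
    262144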

  1. Could you run the following commands?

    $ echo Hello World > hello.txt
    $ snzip -t hadoop-snappy hello.txt 
    $ hadoop fs -put hello.txt.snappy
    $ hadoop fs -text hello.txt.snappy
    17/01/29 17:58:31 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
    Hello World
  2. What OS do you use?

  3. What version of hadoop do you use?

    $ hadoop version
    Hadoop 2.7.3
    Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff
    Compiled by root on 2016-08-18T01:41Z
    Compiled with protoc 2.5.0
    From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
    This command was run using /home/kubo/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar
  4. What version of snappy does hadoop use? If you use Linux,

    $ strace -f -o strace.log -e trace=open hadoop fs -text <snappy_file>
    $ grep libsnappy strace.log | grep -v '= -1'

    If the output is 29484 open("/usr/lib/x86_64-linux-gnu/libsnappy.so.1", O_RDONLY|O_CLOEXEC) = 199, the snappy library used by hadoop is /usr/lib/x86_64-linux-gnu/libsnappy.so.1.

    $ ls -l /usr/lib/x86_64-linux-gnu/libsnappy.so.1
    lrwxrwxrwx 1 root root 18 Oct  6  2015 /usr/lib/x86_64-linux-gnu/libsnappy.so.1 -> libsnappy.so.1.3.0

    The snappy version is 1.3.0 because the real file name is libsnappy.so.1.3.0.

  5. What version of snzip do you use?

    $ snzip -h
    snzip 1.0.4
    
     Usage: snzip [option ...] [file ...]
    
    ...
  6. What version of snappy does snzip use? If you use Linux,

    $ env LD_TRACE_LOADED_OBJECTS=1 snzip
        linux-vdso.so.1 =>  (0x00007fff1f95d000)
        libsnappy.so.1 => /usr/lib/x86_64-linux-gnu/libsnappy.so.1 (0x00007fdf4d170000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fdf4cdc9000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fdf4cbb3000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fdf4c7ea000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fdf4c4e1000)
        /lib64/ld-linux-x86-64.so.2 (0x000055fa78904000)

    snzip uses /usr/lib/x86_64-linux-gnu/libsnappy.so.1.

    $ ls -l /usr/lib/x86_64-linux-gnu/libsnappy.so.1
    lrwxrwxrwx 1 root root 18 Oct  6  2015 /usr/lib/x86_64-linux-gnu/libsnappy.so.1 -> libsnappy.so.1.3.0

    The snappy version is 1.3.0 because the real file name is libsnappy.so.1.3.0.

naoko commented 7 years ago

@kubo , thank you very much for taking the time. I followed your instructions and was able to uncompress with no issue. So I scratched my head and went back to my original file, and this time there was no error... it uncompressed just fine. I am utterly confused and feel so bad and ashamed :( I'm terribly sorry for this ticket, and thank you again for your time. If I ever find out why it works now, I will report back. I am closing this ticket now. Thank you very much for providing a great library.

kubo commented 7 years ago

No problem. It was a good chance for me to try installing hadoop.

JasonWiki commented 7 years ago

snzip -t hadoop-snappy hello.txt

Nice !!!