net.jpountz.lz4.LZ4Exception: Malformed input

pavanprao commented 4 years ago

Hello,

I'm trying to decompress using "LZ4DecompressorWithLength.getDecompressedLength(input)" which is providing a huge byte size even for 1000 bytes of data. For e.g. If the Input Size is 2008 bytes then the Decompressed Length is 1575813496 bytes.

This is leading to OutOfMemoryError - Java Heap Space problem and the server is crashing. Could you please suggest how to approach this problem...

FYI: The decompression is happening inside a multi-threaded code.

LZ4 Library Used: 1.7.0 Java Version: 1.7

Thanks.

odaira commented 4 years ago

Is your input data compressed by LZ4CompressorWithLength? LZ4DecompressorWithLength is supposed to be used for data compressed by LZ4CompressorWithLength, NOT by LZ4Compressor.

pavanprao commented 4 years ago

Let me just give you a history so that there will be more clarity. Initially we were using compression through Java ZIP libraries and most of the data is compressed using that. Now we are switching to LZ4 for a better performance. So the input for the first time for existing data will be a Java ZIP compressed data.

However I faced the same problem even when the data is compressed through LZ4.

And for your question, I'm using LZ4CompressorWithLength for the compression logic as I switched decompression code with LZ4DecompressorWithLength. But the issue still remains.

Here is a code snippet for both compression and decompression which I have implemented.

Compression Code: `byte[] input = message.getBytes(); if (input == null || input.length <= 1) { return null; } logger.info("Input Data Size: " + input.length);

        // -- Starting the LZ4 Compression.
        LZ4Factory factory = LZ4Factory.safeInstance();
        LZ4Compressor compressor = factory.fastCompressor();
        LZ4CompressorWithLength compressorWrapper = new LZ4CompressorWithLength(compressor);
        byte[] compressed = new byte[compressorWrapper.maxCompressedLength(input.length)];

        long startTime = System.currentTimeMillis();
        compressed = compressorWrapper.compress(input);
        long endTime = System.currentTimeMillis();
        logger.info("Time Taken for LZ4 Compression: " + (endTime - startTime));
        logger.info("Compressed Bytes: " + compressed.length);

        return new String(Base64.encode(compressed));`

Decompression Code: `byte[] input = Base64.decode(message.toCharArray()); if (input == null || input.length <= 1) { return null; } logger.info("Input Data Size: " + input.length);

        // -- Starting the LZ4 DeCompression.
        LZ4Factory factory = LZ4Factory.safeInstance();
        LZ4FastDecompressor decompressor = factory.fastDecompressor();
        LZ4DecompressorWithLength decompressorWrapper = new LZ4DecompressorWithLength(decompressor);
        byte[] decompressed = new byte[LZ4DecompressorWithLength.getDecompressedLength(input)];
        logger.info("Decompressed Length: " + decompressed.length);

        long startTime = System.currentTimeMillis();
        int decompressedBytes = decompressorWrapper.decompress(input, 0, decompressed, 0);
        long endTime = System.currentTimeMillis();
        logger.info("Time Taken for Decompression: " + (endTime - startTime));
        logger.info("Decompressed Bytes: " + decompressedBytes);

        return new String(decompressed);`

odaira commented 4 years ago

Hmm, I couldn't reproduce the problem with the following test, using lz4-java 1.7.0 on Java 8. I made these minor changes to your code:

I used java.util.Base64, but it looks like you are using your own implementation of Base64. Could you check the base64-decoded bytes byte[] input = Base64.decode(message.toCharArray()) are exactly the same as the bytes before base64-encoded Base64.encode(compressed)?
I don't allocate a byte array byte[] compressed = new byte[compressorWrapper.maxCompressedLength(input.length)]; because the compress(byte[]) method returns a compressed byte array.

import java.io.InputStream;
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import net.jpountz.lz4.*;
import java.util.Base64;

public class TestLZ4WithLength {

    private static String convertStreamToString(InputStream is) throws IOException {
    BufferedInputStream bis = new BufferedInputStream(is);
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    int result;
    while ((result = bis.read()) != -1) {
        buf.write((byte)result);
    }
    return buf.toString("UTF-8");
    }

    private static String encode(String message) {
    byte[] input = message.getBytes();
    if (input == null || input.length <= 1) {
        return null;
    }
    System.out.println("Input Data Size: " + input.length);

    // -- Starting the LZ4 Compression.
    LZ4Factory factory = LZ4Factory.safeInstance();
    LZ4Compressor compressor = factory.fastCompressor();
    LZ4CompressorWithLength compressorWrapper = new LZ4CompressorWithLength(compressor);
    byte[] compressed;

    long startTime = System.currentTimeMillis();
    compressed = compressorWrapper.compress(input);
    long endTime = System.currentTimeMillis();
    System.out.println("Time Taken for LZ4 Compression: " + (endTime - startTime));
    System.out.println("Compressed Bytes: " + compressed.length);

    return new String(Base64.getEncoder().encode(compressed));
    }

    private static String decode(String message) throws UnsupportedEncodingException {
    byte[] input = Base64.getDecoder().decode(message);
    if (input == null || input.length <= 1) {
        return null;
    }
    System.out.println("Input Data Size: " + input.length);

    // -- Starting the LZ4 DeCompression.
    LZ4Factory factory = LZ4Factory.safeInstance();
    LZ4FastDecompressor decompressor = factory.fastDecompressor();
    LZ4DecompressorWithLength decompressorWrapper = new LZ4DecompressorWithLength(decompressor);
    byte[] decompressed = new byte[LZ4DecompressorWithLength.getDecompressedLength(input)];
    System.out.println("Decompressed Length: " + decompressed.length);

    long startTime = System.currentTimeMillis();
    int decompressedBytes = decompressorWrapper.decompress(input, 0, decompressed, 0);
    long endTime = System.currentTimeMillis();
    System.out.println("Time Taken for Decompression: " + (endTime - startTime));
    System.out.println("Decompressed Bytes: " + decompressedBytes);

    return new String(decompressed, 0, decompressed.length, "UTF-8");
    }

    public static void main(String args[]) throws IOException {
    String input = convertStreamToString(System.in);
    String decodedStr = decode(encode(input));
    System.out.println("Same string? " + input.equals(decodedStr));
    }
}

Could you run this test on your environment to see if it causes the problem?

$ javac -cp lz4-java-1.7.0.jar:. TestLZ4WithLength.java
$ java -cp lz4-java-1.7.0.jar:. TestLZ4WithLength < /path/to/your/message/file

pavanprao commented 4 years ago

As we are using Java 1.7 we will not be able to use java.util.Base64 which is added in Java 1.8. But I tried the changes using org.apache.commons.codec.binary.Base64 which is also giving the same error.

Also the java.lang.OutOfMemoryError: Java heap space is still occurring which is causing the server to hang.

I have attached the logs for today's run with the modified code. Please do let me know if any further changes need to be done.

Thanks. lz4test_logs.txt

odaira commented 4 years ago

Let me run it on Java 7 sometime this week.

odaira commented 4 years ago

I ran the following code with lz4-java-1.7.0.jar and commons-codec-1.14.jar on Java 7, but couldn't reproduce the problem.

import java.io.InputStream;
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import org.apache.commons.codec.binary.Base64;
import net.jpountz.lz4.*;

public class TestLZ4WithLength {

    private static String convertStreamToString(InputStream is) throws IOException {
    BufferedInputStream bis = new BufferedInputStream(is);
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    int result;
    while ((result = bis.read()) != -1) {
        buf.write((byte)result);
    }
    return buf.toString("UTF-8");
    }

    private static String encode(String message) {
    byte[] input = message.getBytes();
    if (input == null || input.length <= 1) {
        return null;
    }
    System.out.println("Input Data Size: " + input.length);

    // -- Starting the LZ4 Compression.
    LZ4Factory factory = LZ4Factory.safeInstance();
    LZ4Compressor compressor = factory.fastCompressor();
    LZ4CompressorWithLength compressorWrapper = new LZ4CompressorWithLength(compressor);
    byte[] compressed;

    long startTime = System.currentTimeMillis();
    compressed = compressorWrapper.compress(input);
    long endTime = System.currentTimeMillis();
    System.out.println("Time Taken for LZ4 Compression: " + (endTime - startTime));
    System.out.println("Compressed Bytes: " + compressed.length);

    return new String(Base64.encodeBase64(compressed));
    }

    private static String decode(String message) throws UnsupportedEncodingException {
    byte[] input = Base64.decodeBase64(message);
    if (input == null || input.length <= 1) {
        return null;
    }
    System.out.println("Input Data Size: " + input.length);

    // -- Starting the LZ4 DeCompression.
    LZ4Factory factory = LZ4Factory.safeInstance();
    LZ4FastDecompressor decompressor = factory.fastDecompressor();
    LZ4DecompressorWithLength decompressorWrapper = new LZ4DecompressorWithLength(decompressor);
    byte[] decompressed = new byte[LZ4DecompressorWithLength.getDecompressedLength(input)];
    System.out.println("Decompressed Length: " + decompressed.length);

    long startTime = System.currentTimeMillis();
    int decompressedBytes = decompressorWrapper.decompress(input, 0, decompressed, 0);
    long endTime = System.currentTimeMillis();
    System.out.println("Time Taken for Decompression: " + (endTime - startTime));
    System.out.println("Decompressed Bytes: " + decompressedBytes);

    return new String(decompressed, 0, decompressed.length, "UTF-8");
    }

    public static void main(String args[]) throws IOException {
    String input = convertStreamToString(System.in);
    String decodedStr = decode(encode(input));
    System.out.println("Same string? " + input.equals(decodedStr));
    }
}

$ javac -cp lz4-java-1.7.0.jar:commons-codec-1.14.jar:. TestLZ4WithLength.java
$ java -cp lz4-java-1.7.0.jar:commons-codec-1.14.jar:. TestLZ4WithLength < tmpdata.txt
Input Data Size: 111261
Time Taken for LZ4 Compression: 14
Compressed Bytes: 58065
Input Data Size: 58065
Decompressed Length: 111261
Time Taken for Decompression: 3
Decompressed Bytes: 58065
Same string? true

Please run this code in your environment to see if the problem occurs, or please create a stand-alone program that reproduces the problem. Your logs do not help much because I have no idea about the internals of your program.

pavanprao commented 4 years ago

Sure Rei.

As suggested I tried running the above program with normal input data which works as expected. When I tried to run using the data which is encoded, we are facing the problem.

The LZ4DecompressorWithLength is providing a bigger value which is causing Java Heap Space error. Also I have upgraded to use Java 8 so currently using java.util.Base64 for encoding and decoding purposes. So no need to run in Java 7.

The encoded data which is being used is attached below (Sample_4.txt and Sample_5.txt). Also attached the logs generated from the standalone program.

Hope this is useful. Please do let me know if anything else is required.

Thanks. Happy New Year.

LOGS:

`File 5 Data Size Before: 2280 Data Size After Decode: 1709

LZ4 Decompression >>> Decompressed Length: 1468914296 Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at net.jpountz.lz4.LZ4FastDecompressor.decompress(LZ4FastDecompressor.java:106) at net.jpountz.lz4.LZ4DecompressorWithLength.decompress(LZ4DecompressorWithLength.java:134) at net.jpountz.lz4.LZ4DecompressorWithLength.decompress(LZ4DecompressorWithLength.java:118) at com.wibmo.trident.compress.util.App.compressByLZ4(App.java:83) at com.wibmo.trident.compress.util.App.compress(App.java:49) at com.wibmo.trident.compress.util.App.main(App.java:28)`

Sample_4.txt Sample_5.txt

pavanprao commented 4 years ago

Attaching the standalone program also for reference. Thanks.

private void compressByLZ4(byte[] fileBytes) throws IOException {
        LZ4Factory factory = LZ4Factory.fastestInstance();

        byte[] input = Base64.getDecoder().decode(fileBytes);

        System.out.println("Data Size After Decode: " + input.length);
        System.out.println("\n>>> LZ4 Decompression >>>");
        LZ4FastDecompressor decompressor = factory.fastDecompressor();
        LZ4DecompressorWithLength decompressorWrapper = new LZ4DecompressorWithLength(decompressor);
        int maxDecompressedLength = LZ4DecompressorWithLength.getDecompressedLength(input);
        byte[] decompressed = new byte[maxDecompressedLength];
        System.out.println("Decompressed Length: " + decompressed.length);

        long startTime2 = System.currentTimeMillis();
        decompressed = decompressorWrapper.decompress(input);
        long endTime2 = System.currentTimeMillis();
        System.out.println("Time Taken for Decompression: " + (endTime2 - startTime2));

        System.out.println("Data Bytes read for Decompression: " + maxDecompressedLength);
        System.out.println("Data Size After Decompression: " + decompressed.length);
    }

odaira commented 4 years ago

Thanks, @pavanprao

I reproduced the same OutOfMemoryError in my environment, with Sample_4.txt and Sample_5.txt. Most likely, Sample_4.txt and Sample_5.txt are corrupt and are not in a valid LZ4 block format with length.

How did you generate Sample_4.txt and Sample_5.txt? Can I have a stand-alone program to generate these samples from the original uncompressed data?
The first four bytes of Sample_4.txt and Sample_5.txt after Base64-decoded are the same 0x578dda78. Are the original uncompressed data of Sample_4.txt and Sample_5.txt the same length?
Are the original uncompressed data of Sample_4.txt and Sample_5.txt ASCII texts? If LZ4 compresses an ASCII text, the compressed output bytes still tend to contain many ASCII characters as literals. However, when I look at Sample_4.txt and Sample_5.txt after Base64-decoded, they do not contain many ASCII characters.

odaira commented 4 years ago

Would you need any help on this?

pavanprao commented 4 years ago

Yes, we are still facing the same error. I will be able to provide the encoding logic used for the sample data which was provided last time.

pavanprao commented 4 years ago

The Sample_4.txt and Sample_5.txt are compressed using JAVA ZIP Deflater package. When I am using JAVA ZIP Inflater package on the same files, it is working and able to decompress.

But when the same input is provided to the LZ4DecompressorWithLength.getDecompressedLength(input) it will provide a unrealistic value such as 1451088504 bytes.

This is causing JVM to allocate these many bytes for memory and subsequently leading to OOM error.

odaira commented 4 years ago

LZ4 and ZIP are not compatible with each other. They are completely different compression formats.

pavanprao commented 4 years ago

Agreed. It is acceptable that an exception is thrown. But it should not give OOM error while finding the decompressed length. This is what is causing the issue.

odaira commented 4 years ago

It is not lz4-java but the application that allocates the decompression buffer, correct? All lz4-java can do is just to return the recorded decompressed length, and it is up to the application whether to trust the decompressed length or not. If the input data cannot be trusted, it would always be a good practice to check every input (in this case, check the decompressed length to make sure it is not too big compared with your heap size). This is not specific to lz4-java but is a general approach to handle a potentially corrupted input.

pavanprao commented 4 years ago

Understood. But I feel it is not a good practice to keep a static number and verify the decompressed length which may vary based on the input. If at all we keep such a value, what would be the value we need to specify. Because I notice the length varies based on the input.

Lets say if the input (valid input - LZ4 format) is quite large and then the decompressed length would be something similar to 1451088504 bytes, which by the way would be a proper decompressed length. If we kept a static numeric value to check the length, the check will discard the above value as incorrect even though it is a proper value.

Moreover, based on the below code the "decompress" method is allocating the memory isn't it?


LZ4Factory factory = LZ4Factory.safeInstance();
            LZ4FastDecompressor decompressor = factory.fastDecompressor();
            LZ4DecompressorWithLength decompressorWrapper = new LZ4DecompressorWithLength(decompressor);
            int maxDecompressedLength = LZ4DecompressorWithLength.getDecompressedLength(input);
            logger.info("Max Decompressed Length: " + maxDecompressedLength);

            long startTime = System.currentTimeMillis();
            byte[] decompressed = decompressorWrapper.decompress(input);
            long endTime = System.currentTimeMillis();
            logger.info("Time Taken for Decompression: " + (endTime - startTime));
            logger.info("Decompressed Bytes: " + decompressed.length);

odaira commented 4 years ago

based on the below code the "decompress" method is allocating the memory isn't it?

Oh yes, if you use decompress(byte[]), then lz4-java internally allocates the buffer. But as noted in the API document, it is a convenience method with a warning that it allocates a buffer. I think what we can do would be to add a stronger warning that do not use this method for an untrusted input and use decompress(byte[], byte[]) instead.

By the way, if the application could not determine the right value of the threshold (whether it is static or dynamic), how could lz4-java do better? Do you think the convenience method should not throw an OOM error? You cannot figure out whether allocating a new buffer will cause OOM or not until you actually allocate it. The convenience method could catch the OOM error and could throw another error, something like LZ4Exception. But is it different from just throwing the OOM error?

pavanprao commented 4 years ago

As we are not aware of the decompressed length we are relying on the LZ4 to calculate the length and then perform decompression.

It would be better if LZ4 can determine if the input is not in LZ4 format and not allocate the buffer but throw an LZ4Exception or InvalidFormatException which would prevent from causing a OOM error which is harmful to the performance of a system when subjected to a production environment.

Also do let me know when you say that the convenience method allocates the buffer with a warning in this case an OOM error would it hamper or impact the system performance? Meaning whether it will allocate that much memory and cause the system to crash when there are multiple such requests happening in parallel.

pavanprao commented 4 years ago

Any updates on the queries?

Also we found that we are not facing this error when we are using Snappy compression. But we wanted to use LZ4 as it is having better performance and higher compression ratio when compared to Snappy.

Thanks.

odaira commented 4 years ago

I would program like this if I have to usedecompress(byte[]) for an untrusted input:

LZ4Factory factory = LZ4Factory.safeInstance();
LZ4FastDecompressor decompressor = factory.fastDecompressor();
LZ4DecompressorWithLength decompressorWrapper = new LZ4DecompressorWithLength(decompressor);
byte[] decompressed = null;
try {
  decompressed = decompressorWrapper.decompress(input);
} catch (LZ4Exception ex) {
  System.err.println("Malformed input");
  return;
} catch (OutOfMemoryError err) {
  System.err.println("Decompressed length too large");
  return;
}

Is there any problem with this approach?

Also do let me know when you say that the convenience method allocates the buffer with a warning in this case an OOM error would it hamper or impact the system performance? Meaning whether it will allocate that much memory and cause the system to crash when there are multiple such requests happening in parallel.

I didn't get what you meant here, but an OOM error itself won't crash the system, right? When the JVM finds it is not possible to allocate an object with the specified size, it won't allocate anything and will throw an OOM error. You can try-catch the OOM error and can continue the execution without the object if you want to, as shown in the example code above.

odaira commented 4 years ago

Also we found that we are not facing this error when we are using Snappy compression.

Do you mean you can successfully uncompress a ZIP input with Snappy?

lz4 / lz4-java

net.jpountz.lz4.LZ4Exception: Malformed input #155