lz4 / lz4-java

LZ4 compression for Java
Apache License 2.0

LZ4BlockInputStream cannot read two consecutive write-close operations from two different LZ4BlockOutputStream #48

Closed schedin closed 6 years ago

schedin commented 10 years ago

How to reproduce: Run the test case testWriteCloseWriteCloseRead(). In pseudocode:

  1. Write some data to a file with an LZ4BlockOutputStream and close the stream.
  2. Write some more data to the same file with a new LZ4BlockOutputStream and close the stream.
  3. Read the combined data with a single instance of LZ4BlockInputStream.
  /**
   * Write and close two stream instances to the same file. Read the entire data with one
   * LZ4BlockInputStream.
   */
  @Test
  public void testWriteCloseWriteCloseRead() throws IOException {
    final byte[] testBytes = "Testing!".getBytes(Charset.forName("UTF-8"));

    //Write the first time
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    LZ4BlockOutputStream out = new LZ4BlockOutputStream(bytes);
    out.write(testBytes);
    out.close();

    //Write the second time
    out = new LZ4BlockOutputStream(bytes);
    out.write(testBytes);
    out.close();

    ByteArrayInputStream in = new ByteArrayInputStream(bytes.toByteArray());
    LZ4BlockInputStream lz4In = new LZ4BlockInputStream(in);
    DataInputStream dataIn = new DataInputStream(lz4In);

    byte[] buffer = new byte[testBytes.length];
    dataIn.readFully(buffer);
    assertArrayEquals(testBytes, buffer);

//    in.skip(LZ4BlockOutputStream.HEADER_LENGTH); // This test case passes only if the 21 bytes of the footer are skipped here

    buffer = new byte[testBytes.length];
    dataIn.readFully(buffer);
    assertArrayEquals(testBytes, buffer);
  }

Actual: A java.io.EOFException is thrown.

Expected: The combined data should be read and returned.

Analysis: The LZ4BlockOutputStream will write a header, data and a footer. The footer is very similar to the header. Two LZ4BlockOutputStreams will create this:

Header | Compressed Data | Footer | Header | Compressed Data | Footer

One instance of LZ4BlockInputStream will read the header and the compressed data. If the user tries to read more data, it will try to read a header again. But since it has not skipped the previous footer, it will read the footer instead. The footer, although similar to the header, contains a length of 0, so the read() method returns -1 and the DataInputStream throws an EOFException.

If the user manually skips 21 bytes (the length of the header/footer), the LZ4BlockInputStream will happily continue to read another “frame” (see the commented-out line in the test case).

Workaround: The user can manually call in.skip(21).

Suggested fix: I think it would be appropriate for an LZ4BlockInputStream to consume all bytes related to one frame; that is, the footer should be consumed when the end of the frame has been reached.

I’m guessing the solution might be a bit trickier because the footer is related to the frame and the header to the block? (I’m probably using the terms “block” and “frame” incorrectly.)
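To make the suggested behaviour concrete, here is a minimal sketch using a deliberately simplified toy format, not the real lz4-java block layout and not the change made in the eventual fix. It assumes each block is a 4-byte length followed by the payload, and that a frame ends with a zero-length block standing in for the footer. The class name FrameModel and its methods are hypothetical. A reader that stops at the first zero-length marker sees only the first frame, while one that consumes the marker and keeps going while bytes remain sees both.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical toy model of "header | data | footer" frames; NOT the lz4-java format.
final class FrameModel {

    // Appends one frame: [int length][payload] followed by a zero-length end marker.
    static void writeFrame(ByteArrayOutputStream sink, byte[] payload) throws IOException {
        DataOutputStream out = new DataOutputStream(sink);
        out.writeInt(payload.length);
        out.write(payload);
        out.writeInt(0); // end-of-frame marker, playing the role of the footer
        out.flush();
    }

    // Reads payloads as text. With continueAcrossFrames == false the reader treats the
    // first end marker as end of stream; with true it consumes the marker and moves on.
    static String readAll(byte[] data, boolean continueAcrossFrames) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        StringBuilder result = new StringBuilder();
        while (in.available() > 0) {
            int length = in.readInt();
            if (length == 0) {
                if (continueAcrossFrames) {
                    continue; // marker consumed, look for another frame
                }
                break; // behaves like the reported problem: stop at the first footer
            }
            byte[] payload = new byte[length];
            in.readFully(payload);
            result.append(new String(payload, StandardCharsets.UTF_8));
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] testBytes = "Testing!".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writeFrame(bytes, testBytes);
        writeFrame(bytes, testBytes);

        System.out.println(readAll(bytes.toByteArray(), false)); // prints "Testing!"
        System.out.println(readAll(bytes.toByteArray(), true));  // prints "Testing!Testing!"
    }
}
```

The design choice illustrated is only where the end marker gets consumed: inside the reader at the end of a frame, rather than by the caller via a manual skip.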

Another approach would be to just say that this should not be possible. But this “feature” works with a normal GZIPOutputStream/GZIPInputStream, so it would be good if it also worked with LZ4.
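For comparison, the GZIP behaviour mentioned above can be checked with the JDK alone. The sketch below mirrors the test case but uses java.util.zip: two GZIPOutputStream instances are written and closed against the same ByteArrayOutputStream, and a single GZIPInputStream reads the concatenated result. The class and method names (GzipConcatDemo, appendGzip, readAllGzip) are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustration only: write-close twice with GZIP, then read everything with one stream.
final class GzipConcatDemo {

    // Writes data as one complete GZIP member and closes the compressing stream.
    static void appendGzip(ByteArrayOutputStream sink, byte[] data) throws IOException {
        GZIPOutputStream out = new GZIPOutputStream(sink);
        out.write(data);
        out.close(); // closing a ByteArrayOutputStream has no effect, so sink stays usable
    }

    // Decompresses until end of input using a single GZIPInputStream.
    static byte[] readAllGzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[256];
            int n;
            while ((n = in.read(buffer)) != -1) {
                result.write(buffer, 0, n);
            }
        }
        return result.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] testBytes = "Testing!".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        appendGzip(bytes, testBytes); // first write-close
        appendGzip(bytes, testBytes); // second write-close

        byte[] all = readAllGzip(bytes.toByteArray());
        System.out.println(new String(all, StandardCharsets.UTF_8));
    }
}
```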

odaira commented 6 years ago

Fixed by #105. Thanks much for your suggestion!