lz4 / lz4-java

LZ4 compression for Java
Apache License 2.0
1.1k stars 253 forks source link

When reading from LZ4BlockInputStream using read(buffer) that refill mechanism doesnt work properly #154

Closed tomerz90 closed 4 years ago

tomerz90 commented 4 years ago

When you read X bytes from the LZ4BlockInputStream (read(new byte[X])) but there is only X -Y bytes left in the block, it will return X - Y bytes and will refill, instead (what I believe is the correct behaviour) of filling the output buffer with that X -Y bytes, refilling and adding the missing Y bytes from the new block we just refilled. Current code:

public int read(byte[] b, int off, int len) throws IOException {
    SafeUtils.checkRange(b, off, len);
    if (finished) {
      return -1;
    }
    if (o == originalLen) {
      refill();
    }
    if (finished) {
      return -1;
    }
    len = Math.min(len, originalLen - o);
    System.arraycopy(buffer, o, b, off, len);
    o += len;
    return len;
  }

Possible fix (note this is just a "quick and dirty" to illustrate the fix):

public int read(byte[] b, int off, int len) throws IOException {
    SafeUtils.checkRange(b, off, len);
    if (finished) {
      return -1;
    }
    int remainingInBuff = originalLen - o;
    if (remainingInBuff >= len) {
      System.arraycopy(buffer, o, b, off, len);
      o += len;
    } else {
      System.arraycopy(buffer, o, b, off, remainingInBuff);
      off += remainingInBuff;
      refill();
      return remainingInBuff + read(b, off, len - remainingInBuff);
    }
    return len;
  }
odaira commented 4 years ago

Well, this is not a "bug" because it is not guaranteed that the read method fills the entire buffer, even when it is possible. Users of the read method must always check its return value to see how many bytes were actually read. The only thing guaranteed is that the read method will read at least one byte if any data is available.

And it is usually a good strategy to delay refill() because it is quite heavyweight. In what use case have you seen a problem with the current implementation?

tomerz90 commented 4 years ago

What I suggested still delays refill() until we must. The use case for me is when I read something of know size, for example having a files of integers written in binary format, so I want to read 4 bytes every time

odaira commented 4 years ago

Any library should be optimized for the correct ways of using its API and should not be optimized for incorrect use cases. Your way of using the read method is not correct. You must always check how many bytes were actually read. You may want to wrap LZ4BlockInputStream with DataInputStream and to use its readFully method.

odaira commented 4 years ago

Please reopen if you need further assistance.