lz4 / lz4-java

LZ4 compression for Java
Apache License 2.0

Add Content Size information exposition to LZ4FrameInputStream #144

Closed BastienF closed 4 years ago

BastienF commented 5 years ago

This PR allows users to retrieve the Content Size information before decompressing a file. This is especially useful when working with the AWS S3 filesystem: the S3 streaming API requires the file size before it starts writing, so access to the Content Size up front is mandatory to avoid loading the whole file into memory.

As a workaround, we are currently forced to read the Content Size header bytes directly from the lz4 file byte array.
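For illustration, a minimal sketch of that kind of workaround, assuming the standard LZ4 frame layout (4-byte little-endian magic number 0x184D2204, then the FLG and BD bytes, then an optional 8-byte little-endian Content Size when bit 3 of FLG is set). The class and method names are hypothetical and are not part of lz4-java; the header checksum is not verified and skippable frames are not handled.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.OptionalLong;

/** Hypothetical helper for illustration only; not part of lz4-java. */
public final class Lz4FrameHeaderPeek {
    private static final int MAGIC = 0x184D2204;      // LZ4 frame magic number
    private static final int FLG_CONTENT_SIZE = 0x08; // bit 3 of the FLG byte

    private Lz4FrameHeaderPeek() {}

    /**
     * Returns the Content Size declared in the frame header,
     * or an empty value if the frame does not declare one.
     */
    public static OptionalLong contentSize(byte[] frame) {
        // Magic (4 bytes) + FLG (1 byte) + BD (1 byte) are always present.
        if (frame.length < 6) {
            throw new IllegalArgumentException("Too short for an LZ4 frame header");
        }
        ByteBuffer buf = ByteBuffer.wrap(frame).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt() != MAGIC) {
            throw new IllegalArgumentException("Not an LZ4 frame");
        }
        int flg = buf.get() & 0xFF;
        buf.get(); // skip the BD byte (maximum block size)
        if ((flg & FLG_CONTENT_SIZE) == 0) {
            return OptionalLong.empty();
        }
        if (buf.remaining() < 8) {
            throw new IllegalArgumentException("Truncated Content Size field");
        }
        // The Content Size field directly follows BD, before any Dictionary ID.
        return OptionalLong.of(buf.getLong());
    }
}
```

Only the first few bytes of the object are needed, so the caller can fetch a small prefix of the file instead of the whole thing.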

BastienF commented 5 years ago

Travis build is unstable but the PR is OK (Cf. https://travis-ci.org/lz4/lz4-java/builds/550391117)

odaira commented 4 years ago

@BastienF Thanks for your contribution. I basically agree with your proposal, but I would like to apply some changes, so I made a PR to your branch. Could you review and merge my PR?

https://github.com/BastienF/lz4-java/pull/1

The important change is to introduce a readSingleFrame parameter to LZ4FrameInputStream's constructors so that the instance reads only the first frame from the stream. getExpectedContentSize() is valid only when readSingleFrame is true. This is because when LZ4FrameInputStream reads multiple concatenated frames, a user of LZ4FrameInputStream does not know where a new frame begins, so a call to getExpectedContentSize() does not make sense.

I guess in your use case an lz4-compressed file consists of only one frame, so I expect this change makes sense.

odaira commented 4 years ago

@BastienF I am thinking of deferring nextFrameInfo() until an actual read() is called, as proposed in #146. This change requires adding throws IOException to getExpectedContentSize() and isExpectedContentSizeDefined(). I would appreciate it if you could let me know if you have any concerns. I should have added throws IOException when we discussed the API, because these methods depend on contents read through I/O, so in principle they can throw IOException at any time.
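To make the reasoning concrete, here is a rough sketch of the deferred-header pattern. It is not the lz4-java implementation: LazyHeaderStream is a hypothetical class, and its "header" is a toy 8-byte big-endian length rather than a real LZ4 frame descriptor. The point is only that once the header read is moved out of the constructor, any accessor that depends on the header may be the first to touch the stream and so has to declare throws IOException.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical illustration of lazy header parsing; not lz4-java code. */
public class LazyHeaderStream {
    private final DataInputStream in;
    private boolean headerRead = false;
    private long expectedContentSize = -1L;

    public LazyHeaderStream(InputStream in) {
        // No I/O happens here, so the constructor cannot fail with IOException.
        this.in = new DataInputStream(in);
    }

    /** Reads the toy header on first use; later calls are no-ops. */
    private void ensureHeaderRead() throws IOException {
        if (headerRead) {
            return;
        }
        // Toy header (assumption): an 8-byte big-endian content size.
        expectedContentSize = in.readLong();
        headerRead = true;
    }

    /** May trigger the first read of the stream, hence throws IOException. */
    public long getExpectedContentSize() throws IOException {
        ensureHeaderRead();
        return expectedContentSize;
    }

    /** Returns the next payload byte, or -1 at end of stream. */
    public int read() throws IOException {
        ensureHeaderRead();
        return in.read();
    }
}
```

With this shape, constructing the stream over an empty or broken source succeeds, and the failure surfaces on the first call to getExpectedContentSize() or read().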

BastienF commented 3 years ago

@odaira Sorry for the late response, no problem for me. Thanks for the concern.