iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

Chunked body parser may read over end of chunk if destination buffer has higher capacity #34

Closed sebastian-nagel closed 4 years ago

sebastian-nagel commented 4 years ago

The optimization that bypasses the internal buffer on reads when the destination buffer has a higher capacity than the internal buffer may cause a read past the end of the current chunk.

Reproducible with http_chunked_4.warc.gz and a buffer of 16 kB, e.g.,

ByteBuffer buffer = ByteBuffer.allocate(16384);
while (payload.get().body().read(buffer) > -1);

The chunk has size 16122. After the first read, the second (bypassed) read will consume all input until EOF (end of the WARC record). It must be ensured that nothing beyond the content of a single chunk is forwarded to the destination buffer.
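One possible way to enforce this on the bypass path is to temporarily lower the destination buffer's limit to the number of bytes still unread in the current chunk. The following is only a minimal sketch, not jwarc's actual code: the class `BoundedDirectRead`, the method `readBounded`, and the `remaining` parameter are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

// Hypothetical helper, not part of jwarc.
class BoundedDirectRead {

    /**
     * Reads from the channel directly into dst, but never more than
     * `remaining` bytes (assumed to be the unread part of the current chunk).
     * Returns the number of bytes read, or -1 if the channel is at EOF.
     */
    static int readBounded(ReadableByteChannel channel, ByteBuffer dst, long remaining)
            throws IOException {
        if (remaining <= 0 || !dst.hasRemaining()) {
            return 0;
        }
        int savedLimit = dst.limit();
        int allowed = (int) Math.min(dst.remaining(), remaining);
        // Shrink the writable window so the channel cannot read past the chunk end.
        dst.limit(dst.position() + allowed);
        try {
            return channel.read(dst);
        } finally {
            // Restore the caller's limit whether or not the read succeeded.
            dst.limit(savedLimit);
        }
    }

    public static void main(String[] args) throws IOException {
        // 10 bytes of input, of which only 4 belong to the current chunk.
        ReadableByteChannel channel =
                Channels.newChannel(new ByteArrayInputStream(new byte[10]));
        ByteBuffer dst = ByteBuffer.allocate(16);
        int n = readBounded(channel, dst, 4);
        System.out.println("read=" + n + " position=" + dst.position()
                + " limit=" + dst.limit());
    }
}
```

The caller would subtract the returned count from its own chunk counter and parse the next chunk header only once that counter reaches zero.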

Note: if the internal buffer has been bypassed, the error message in line 52, raised while refilling the internal buffer, is wrong/misleading because it uses the outdated internal buffer to show the context. This should also be fixed.