iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

Chunked transfer-encoding causes exceptions at end of WARC record #24

Closed: sebastian-nagel closed this issue 4 years ago

sebastian-nagel commented 4 years ago

Reading a payload with Transfer-Encoding: chunked may throw an exception after the entire chunked body has been consumed.

The WARC files were recorded with Wget. See #23 for logging of the current context (position in buffer/stream).

sebastian-nagel commented 4 years ago

Is anybody already working on this? If not, I will try to fix it.

ato commented 4 years ago

http_chunked_1b.warc.gz

It looks like I mistranslated the BNF. RFC 7230 says:

last-chunk     = 1*("0") [ chunk-ext ] CRLF

But ChunkedBody.rl has:

last_chunk = "0\r\n";

So that probably needs to be:

last_chunk = "0"+ chunk_ext* "\r\n";

in order to correctly match the zero-padded chunk length "00000000".
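
To illustrate the difference between the two rules outside of Ragel, here is a minimal Java sketch that uses regular expressions as a stand-in for the grammar. It is not jwarc code: the class name LastChunkDemo and both patterns are hypothetical, and chunk-ext is simplified to ";" followed by any characters up to the CRLF rather than the full RFC 7230 production.

```java
import java.util.regex.Pattern;

public class LastChunkDemo {
    // Stand-in for the old rule: exactly one "0" followed by CRLF.
    static final Pattern STRICT = Pattern.compile("0\r\n");

    // Stand-in for the corrected rule: 1*("0") [ chunk-ext ] CRLF.
    // Assumption: chunk-ext is approximated as ';' plus any non-CR/LF characters.
    static final Pattern RELAXED = Pattern.compile("0+(?:;[^\r\n]*)?\r\n");

    static boolean strict(String line) {
        return STRICT.matcher(line).matches();
    }

    static boolean relaxed(String line) {
        return RELAXED.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"0\r\n", "00000000\r\n", "0;name=value\r\n", "1a\r\n"};
        for (String s : samples) {
            // trim() only removes the CRLF so the output stays on one line
            System.out.println(s.trim() + " strict=" + strict(s) + " relaxed=" + relaxed(s));
        }
    }
}
```

The zero-padded line "00000000" is rejected by the strict pattern but accepted by the relaxed one, while a non-zero chunk size such as "1a" is rejected by both.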

http_chunked_2.warc.gz

Checking it with a hex editor, it looks to me like the value of the WARC Content-Length header in this record is off by one again. As a result, the first CR of the CRLFCRLF record trailer is being interpreted as part of the payload.
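
The effect of such an off-by-one can be sketched in a few lines of Java. This is an illustration with made-up bytes, not the contents of http_chunked_2.warc.gz and not jwarc code: the class ContentLengthOffByOne and its helpers are hypothetical, and the block is assumed to be just an empty chunked body ("0" CRLF CRLF) followed by the CRLFCRLF record trailer.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ContentLengthOffByOne {
    // Every WARC record is followed by two CRLFs.
    static final byte[] TRAILER = "\r\n\r\n".getBytes(StandardCharsets.US_ASCII);

    // Returns the bytes a reader would treat as the record block,
    // given the declared Content-Length.
    static byte[] block(byte[] data, int declaredLength) {
        return Arrays.copyOfRange(data, 0, declaredLength);
    }

    // True if exactly the CRLFCRLF trailer remains after the block.
    static boolean trailerFollows(byte[] data, int declaredLength) {
        byte[] rest = Arrays.copyOfRange(data, declaredLength, data.length);
        return Arrays.equals(rest, TRAILER);
    }

    public static void main(String[] args) {
        // Hypothetical data: empty chunked body (5 bytes) + record trailer (4 bytes).
        byte[] data = "0\r\n\r\n\r\n\r\n".getBytes(StandardCharsets.US_ASCII);

        int correct = 5;      // actual block length
        int offByOne = 6;     // declared length one byte too large

        System.out.println("correct length, trailer follows: " + trailerFollows(data, correct));
        System.out.println("off by one, trailer follows: " + trailerFollows(data, offByOne));

        byte[] tooLong = block(data, offByOne);
        // The extra byte is the first CR of the trailer, which a chunked-body
        // parser would see as unexpected input after the last chunk.
        System.out.println("extra byte is CR: " + (tooLong[tooLong.length - 1] == '\r'));
    }
}
```

With the declared length one byte too large, the block ends in a stray CR and only LF CR LF is left where the trailer should be, which is consistent with a failure reported only after the whole chunked body has been read.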

sebastian-nagel commented 4 years ago

last_chunk = "0"+ chunk_ext* "\r\n";

Yes. I also had to check RFC 5234 to confirm that 1* means "at least one" and not "exactly one". Otherwise 1*HEXDIG would make no sense either.

ato commented 4 years ago

I've created issue #29 for the Content-Length off-by-one problem seen in http_chunked_2.warc.gz.