sebastian-nagel closed this issue 4 years ago
Anybody already working on this? Otherwise I would try to fix it.
It looks like I mistranslated the ABNF. RFC 7230 says:
last-chunk = 1*("0") [ chunk-ext ] CRLF
But ChunkedBody.rl has:
last_chunk = "0\r\n";
So that probably needs to be:
last_chunk = "0"+ chunk_ext* "\r\n";
in order to correctly match the zero-padded chunk length "00000000".
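To illustrate the difference between the two rules, here is a minimal Java sketch that uses regular expressions as a stand-in for the Ragel grammar. The class name `LastChunkCheck`, the helper methods and the simplified `CHUNK_EXT` pattern (no quoted-string values) are assumptions made for this example only; none of it is code from ChunkedBody.rl.

```java
import java.util.regex.Pattern;

public class LastChunkCheck {
    // Simplified, hypothetical stand-in for chunk-ext:
    // zero or more ";name" or ";name=value" parts, without quoted strings.
    static final String CHUNK_EXT = "(?:;[^;=\\r\\n]+(?:=[^;\\r\\n]*)?)*";

    // Old rule: exactly one "0" followed by CRLF.
    static final Pattern OLD_LAST_CHUNK = Pattern.compile("0\\r\\n");

    // Proposed rule: one or more "0", optional extensions, then CRLF.
    static final Pattern NEW_LAST_CHUNK = Pattern.compile("0+" + CHUNK_EXT + "\\r\\n");

    static boolean matchesOld(String line) {
        return OLD_LAST_CHUNK.matcher(line).matches();
    }

    static boolean matchesNew(String line) {
        return NEW_LAST_CHUNK.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesOld("0\r\n"));         // true
        System.out.println(matchesOld("00000000\r\n"));  // false: zero-padded length is rejected
        System.out.println(matchesNew("00000000\r\n"));  // true
        System.out.println(matchesNew("0;foo=bar\r\n")); // true: chunk extension is allowed
    }
}
```

The old pattern only accepts a single "0", which is why a zero-padded last chunk such as "00000000" fails to match.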
Checking with a hex editor, it looks to me like the value of the WARC Content-Length header in the record is off by one again, so the first CR of the CRLFCRLF trailer is being interpreted as part of the payload.
last_chunk = "0"+ chunk_ext* "\r\n";
Yes. I also had to check RFC 5234 to see that 1* means "at least one" and not "exactly one". Otherwise 1*HEXDIG would make no sense either.
I've created issue #29 for the Content-Length off-by-one problem seen in http_chunked_2.warc.gz.
Reading a payload with chunked Transfer-Encoding may result in an exception being thrown after the entire chunked body has been consumed.
EOFException (http_chunked_1b.warc.gz):
ParseException (http_chunked_2.warc.gz):
The WARC files were recorded using Wget. See #23 for the logging of the current context (position in buffer/stream).
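For readers unfamiliar with the format, the decoding step involved here can be sketched roughly as below. This is a simplified, hypothetical decoder written for illustration (the class `ChunkedDecodeSketch` and its methods are assumptions, not the parser from this project); it ignores chunk extensions and trailer fields and assumes well-formed hex sizes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ChunkedDecodeSketch {
    // Reads one line ending in LF and returns it without the trailing CRLF (or LF).
    static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                int len = sb.length();
                if (len > 0 && sb.charAt(len - 1) == '\r') sb.setLength(len - 1);
                return sb.toString();
            }
            sb.append((char) b);
        }
        throw new EOFException("stream ended inside a chunk header or trailer");
    }

    static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (true) {
            String header = readLine(in);
            int semi = header.indexOf(';'); // drop any chunk extension
            String sizeField = (semi >= 0 ? header.substring(0, semi) : header).trim();
            int size = Integer.parseInt(sizeField, 16); // "00000000" parses to 0
            if (size == 0) break; // last chunk
            byte[] data = new byte[size];
            int off = 0;
            while (off < size) {
                int n = in.read(data, off, size - off);
                if (n < 0) throw new EOFException("stream ended inside chunk data");
                off += n;
            }
            out.write(data, 0, size);
            readLine(in); // consume the CRLF that follows the chunk data
        }
        // Skip optional trailer fields up to and including the final empty line.
        while (!readLine(in).isEmpty()) { }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "4\r\nWiki\r\n5\r\npedia\r\n00000000\r\n\r\n".getBytes(StandardCharsets.US_ASCII);
        byte[] body = decode(new ByteArrayInputStream(raw));
        System.out.println(new String(body, StandardCharsets.US_ASCII)); // Wikipedia
    }
}
```

In this sketch, a stream that is cut short before the final CRLF raises an EOFException even though all chunk data has already been read, which is the general kind of late failure described above.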