iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

Optional space after chunk-size in chunked transfer-encoding #33

Closed sebastian-nagel closed 4 years ago

sebastian-nagel commented 4 years ago

Some servers put optional space after the chunk-size, which causes the following exception:

org.netpreserve.jwarc.ParsingException: chunked encoding at position 6944: ..."></span></a><ul class=dropdown-men\r\nD61<-- HERE --> \r\nu><li><a href="/mena/en/marketing/cor...
        at org.netpreserve.jwarc.ChunkedBody.parse(ChunkedBody.java:203)
        at org.netpreserve.jwarc.ChunkedBody.read(ChunkedBody.java:70)

Captured using wget: http_chunked_3c.warc.gz

It looks like the chunk-size is padded with blanks when it is shorter than 4 hex digits. RFC 7230 does not allow optional white space in this position. However, assuming that the Server header correctly indicates "Apache-Coyote/1.1", I tried to figure out whether this is a systematic problem. The issue is discussed in https://bz.apache.org/bugzilla/show_bug.cgi?id=41364, and it turns out that RFC 2616 allows optional "linear white space" after the chunk-size, and possibly also in other positions where it is not yet handled:

implied *LWS
    The grammar described by this specification is word-based. Except where noted otherwise, linear white space (LWS) can be included between any two adjacent words (token or quoted-string), and between adjacent words and separators, without changing the interpretation of a field.
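
To illustrate the lenient behaviour discussed above, here is a minimal sketch of a chunk-size line reader that tolerates blanks or tabs between the hex digits and the CRLF. This is not jwarc's ChunkedBody code: the class name LenientChunkSize and the methods hexValue and readChunkSize are hypothetical, and chunk extensions are simply skipped up to the next CR, which is a simplification.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of jwarc.
class LenientChunkSize {

    // Returns the value of an ASCII hex digit, or -1 if b is not one.
    static int hexValue(int b) {
        if (b >= '0' && b <= '9') return b - '0';
        if (b >= 'a' && b <= 'f') return b - 'a' + 10;
        if (b >= 'A' && b <= 'F') return b - 'A' + 10;
        return -1;
    }

    // Reads one chunk-size line: 1*HEXDIG *( SP / HTAB ) [ chunk-ext ] CRLF.
    // The optional white space is the lenient part (RFC 2616 implied *LWS).
    static long readChunkSize(InputStream in) throws IOException {
        long size = 0;
        int digits = 0;
        int b = in.read();
        while (hexValue(b) >= 0) {
            // At most 15 hex digits, so the value always fits in a positive long.
            if (digits++ >= 15) throw new IOException("chunk-size too long");
            size = size << 4 | hexValue(b);
            b = in.read();
        }
        if (digits == 0) throw new IOException("expected hex digit in chunk-size");

        // Tolerate padding such as "D61 \r\n" sent by some servers.
        while (b == ' ' || b == '\t') b = in.read();

        // Simplification: skip any chunk extension without parsing it.
        if (b == ';') {
            while (b != '\r' && b != -1) b = in.read();
        }
        if (b != '\r' || in.read() != '\n') {
            throw new IOException("expected CRLF after chunk-size");
        }
        return size;
    }

    public static void main(String[] args) throws IOException {
        byte[] line = "D61 \r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readChunkSize(new ByteArrayInputStream(line)));
    }
}
```

With the padded line from the exception above, "D61 \r\n", the sketch returns the chunk length instead of failing at the blank. Input that does not start with a hex digit is still rejected.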