Closed sebastian-nagel closed 4 years ago
Ah, yes those definitely weren't thought through very well. I've made the following changes:
I've left sole() in place for the WARC header accessors for now. Open to revisiting that with an opt-in lenient WARC parsing mode or if there's an effort to standardise how invalid WARC records should be interpreted.
The WARC parser often throws unchecked exceptions (IllegalArgumentException) when the input cannot be parsed or violates certain constraints (examples below). These exceptions make it nearly impossible to use jwarc to parse real-world HTTP captures: unchecked exceptions are not declared and are generally considered unrecoverable. At the very least, the lenient parser (#25) should ignore malformed input and try to continue. Alternatively, checked exceptions could be used to force the user to handle the errors.
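To illustrate the checked-exception option, here is a minimal sketch. It does not use any jwarc API: `MalformedRecordException`, `CheckedParsing`, and `Parser` are hypothetical names, and `Integer::parseInt` merely stands in for a parser that throws an `IllegalArgumentException` subclass on bad input.

```java
import java.io.IOException;

/** Hypothetical checked exception for illustration; not part of jwarc. */
class MalformedRecordException extends IOException {
    MalformedRecordException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class CheckedParsing {
    /** Stand-in for any parser that may throw IllegalArgumentException. */
    interface Parser<T> {
        T parse(String input);
    }

    /** Runs the parser and converts the unchecked failure into a checked one. */
    static <T> T parseChecked(Parser<T> parser, String input) throws MalformedRecordException {
        try {
            return parser.parse(input);
        } catch (IllegalArgumentException e) {
            throw new MalformedRecordException("cannot parse: " + input, e);
        }
    }

    public static void main(String[] args) {
        try {
            Integer ok = parseChecked(Integer::parseInt, "42");
            System.out.println("parsed: " + ok);
            parseChecked(Integer::parseInt, "not a number");
        } catch (MalformedRecordException e) {
            // The compiler forces callers to handle this case, e.g. by skipping the record.
            System.out.println("skipping malformed input: " + e.getMessage());
        }
    }
}
```

With a declared exception, a caller that iterates over many records can decide per record whether to skip or abort, instead of discovering the failure mode at runtime.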
So far, I've run into these two issues:
- A duplicated Content-Type parameter, e.g. `text/html;Charset=utf-8;charset=UTF-8`. This is a frequent error, see examples in content_type_dupl_param-CC-MAIN-20200525032636-20200525062636-00118.warc.gz. In CdxTool the IllegalArgumentException is caught, but if this is the intended usage, it'd be better to throw a checked exception.
- A repeated `Transfer-Encoding` header. So far, I've only seen a duplicated `Transfer-Encoding: chunked`, which could safely be read as a single header, see examples in transfer_encoding_duplicated.warc.gz. In theory, the transfer encoding can be multi-valued (`Transfer-Encoding: chunked, gzip`), and RFC 7230, 3.2.2 states that two single-value header fields (`chunked` and `gzip`) are equivalent to it. But I have not yet seen an example of this.
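For discussion, a rough sketch of what lenient handling of both cases might look like. This is not jwarc code: `LenientHeaders` and its methods are hypothetical. It also rests on two assumptions that go beyond the RFC: a repeated identical transfer coding is collapsed into one, and the first occurrence of a duplicated media type parameter wins. Quoted parameter values containing `;` are not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class LenientHeaders {
    /**
     * Combines repeated Transfer-Encoding field values in order (RFC 7230, 3.2.2).
     * Assumption: a coding that appears more than once is kept only once.
     */
    static List<String> mergeTransferEncodings(List<String> fieldValues) {
        List<String> codings = new ArrayList<>();
        for (String value : fieldValues) {
            for (String token : value.split(",")) {
                String coding = token.trim().toLowerCase(Locale.ROOT);
                if (!coding.isEmpty() && !codings.contains(coding)) {
                    codings.add(coding);
                }
            }
        }
        return codings;
    }

    /**
     * Extracts media type parameters with case-insensitive names.
     * Assumption: on duplicates the first value wins; malformed parts are skipped.
     */
    static Map<String, String> parseParameters(String contentType) {
        Map<String, String> params = new LinkedHashMap<>();
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) { // parts[0] is the media type itself
            int eq = parts[i].indexOf('=');
            if (eq < 0) {
                continue; // no "name=value" shape, ignore instead of throwing
            }
            String name = parts[i].substring(0, eq).trim().toLowerCase(Locale.ROOT);
            String value = parts[i].substring(eq + 1).trim();
            if (!name.isEmpty()) {
                params.putIfAbsent(name, value);
            }
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(mergeTransferEncodings(List.of("chunked", "chunked"))); // [chunked]
        System.out.println(mergeTransferEncodings(List.of("chunked", "gzip")));    // [chunked, gzip]
        System.out.println(parseParameters("text/html;Charset=utf-8;charset=UTF-8")); // {charset=utf-8}
    }
}
```

Whether "first wins" or "last wins" is the better rule for duplicated parameters is debatable. The sketch only shows that neither case has to end in an exception.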