iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0
48 stars 8 forks source link

Avoid unchecked exceptions caused by malformed HTTP captures #39

Closed sebastian-nagel closed 4 years ago

sebastian-nagel commented 4 years ago

The WARC parser often throws unchecked exceptions (IllegalArgumentException) when the input cannot be parsed or if it violates certain constraints (examples below). These exceptions make it nearly impossible to use jwarc to parse real-world HTTP captures because unchecked exceptions are not declared and in general considered to be unrecoverable. At least, the lenient parser (#25) should ignore malformed input and try to continue. Alternatively, checked exceptions could be used to force the user to handle the errors.

So far, I've run into these two issues:

  1. duplicate parameters in the "Content-Type" HTTP header: text/html;Charset=utf-8;charset=UTF-8. This is a frequent error, see examples in content_type_dupl_param-CC-MAIN-20200525032636-20200525062636-00118.warc.gz. In CdxTool the IllegalArgumentException is caught, but if this is the intended usage, it'd be better to throw a checked exception.
  2. duplicated HTTP header field Transfer-Encoding. So far, I've only seen a duplicated Transfer-Encoding: chunked which could be safely read as one single header, see examples in transfer_encoding_duplicated.warc.gz. In theory, the transfer encoding can be multi-valued (Transfer-Encoding: chunked, gzip) and RFC 7230, 3.2.2 states that two single-value header fields (chunked and gzip) are equivalent. But I have not yet seen an example for this.
ato commented 4 years ago

Ah, yes those definitely weren't thought through very well. I've made the following changes:

  1. Subsequent media type parameters with a duplicate key are now ignored rather than throwing. (Browsers are inconsistent on whether to take the first or last value but first is the direction WHATWG seem to be moving in.)
  2. Added MessageHeaders.contains(name, value) for the common case of looking for case-insensitive tokens in comma-separated lists. The Transfer-Encoding chunked check now uses this.
  3. Replaced the headers.sole() calls on HTTP headers with headers.first() as this is generally the recovery strategy browsers use.

I've left sole() in place for the WARC header accessors for now. Open to revisiting that with an opt-in lenient WARC parsing mode or if there's an effort to standardise how invalid WARC records should be interpreted.