iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

WARC 1.0 quirk: angle brackets around WARC-Target-URI #30

Closed: ato closed this issue 4 years ago

ato commented 4 years ago

In WARC 1.0, the grammar specified the value of the WARC-Target-URI field as being wrapped in < and >. This was likely an editing mistake, as it was not present in earlier drafts of the standard and is inconsistent with both the examples in the standard itself and most implementations. It was corrected in WARC 1.1.

However, some software, such as wget 1.20.3, generates WARCs with angle brackets in this field, and that really is what the standard said, so we should strip them.
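
A minimal sketch of the stripping, so that a value such as `<http://example.org/>` and a plain `http://example.org/` both yield the same URI. The class name `TargetUriQuirk` and the method `stripAngleBrackets` are hypothetical names chosen for illustration; this is not jwarc's actual API or necessarily how the fix was implemented.

```java
// Hypothetical helper for illustration only; not part of jwarc's API.
final class TargetUriQuirk {

    // Removes one pair of enclosing angle brackets, if present.
    // Values without both a leading '<' and a trailing '>' are returned unchanged.
    static String stripAngleBrackets(String value) {
        if (value != null
                && value.length() >= 2
                && value.startsWith("<")
                && value.endsWith(">")) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }

    public static void main(String[] args) {
        // WARC 1.0 grammar style, as written by some tools
        System.out.println(stripAngleBrackets("<http://example.org/>"));
        // WARC 1.1 style and most existing files
        System.out.println(stripAngleBrackets("http://example.org/"));
    }
}
```

Requiring both brackets before stripping keeps a value with only a single stray `<` or `>` untouched, so a malformed field is not silently altered.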