iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

WARC 1.1: Introduce record-id BNF grammar rule for consistency with examples #24

Closed. ato closed 6 years ago

ato commented 9 years ago

In the examples and in all popular implementations, URIs in the WARC-Target-URI and WARC-Profile fields are not surrounded by "<" and ">" characters. This change makes the grammar consistent with practice by removing "<" and ">" from the basic uri rule and introducing a new record-id rule for the fields WARC-Record-ID, WARC-Concurrent-To, WARC-Refers-To, WARC-Warcinfo-ID and WARC-Segment-Origin-ID.
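To illustrate the intended split, here is a minimal Python sketch of a writer that wraps only the record-id fields in angle brackets and emits other URI-valued fields bare. The helper name `format_uri_field` and the example values are hypothetical, not part of the specification or of any existing library.

```python
# Fields that use the new record-id rule: "<" uri ">".
# Names are lowercased because WARC field names are case-insensitive.
RECORD_ID_FIELDS = {
    "warc-record-id",
    "warc-concurrent-to",
    "warc-refers-to",
    "warc-warcinfo-id",
    "warc-segment-origin-id",
}


def format_uri_field(name: str, uri: str) -> str:
    """Serialise one URI-valued named field (hypothetical helper).

    Record-id fields are wrapped in angle brackets; every other
    URI-valued field (for example the target URI or the profile)
    is written as a bare URI.
    """
    if name.lower() in RECORD_ID_FIELDS:
        value = f"<{uri}>"
    else:
        value = uri
    return f"{name}: {value}"


if __name__ == "__main__":
    # Illustrative values only.
    print(format_uri_field("WARC-Record-ID", "urn:uuid:00000000-0000-0000-0000-000000000000"))
    print(format_uri_field("WARC-Target-URI", "http://example.com/"))
```

Keeping the list of bracketed fields in one set makes it easy to check against the grammar if the rule changes again.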

Fixes #23

kris-sigur commented 9 years ago

Makes sense to me.

I wonder if it is appropriate to include some kind of "errata" as well to address how this was mishandled in the previous standard?

anjackson commented 9 years ago

I added a Document History section with this kind of thing in mind, but maybe a dedicated Errata bit would be better?

https://github.com/iipc/warc-specifications/blob/gh-pages/specifications/warc-format/warc-1.1/index.md#document-history

ato commented 9 years ago

My experience in this area is very limited, but in most of the standards I have read, the errata are a separate document associated with the version containing the error, e.g. #25.

The revisions I've seen note changes that raise compatibility concerns either in a "Changes since 1.0" section or inline where the relevant item is discussed. For example:

In version 1.0 of the WARC standard, the uri grammar rule was defined incorrectly with respect to the examples in the specification and to common implementations. For compatibility, implementations may choose to accept, but should never emit, URIs surrounded by '<' and '>' in the WARC-Target-URI and WARC-Profile fields.
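A reader following that wording could look roughly like the sketch below: it tolerates the bracketed 1.0-style form on input and always returns the bare URI. The function name `read_uri_value` and the field set are assumptions made for illustration, not an existing API.

```python
# Fields whose values are bare URIs under the revised grammar
# (lowercased, since field names are case-insensitive).
BARE_URI_FIELDS = {"warc-target-uri", "warc-profile"}


def read_uri_value(name: str, raw: str) -> str:
    """Return the field value, tolerating legacy "<uri>" input (hypothetical helper).

    For bare-URI fields, one pair of enclosing angle brackets is
    stripped if present. Values of all other fields are returned
    unchanged apart from surrounding whitespace.
    """
    value = raw.strip()
    is_bracketed = len(value) >= 2 and value.startswith("<") and value.endswith(">")
    if name.lower() in BARE_URI_FIELDS and is_bracketed:
        value = value[1:-1]
    return value


if __name__ == "__main__":
    # Both spellings yield the same bare URI.
    print(read_uri_value("WARC-Target-URI", "<http://example.com/>"))
    print(read_uri_value("WARC-Target-URI", "http://example.com/"))
```

Because the reader normalises to the bare form, a writer that emits the stored value as-is never reintroduces the brackets.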

@anjackson, should I add a document history entry to this pull request? I'd be happy to do so. I wasn't sure if it would cause problems when merging and whether the date should refer to now or the date of merging.

saraaubry commented 9 years ago

The following changes were integrated into the revised ISO draft during the ISO working group meeting on November 16-17, 2015:

In section 4 (File and record model), change the definition of uri and add a note:

uri = <'URI' per RFC3986>

NOTE: In the WARC 1.0 standard (ISO 28500:2009), uri was defined as "<" <'URI' per RFC3986> ">". This rule has been changed to meet requests from implementers.

saraaubry commented 6 years ago

Included in WARC 1.1