iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

Revise WARC-Date to allow varying levels of precision #21

Closed nlevitt closed 6 years ago

nlevitt commented 9 years ago

Revise WARC-Date specification to permit values with varying levels of precision. It is the same as the "Alternative Proposed Revised Spec" from http://nlevitt.github.io/warc-specifications/specifications/warc-date/allow-more-precise.html but with the addition of the sentence "This document recommends no particular algorithm for choosing a record by date when an exact match is not available." I also added an entry to Document History. See also https://github.com/iipc/warc-specifications/pull/6

saraaubry commented 9 years ago

The following changes were integrated into the revised ISO draft during the ISO working group meeting on November 16-17, 2015:

*section 5.4 WARC-Date

The WARC-Date is a UTC timestamp as described in the W3C profile of ISO8601 [W3CDTF], for example YYYY-MM-DDThh:mm:ssZ. The timestamp shall represent the instant that data capture for record creation began. Multiple records written as part of a single capture event (see section 5.7) shall use the same WARC-Date, even though the times of their writing will not be exactly synchronized.

WARC-Date may be specified at any of the levels of granularity described in [W3CDTF]. If WARC-Date includes a decimal fraction of a second, the decimal fraction of a second shall have a minimum of 1 digit and a maximum of 9 digits. WARC-Date should be specified with as much precision as is accurately known. This document recommends no particular algorithm for access software to choose a record by date when an exact match is not available.

WARC-Date   = "WARC-Date" ":" w3c-iso8601
w3c-iso8601 = <a UTC timestamp formatted according to [W3CDTF]>
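As an illustration of the rule above, here is a minimal Python sketch of a syntax check for WARC-Date values. It is not part of the draft. The name `is_valid_warc_date` is hypothetical, and the pattern encodes one reading of the text: the W3CDTF granularities (year, month, day, minute, second, fractional second), UTC written only as `Z`, and 1 to 9 fractional digits.

```python
import re

# Assumed reading of the draft: W3CDTF levels of granularity, UTC only ("Z"),
# and between 1 and 9 fractional-second digits when a fraction is present.
WARC_DATE_RE = re.compile(
    r"(?P<year>\d{4})"
    r"(?:-(?P<month>\d{2})"
    r"(?:-(?P<day>\d{2})"
    r"(?:T(?P<hour>\d{2}):(?P<minute>\d{2})"
    r"(?::(?P<second>\d{2})(?:\.(?P<fraction>\d{1,9}))?)?"
    r"Z)?)?)?"
)


def is_valid_warc_date(value: str) -> bool:
    """Return True if value has the shape of a WARC-Date as read above.

    Only the syntax is checked; field ranges (month 01-12, hour 00-23, ...)
    are deliberately left out of this sketch.
    """
    return WARC_DATE_RE.fullmatch(value) is not None
```

The check is purely syntactic, so a value such as `2015-13` would still pass; a real implementation would probably also validate the field ranges.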

*new section on WARC-Refers-To-Date:

WARC-Refers-To-Date   = "WARC-Refers-To-Date" ":" w3c-iso8601
w3c-iso8601 = <a UTC timestamp formatted according to [W3CDTF]>

saraaubry commented 9 years ago

To do: write use cases with a less and a more precise date.

anjackson commented 8 years ago

Note this interesting proposed standard for being more precise about approximate dates: http://www.loc.gov/standards/datetime/pre-submission.html#uncertain

saraaubry commented 8 years ago

Hi @nlevitt, we need to write a use case with different WARC-Date examples to illustrate the changes. What do you think about these?

Use case A.X: the timestamp of a record created in 2015 may have different levels of precision.

A web page has been captured by a crawler - WARC-Date: 2015-12-11T23:24:25Z
A video has been donated by a website owner and converted to the WARC format - WARC-Date: 2015-12
An image has been captured through a warcproxy - WARC-Date: 2015-12-11T23:24:25,4+01:00Z

nlevitt commented 8 years ago

Are you looking for edits on the text or just the timestamps? If just the timestamps, these are fine

High precision dates would look like these:

saraaubry commented 8 years ago

If you have a more detailed, real-life use case for a high-precision date (rather than the one I made up: "An image has been captured through a warcproxy"), it would probably be clearer.

nlevitt commented 8 years ago

High precision dates should be the norm, with the adoption of 1.1. "WARC-Date should be specified with as much precision as is accurately known." We should avoid giving the impression that, for example, warcprox would write warcs with more precise dates than heritrix. So maybe:

A web page has been captured by a WARC 1.1 compliant crawler: WARC-Date: 2015-12-11T23:24:25.412030Z
A web page has been captured by a legacy crawler: WARC-Date: 2015-12-11T23:24:25Z
A video has been donated by a website owner and converted to the WARC format: WARC-Date: 2015-12
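To make the difference between these three values concrete, here is a rough Python sketch that reports the finest component present in a WARC-Date. The function name `warc_date_precision` and its labels are hypothetical, and the code assumes the input is already a well-formed UTC value in one of the W3CDTF shapes; it does no validation.

```python
def warc_date_precision(value: str) -> str:
    """Name the finest component present in a well-formed WARC-Date.

    Assumes value is already valid (YYYY, YYYY-MM, YYYY-MM-DD, or a full
    timestamp ending in "Z"); malformed input is not detected here.
    """
    if "T" not in value:
        # Date-only forms differ only in length: YYYY, YYYY-MM, YYYY-MM-DD.
        return {4: "year", 7: "month", 10: "day"}[len(value)]
    time_part = value.split("T", 1)[1].rstrip("Z")
    if "." in time_part:
        digits = len(time_part.split(".", 1)[1])
        return f"fractional second ({digits} digits)"
    # hh:mm has one colon, hh:mm:ss has two.
    return "second" if time_part.count(":") == 2 else "minute"


for example in ("2015-12-11T23:24:25.412030Z", "2015-12-11T23:24:25Z", "2015-12"):
    print(example, "->", warc_date_precision(example))
```

How access software should use that precision when picking a record is left open, in line with the draft, which recommends no particular algorithm.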