iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

'conversion' records are underspecified #40

Open ato opened 6 years ago

ato commented 6 years ago

Problem 1: dates

What should WARC-Date on a 'conversion' record be? Section 5.4 says:

The timestamp shall represent the instant that data capture for record creation began.

Does 'data capture' in the context of a conversion refer to the capture of the original record? Or does it refer to the moment you started writing the transformed content? If the former, how do you record the date of the transformation? If the latter, how do you know the date the resource was originally archived? Presumably by following the WARC-Refers-To header, right?
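To make that lookup concrete, here is a rough sketch of following WARC-Refers-To back to the original capture. It assumes records are available as plain dicts of WARC headers in a hypothetical in-memory `index` keyed by WARC-Record-ID; neither the function name nor the index comes from the spec or from any existing tool.

```python
from typing import Optional


def original_capture_date(record: dict, index: dict) -> Optional[str]:
    """Follow WARC-Refers-To until a non-conversion record is reached.

    `index` is a hypothetical mapping of WARC-Record-ID -> header dict.
    Returns that record's WARC-Date, or None if the chain is broken.
    """
    seen = set()
    while record.get("WARC-Type") == "conversion":
        target_id = record.get("WARC-Refers-To")
        # No reference, or a reference loop: the date cannot be resolved.
        if target_id is None or target_id in seen:
            return None
        seen.add(target_id)
        record = index.get(target_id)
        # The referenced record has not survived.
        if record is None:
            return None
    return record.get("WARC-Date")
```

Note that the function returns `None` as soon as any record in the chain is missing, which is exactly the dependency on the original record discussed next.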

However section 6.8 'conversion' includes this statement:

Each transformation should result in a freestanding, complete record, with no dependency on survival of the original record.

Which implies you should not rely on the original record for anything... but how do you actually do that?

One solution to this problem would be to allow and recommend WARC-Refers-To-Date on 'conversion' records. The case of a conversion of a conversion needs specifying too.
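As a sketch of what that proposal might look like in practice, the following builds the named fields for a 'conversion' record. It assumes WARC-Date holds the time of the conversion and WARC-Refers-To-Date holds the original capture date. The function name and the rule for a conversion of a conversion (carry the earliest capture date forward) are assumptions for illustration only; the spec does not currently define either.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def conversion_headers(original: dict, block: bytes, content_type: str,
                       now: Optional[datetime] = None) -> dict:
    """Build WARC named fields for a conversion of `original` (a header dict)."""
    now = now or datetime.now(timezone.utc)
    # Assumed rule: when converting a conversion, keep pointing at the
    # date of the very first capture rather than the intermediate record.
    if original.get("WARC-Type") == "conversion" and "WARC-Refers-To-Date" in original:
        capture_date = original["WARC-Refers-To-Date"]
    else:
        capture_date = original["WARC-Date"]
    return {
        "WARC-Type": "conversion",
        "WARC-Record-ID": f"<urn:uuid:{uuid.uuid4()}>",
        # When the transformed content was written.
        "WARC-Date": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "WARC-Refers-To": original["WARC-Record-ID"],
        # Proposed use on 'conversion' records: when the resource was captured.
        "WARC-Refers-To-Date": capture_date,
        "WARC-Target-URI": original["WARC-Target-URI"],
        "Content-Type": content_type,
        "Content-Length": str(len(block)),
    }
```

With both dates present, the original capture date survives on the conversion record even if the referenced record is later lost.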

Problem 2: protocol headers

If you convert a request or response record, do you convert the HTTP headers too? If you don't, we run into the 'freestanding, complete record' problem again. Some HTTP headers are necessary for replay.

The examples and this statement sort of imply you don't include protocol headers:

For ‘conversion’ records, the payload is defined as the record block.

Can you use a conversion record to transform from one protocol to another?

Problem 3: determining the type of the original record

Again we trip over 'freestanding, complete'. After the original record is lost, how do you know whether the conversion was made from a 'response' or a 'request' record? Nothing seems to imply you couldn't make a 'conversion' of a 'request' or even a 'warcinfo', for that matter.

ato commented 6 years ago

The implementation guidelines have this to say on the WARC-Date matter:

Note that a different behavior should be adopted for payload migration: according to the standard, the WARC-date of a conversion record is the date of the creation of the new record, that is when the migration occurred. There is indeed a great difference between converting a file from a container format to another, and migrating the format of this file.