Open justinlittman opened 9 years ago
The following changes have been integrated in the revised ISO draft during the ISO working group meeting on November 16-17, 2015:
in section 5.9, change the sentence "The WARC-Payload-Digest recorded in the first segment of a segmented record is the digest of the payload of the logical record." to "The WARC-Payload-Digest recorded in the segments of a segmented record is the digest of the payload of the logical record."
and add a WARC-Payload-Digest in the B.8 continuation record example.
If you don't have the payload on the first segment, you shall not use this field because it is not mandatory.
What does this mean (and can it be stated more clearly)? "The WARC-Payload-Digest recorded in the segments of a segmented record is the digest of the payload of the logical record."
@justinlittman It means that the WARC-Payload-Digest refers to the entire (reassembled) record/file (which may be spread over any number of WARC segment records).
Thanks for the clarification. I guess I'm confused by "recorded in the segments of a segmented record" (but not "The WARC-Payload-Digest is the digest of the payload of the logical record.")
Which segment should it appear in? First? Last? Any? Every? Some? Multiple? What if it conflicts?
@justinlittman The first one.
The WARC-Payload-Digest recorded in the first segment of a segmented record is the digest of the payload of the logical record.
This is all covered in Chapter 7: Record Segmentation
Note that the motivation for this proposed change is that recording it in the first segment is problematic. See the first comment in this ticket.
My apologies, I'd incorrectly remembered the exact history of this issue.
To be honest, I'm not sure a change is necessary here.
The WARC-Payload-Digest is an optional field. You could simply omit it in the scenario you describe in the initial post.
To be able to detect bitrot you can then use the WARC-Block-Digest.
You can then either add a metadata record with the final digest, add it to the final segment using a non-standard header or let downstream tools calculate it as needed.
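As a rough illustration of the "let downstream tools calculate it" option, here is a minimal Python sketch. It assumes the segment blocks have already been read and ordered by WARC-Segment-Number, that the caller knows how many leading bytes of the logical block are HTTP headers (the hypothetical `header_len` parameter), and that digests use the `sha1:<Base32>` labelling that is common in practice but not required by the spec. The function names are made up for this example.

```python
import base64
import hashlib


def labelled_sha1(data: bytes) -> str:
    """Return a digest labelled as 'sha1:<Base32>' (an assumed, common convention)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def block_digests(segment_blocks):
    """One digest per segment block; this is what helps detect bitrot per record."""
    return [labelled_sha1(block) for block in segment_blocks]


def logical_payload_digest(segment_blocks, header_len=0):
    """Digest of the reassembled payload, hashed incrementally.

    segment_blocks: iterable of bytes, in segment order.
    header_len: number of leading bytes (e.g. HTTP headers) to exclude.
    """
    running = hashlib.sha1()
    to_skip = header_len
    for block in segment_blocks:
        if to_skip:
            # Drop header bytes, which may span more than one segment.
            skipped = min(to_skip, len(block))
            block = block[skipped:]
            to_skip -= skipped
        running.update(block)
    return "sha1:" + base64.b32encode(running.digest()).decode("ascii")
```

Because the hash is updated block by block, the reader never needs the whole logical record in memory at once.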
The change proposal @saraaubry entered seems fraught with danger, notably of collisions as @justinlittman has pointed out. It also risks confusion between this field and WARC-Block-Digest. As it stands it should not be adopted.
At minimum it should be stated that if there is more than one segment with a WARC-Payload-Digest field, they must all be equal. If they are not equal, then all the relevant records should be regarded as invalid. Alternatively, it could state that only one of the segments may have this field.
It might also make sense to only allow this on the first and last segment.
However it is very hard to validate any of these rules as doing so requires scanning multiple WARCs.
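To make the proposed equality rule concrete, here is a small Python sketch of the check. It assumes the headers of every segment of one logical record have already been gathered (for example by matching WARC-Segment-Origin-ID, possibly across several WARC files, which is the expensive part) and parsed into dictionaries with normalized field names. The function name and the dictionary representation are hypothetical.

```python
def consistent_payload_digest(segment_headers):
    """Return the single WARC-Payload-Digest shared by the segments, or None.

    segment_headers: list of dicts mapping header name to value, one per segment.
    Raises ValueError if two segments carry different payload digests.
    """
    values = {
        headers["WARC-Payload-Digest"]
        for headers in segment_headers
        if "WARC-Payload-Digest" in headers
    }
    if len(values) > 1:
        raise ValueError(
            "conflicting WARC-Payload-Digest values: " + ", ".join(sorted(values))
        )
    # Zero values means no segment carried the optional field.
    return next(iter(values), None)
```

Segments that omit the field are simply ignored, which matches the field being optional.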
The more I read the spec in detail, the more I think that it's actually allowed to have WARC-Payload-Digest fields on continuation records (and also only there). Here's how I arrive at this conclusion:
The first point is obvious and has been mentioned before, but I'd like to reiterate it anyway since it's important for my interpretation below: the WARC-Payload-Digest field is optional, meaning it may be omitted at any time.
The WARC-Payload-Digest definition mentions this about segmentation:
the WARC-Payload-Digest recorded in the segments of a segmented record shall be the digest of the payload of the logical record
And the relevant part in the segmentation section is this:
The WARC-Payload-Digest recorded in the first segment of a segmented record is the digest of the payload of the logical record.
These sentences use "the WARC-Payload-Digest recorded in the (segments|first segment) of a segmented record" as the subject, i.e. they only specify what the value under those circumstances must be but not where the field is allowed. Notably, they don't say that the payload digest, if specified, must be on all or the first segment, respectively.
The segmentation section further says:
Segments other than the first should not contain other optional fields, as segments merely serve to continue the record data block of the first record.
Here, I understand "other optional fields" as "any optional field other than WARC-Block-Digest or WARC-Payload-Digest", since those are the two optional fields that are mentioned in the section. Furthermore, the above only states that the later segments "should not contain other optional fields", meaning it's not recommended but allowed.
These three quotes can be read as follows (in the same order):
In other words, it is permitted to specify WARC-Payload-Digest on any subset of the segments, which means that it can also be included only in the last segment but omitted from all others.
(I used the current version 1.1 of the spec above. For the record, in 1.0, I arrive at the conclusion that there is an ambiguity because it is permitted to specify WARC-Payload-Digest on continuation records but its value is not specified.)
Am I misreading something?
@JustAnotherArchivist, yes, I agree with your reading. The change in 1.1 removed the guidance on which segment should have the field. Note that lack of guidance is one point Justin and Kris were both objecting to in the comments above but I guess it passed anyway.
The record segmentation states:
This requires that a writer have the entire logical record before being able to write the first segment; in the case of really large payloads (which record segmentation is intended to support), this may be problematic.
My recommendation is to move this to the last segment.
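A sketch of why the last segment is the natural place from a writer's point of view: the digest can be updated as bytes arrive, but its final value only exists once the stream ends. The Python generator below is illustrative only. It assumes the incoming chunks are payload bytes (HTTP header handling is left out), uses a hypothetical `segment_size` limit, and labels the digest as `sha1:<Base32>`, a common convention rather than a requirement.

```python
import base64
import hashlib


def segment_stream(chunks, segment_size):
    """Yield (segment_number, block, payload_digest) tuples for a byte stream.

    payload_digest is None for every segment except the last one, because the
    digest of the whole payload is only known after the final chunk arrives.
    """
    running = hashlib.sha1()
    buffer = b""
    number = 0
    pending = None  # a full block held back until we know whether it is the last
    for chunk in chunks:
        running.update(chunk)
        buffer += chunk
        while len(buffer) >= segment_size:
            if pending is not None:
                number += 1
                yield number, pending, None
            pending, buffer = buffer[:segment_size], buffer[segment_size:]
    if buffer:
        if pending is not None:
            number += 1
            yield number, pending, None
        pending = buffer
    if pending is not None:
        number += 1
        digest = "sha1:" + base64.b32encode(running.digest()).decode("ascii")
        yield number, pending, digest
```

The writer here holds at most two segments' worth of data at a time, whereas putting the digest on the first segment would force it to buffer the entire response before writing anything.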
As some additional background: As part of the Social Feed Manager project, we're working on recording the Twitter streaming API to WARCs. As part of the streaming API, the HTTP response is kept open for an extended period of time (as in hours or days).