iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

WARC revision 1.1 (augmentation): create a new record type to record differences #30

Open saraaubry opened 9 years ago

saraaubry commented 9 years ago

A new record type shall be created to record diff results against the original record (rather than storing the new one). Full description to come by Eld Zierau.

ikreymer commented 9 years ago

A new record type has been approved for addition to the 1.1 standard to 'record differences' ?

I don't see any previous discussion of such a record type or what it means..

Given how many suggestions were marked as not mature enough for addition to the standard, it seems rather odd that a completely new record type has been approved without any previous discussion on this board whatsoever, if I understand this correctly..

ikreymer commented 9 years ago

Or is this the placeholder for such a discussion about the proposal? Sorry, it was unclear, as I received this update in the midst of the other updates on here regarding the decisions from the previous 1.1 meeting.

kris-sigur commented 9 years ago

I agree with @ikreymer. It would seem very hasty to include a new record type specification that has had near zero practical use or review.

A clarification would be most welcome, @saraaubry

anjackson commented 9 years ago

I think there is a practical implementation of this, but AFAIK only one, at @elzikb's organisation. I'd rather know that one or more organisations actually use (or at least intend to use) it before making it part of an international standard.

saraaubry commented 9 years ago

The idea is not to rush this topic into the 1.1 revision but to open the discussion. @elzikb has a practical use case for this. She will give more details about it.

kris-sigur commented 9 years ago

Good to hear. Also, in that case, I'd like to point @elzikb (and other interested parties) at this workshop I'm trying to organize: http://kris-sigur.blogspot.is/2015/11/workshop-on-missing-warc-features.html

elzikb commented 9 years ago

Let me first say that in my mind, it was covered in this issue. However, at the ISO WARC meeting on Tuesday, it turned out that we had different interpretations of this issue, and we discussed how it can be incorporated in WARC in a way that is in line with the principles of WARC. Therefore I have been sent back to formulate the changes in the form of a description, examples and use cases, which can be included in the WARC standard if the formulation is clear enough.

We have an actual need at KB for this, since we have chosen WARC as the packaging format for all digital materials (arguments presented at iPres 2012 https://phaidra.univie.ac.at/detail_object/o:293682) but we are aware that minor changes may need bit preservation, which can be expensive for minor corrections to large sources or metadata. This is the motivation for the change (and I believe it can turn out to be useful to web materials as well).

The suggested description, examples and use cases will be available later, - and I hope it will answer a lot of these questions.

elzikb commented 9 years ago

Sorry, an error got in there: the "misunderstood" issue was https://github.com/iipc/warc-specifications/issues/11

nclarkekb commented 9 years ago

To avoid locking anyone into a format they do not like, the diff record should probably include the same kind of content-type information used to record HTTP request/response payloads.

That is, something similar to "Content-Type: application/warc-diff; format=rfc3284/vcdiff"

One example could then be https://tools.ietf.org/html/rfc3284
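
A minimal sketch of how a writer might serialise a record carrying such a content type. The record type name `diff`, the helper `build_diff_record`, and the exact set of headers are assumptions made for illustration; none of this is defined in the WARC standard.

```python
import uuid
from datetime import datetime, timezone


def build_diff_record(target_uri, refers_to_id, payload, diff_format="rfc3284/vcdiff"):
    """Serialise a hypothetical 'diff' record whose block is an opaque delta."""
    headers = [
        ("WARC-Type", "diff"),  # hypothetical record type, not in WARC 1.0/1.1
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("WARC-Refers-To", refers_to_id),  # the original record the delta applies to
        # Content type as suggested in the comment above; the format parameter
        # names the delta encoding so readers are not tied to one algorithm.
        ("Content-Type", f"application/warc-diff; format={diff_format}"),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # A WARC record is the header block, the content block, then two CRLFs.
    return head.encode("utf-8") + payload + b"\r\n\r\n"
```

A reader that does not recognise the `format` value could then skip the record instead of misinterpreting the delta.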

elzikb commented 8 years ago

Sorry for the delay. Here it is in the attached PDF; it is the latest document, with smaller changes:
• Accepted changes are not shown, and the related acceptance comments are deleted
• There is an extra comment giving an example use case where such a record type is used for a record with a resource (not just metadata).

The record type started out being named "diff", since that is the most obvious way, but in the attached document it is called "WARC-representation", in order to be more general. WARC_representation_typeCLO_elziCLOelzi.pdf

nlevitt commented 8 years ago

Why was the name "representation" chosen? It doesn't make a lot of sense to me. I would change it back to "diff".

The text is not very clear about an important aspect of this record type.

5.8 WARC-Representation-Algorithm The WARC-Representation-Algorithm is used in ‘representation’ records. This parameter indicates the algorithm applied to an original object (pointed to by the WARC-Refers-To) in order to get the represented object.

6.6 'representation' A ‘representation’ record shall contain a representation expressed in terms of a modification of an original record's content that was created as the result of an archival process. Typically, this is used to hold smaller content updates...

What I surmise, is that the WARC-Representation-Algorithm has the form d = f(a, b) (it takes two arguments) where a is the payload of the record referred to by the WARC-Refers-To header, and b is a new fetch of the WARC-Target-URI. Another thing that is sort of mentioned (in the 5.11 WARC-Representation-Digest section), but needs to be explained better, is that b can be computed from a and d, by some other algorithm, i.e. b = g(a, d). (In the examples, g would be the /usr/bin/patch algorithm).
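
To make the d = f(a, b) and b = g(a, d) relationship concrete, here is a minimal sketch. It uses Python's `difflib.SequenceMatcher` as a stand-in for a real delta format such as VCDIFF; the list-of-operations delta and the `digest` helper are assumptions for illustration, as is the idea that the digest would be computed over the reconstructed b.

```python
import base64
import hashlib
from difflib import SequenceMatcher


def f(a: bytes, b: bytes) -> list:
    """Compute a delta d such that g(a, d) == b."""
    ops = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))    # reuse bytes a[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("data", b[j1:j2]))  # literal bytes that only exist in b
        # "delete": those bytes of a are simply not copied
    return ops


def g(a: bytes, d: list) -> bytes:
    """Rebuild b from the original payload a and the delta d."""
    out = bytearray()
    for op in d:
        if op[0] == "copy":
            out += a[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def digest(data: bytes) -> str:
    """SHA-1 digest in the 'sha1:<base32>' style commonly seen in WARC files."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


a = b"<title>Annual report 2014</title>"
b = b"<title>Annual report 2015</title>"
d = f(a, b)
assert g(a, d) == b
assert digest(g(a, d)) == digest(b)  # a reader could verify the reconstruction this way
```

Note that g needs the original payload a: the delta alone is not enough, which is why finding the referenced record matters.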

Maybe I'm wrong about that assumption, now that I notice this sidebar:

Kommentar [e2]: The use case is on a file suggesting binary comparison. It can also be a resource of marked up text including scanned text where errors are later discovered in the OCR.

Interesting use case. But it once again points to a missing piece in the proposal. How would you know, looking at the warc record, that it represents this? And importantly, that there was no new fetch of the url? If the value of WARC-Representation-Algorithm is something like DIFF and nothing more, there's no way to tell.

In the same sidebar: "Kommentar [CO3]: Not possible on response records?" Good question. And in the case of http response records, what about the http headers? Any special handling? My guess is that the diff would be assumed to apply only to the http payload. This needs to be spelled out.

Another problem with the proposal is that the original record can only be specified with WARC-Refers-To. As @kris-sigur has pointed out, no one indexes warcs by record id, and there wouldn't be any way to find the original record without implementing such an index, which would be a major effort. This problem could be resolved by allowing (requiring?) WARC-Refers-To-Target-URI and WARC-Refers-To-Date.
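
A minimal sketch of why the extra headers would help, assuming a CDX-style index that maps (URI, capture date) to a location in a WARC file. The function `resolve_original` and the dictionary-based index are hypothetical stand-ins, not part of any existing tool.

```python
def resolve_original(headers, cdx_index):
    """Locate the original record using URI and date rather than record id.

    headers:   dict of WARC named fields from the diff record
    cdx_index: dict mapping (uri, date) -> (warc_filename, offset); a stand-in
               for the URL/timestamp indexes that replay tools already keep
    """
    uri = headers.get("WARC-Refers-To-Target-URI")
    date = headers.get("WARC-Refers-To-Date")
    if uri is None or date is None:
        # Only WARC-Refers-To (a record id) is available, and this kind of
        # index cannot answer a lookup by record id.
        return None
    return cdx_index.get((uri, date))


index = {("http://example.org/", "2015-11-01T12:00:00Z"): ("a.warc.gz", 1234)}
with_hints = {
    "WARC-Refers-To": "<urn:uuid:0b1c2d3e-0000-0000-0000-000000000000>",
    "WARC-Refers-To-Target-URI": "http://example.org/",
    "WARC-Refers-To-Date": "2015-11-01T12:00:00Z",
}
assert resolve_original(with_hints, index) == ("a.warc.gz", 1234)
```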

It would also be nice to have this as a pull request. https://github.com/iipc/warc-specifications/pulls

kris-sigur commented 8 years ago

If allowed, use of WARC-Refers-To-Date should probably be optional. Use of WARC-Refers-To-Target-URI is probably nonsensical as it would always be the same as WARC-Target-URI.

I do share Noah's concerns about the term "representation". It is very non-intuitive.

I also have misgivings about the utility of this addition and would prefer to see the use case explored more fully before adding it to the standard. Such exploration would need to include the release of tools that both read and write such records. At the moment it is next to impossible to quantify the actual practical gains (or drawbacks!) of this type of record. At least I haven't seen anything examining it. If such research exists, do share!

ikreymer commented 8 years ago

Again, echoing my earlier comment, and this may sound harsh, but I think it needs to be said.

It seems like this new record is being pushed through by an organization close to the ISO approval process without any visibility or use case by anyone else. That seems to go against the intention of this open process.

The understanding was that the 1.1 standard would only support small changes or changes that were discussed and accepted by the community. To become an international standard, there should be multiple organizations committed to using this new record type, and several well developed open source tools supporting or intending to support it. This is clearly not the case at the moment, and this proposal is much less mature than other possible changes that were deemed not mature enough.

If this new record type makes it into 1.1 at the insistence of a single organization, then this whole open process of approving/suggesting changes becomes totally moot and irrelevant.

ikreymer commented 8 years ago

As an example, here are some possible WARC spec additions that are already in use: http://wpull.readthedocs.org/en/master/warc.html

IA stores many warcs created by wpull.

pywb supports reading these warcs, especially the youtube-dl metadata for loading videos (http://wpull.readthedocs.org/en/master/warc.html#youtube-dl)

brozzler may also support this format for youtube-dl (@nlevitt ?). So we have 2 or 3 tools already supporting particular specs to address certain common issues in web archiving (eg. video), and a de facto way of dealing with them.

In my opinion, these are the kinds of things that we should be standardizing, ways to address common problems in web archiving, not specific needs of a single institution.

nlevitt commented 8 years ago

@kris-sigur commented 7 hours ago

Use of WARC-Refers-To-Target-URI is probably nonsensical as it would always be the same as WARC-Target-URI.

Oh? Would it always be the same as WARC-Target-URI? That's another thing I didn't see specified in the proposal.

@ikreymer Yes brozzler+warcprox writes those youtube-dl metadata records.

My feeling is also that this isn't ready for standardization. I'd like to see some proven use cases and implementations. Nothing should prevent the feature from being implemented, even if it isn't standardized this time around.

elzikb commented 8 years ago

I am sorry if this has seemed to be an attempt to push something through that is not mature enough. I have no such intentions, and I do NOT think it should jeopardize a smooth acceptance of the minor changes in ISO. If that is the case, it should NOT go into revision 1.1 but be postponed to the next revision.

I do, however, also think that the comments make this look like a much bigger change than it is. As said at the top of the document, the change would cover:
• A new record type, representation (or difference)
• Two new fields: WARC-Representation-Algorithm (or WARC-Difference-Algorithm) and WARC-Representation-Digest (or WARC-Difference-Digest)
And edits to existing fields in order to include the above.

Coming back to the other comments:

RESEARCH ON THIS TOPIC: There has not been specific research on the way the record looks, but the need for it has been discussed and has been background for choosing and implementing WARC for our ordinary repository data. This has been published as an iPres paper, "Package Formats for Preserved Digital Material", where WARC won the “prize” as the best suited packaging format for preservation, partly because of the possibility to make a difference record. It should be noted that the need for it is specifically related to non-web material, although I think it may also become relevant for the revisit record, as described in issue #11.

REFERENCE TO ORIGINAL RECORD: In the proposal, there is no new WARC-Refers-To… field, only an extension of the original description of the WARC-Refers-To field. Therefore I see the point that ‘no one indexes WARCs by record id’ as a general point that would be relevant for any use of WARC-Refers-To (as the WARC standard does not require the different records to be in the same WARC file; a very similar example would be the conversion record type). In general, I certainly think that this is a legitimate point, especially for web data, but it is not specific to the new record. And I agree with Kris that adding an optional WARC-Refers-To-Date could be a good idea for meeting such issues in general.

THE ALGORITHM: If the WARC-Representation-Algorithm is f, then it is correctly understood that this algorithm takes two arguments and has the form d = f(a, b), where a is the payload of the record referred to by the WARC-Refers-To header, and b is a new version of it with minor changes (which could be a new fetch of the WARC-Target-URI). It is also correct that b can be computed from a and d by some other algorithm, b = g(a, d), where g is the inverse function of f, and I agree that this could be clearer.

THE NAMING OF THE RECORD: I can see that generalizing the ‘difference’ record as a ‘representation’ record causes more confusion than it is worth. My point was that the record might be used for algorithms other than difference algorithms, so that these would not need a new record definition just because of the name. I have tried to find some examples, e.g. encrypted records (due to the need for bit preservation where copies of confidential material are stored under different jurisdictions), but these do not hold. Thus there are no good cases for it yet. For preservation, it is important that the records are easily understandable, and if the naming confuses (and I cannot find good alternatives to representation), then it is better to have less generalization and more record names, even if the semantics are similar. Conclusion: It should be called “difference”.

THE IIPC WARC WORKSHOP: As Kris pointed out there is a WARC workshop on IIPC where this can be presented and discussed in more detail. Hopefully this ends up with a good discussion where we do not end up using several different record types for the same purpose ;)

saraaubry commented 6 years ago

This point was not resolved in the WARC 1.1 revision.