w3c / dpub-pwp-ucr

Use Cases and Requirements for (Packaged) Web Publications
https://w3c.github.io/dpub-pwp-ucr/

8. Archiving... #131

Closed marcoscaceres closed 7 years ago

marcoscaceres commented 7 years ago

Since future consumers of publications represent the most open-ended user group, it is desirable that digital documents be instilled with more of the inherent durability that characterizes print artifacts.

The view about archivability of books presented in this section is archaic and romantic.

I don't want to go all techno-deterministic here, but this comparison is overly simplistic. Firstly, you can't really claim an "inherent durability that characterizes print artifacts" when we are talking about organic matter that, without special environmental conditions, rapidly decomposes, is susceptible to water damage, can be eaten by bugs, and is highly combustible.

Yes, both media require upkeep - but there is nothing inherently more durable about print than digital media. Digital media is infinitely reproducible, transferable at the speed of light, weightless, broadcastable, etc. Did I mention the internet was designed to survive a nuclear attack?... and you can access the internet from space :)

Anyway, the point is that print media doesn't possess magical qualities that make it more suited to archiving than digital formats. That's a romantic view of print, and demonstrably false for anyone who's ever lent a book to a friend and never got it back.

Req. 37: The locations of all PWP components should be discoverable.

This doesn't make sense, especially for ergodic texts (https://en.wikipedia.org/wiki/Ergodic_literature), i.e., texts that also include, or are, games, dynamic aspects, etc.

Req. 38: There should be a way to discover that a new version of one or more PWP components have been published.

This should be left up to authors - to notify readers (including bots) that a new version is available.

Req. 39: There should be a way to discover that one or more new components have been added to a PWP.

This is the halting problem. For a dynamic text, this is impossible without running through the whole text (which may be infinite) - and it assumes that the text has not changed between the time the bot starts reading and the time it reaches the end of the work.

Req. 40: There should be a way to discover that one or more PWP components have been removed from a PWP.

As above. This is just a duplicate of 39. Authors are best left to manage their publication's resources.

lrosenthol commented 7 years ago

@marcos - I don't understand why the idea that it be possible to locate all of the components of a PWP is problematic. It's no different (in concept) than the ability to spider a website; it would just make things easier if the list of items to spider were known in advance (instead of having to parse everything).

Perhaps it's the term "component" - I prefer the term "resource" myself. In other words, it's not about whether a script changed something in the DOM (or stored content via REST, etc.) but instead about whether new resources (e.g., images, CSS, etc.) were added to or deleted from the list mentioned above.
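For concreteness, a minimal sketch of what a known-in-advance list buys an archiver. The flat resource list below is hypothetical (no PWP manifest format is defined); given two snapshots of such a list, Reqs. 39 and 40 reduce to a set difference, with no need to spider or parse the content:

```python
# Hypothetical flat resource lists for a PWP; no real manifest format
# is assumed here.

def diff_resource_lists(old, new):
    """Report resources added to (Req. 39) and removed from (Req. 40)
    a publication between two snapshots of its resource list."""
    return {"added": sorted(new - old), "removed": sorted(old - new)}

snapshot_sep = {"index.html", "css/base.css", "img/cover.png"}
snapshot_oct = {"index.html", "css/base.css", "img/cover.png",
                "img/figure-1.svg"}  # one resource added since September

print(diff_resource_lists(snapshot_sep, snapshot_oct))
# -> {'added': ['img/figure-1.svg'], 'removed': []}
```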


TzviyaSiegman commented 7 years ago

@lrosenthol I agree that this could use a little clarification. I am having a little trouble understanding it. I'm concerned that there is too much insider lingo here.

marcoscaceres commented 7 years ago

@marcos - I don't understand why the idea that it be possible to locate all of the components of a PWP is problematic. It's no different (in concept) than the ability to spider a website; it would just make things easier if the list of items to spider were known in advance (instead of having to parse everything).

Parsing everything is the text. Consider ReSpec, the software that generates the document we are discussing: it dynamically inserts the references, the style sheets, etc. ReSpec has had 182 releases (https://github.com/w3c/respec/releases) that are independent of the markup that relies on ReSpec. The resources linked to by ReSpec constantly change (location, upgrades, deprecated and removed resources), but the authors of the document are none the wiser.

In other words, the author does not control where ReSpec pulls resources from. And if they try to guess, they would rapidly get that wrong because we (ReSpec maintainers or even the W3C) constantly change things.

Perhaps it's the term "component" - I prefer the term "resource" myself. In other words, it's not about whether a script changed something in the DOM (or stored content via REST, etc.) but instead about whether new resources (e.g., images, CSS, etc.) were added to or deleted from the list mentioned above.

In many cases, such as the ReSpec case, an author has no way of tracking this, and that is by design: whatever manifest they provided would be incomplete at best, and wrong at worst.

Of course, there are some texts that will be static - and ReSpec too supports this, with "Save to HTML"... but this robs the text of its dynamism.

iherman commented 7 years ago

Following up the ReSpec example: maybe the issue is that archiving is the equivalent of the "Save to HTML"; after all, storing the W3C documents in /TR is some form of archiving, because it provides a static but stable snapshot at a given time.

That being said, there is an extra potential insecurity even in the /TR version of a document: it relies on files (namely style sheets and logos) that are W3C-wide, not stored locally with the document, and the only reason they are stable is because W3C has a social pledge not to change those files retroactively. When I generate an EPUB version from a ReSpec file, I have to ensure that all those references are made local as well, to make the EPUB 100% self-consistent as far as the essential resources for that document are concerned. (I.e., I do not make local the links and the corresponding resources cited in the references.) In the case of ReSpec to EPUB these extra but essential links (i.e., the stylesheet references) are hardcoded in the code, but that is obviously not a general solution.
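A rough sketch of that "make the essential references local" step, assuming a saved ReSpec-generated HTML file; the naive regex is a stand-in for a real HTML parser, and the directory name is a placeholder:

```python
import re
import urllib.request
from pathlib import Path

def localize_stylesheets(html, out_dir="local-styles"):
    """Download external stylesheet references and rewrite their hrefs
    to local copies, so the document no longer depends on W3C-wide
    files staying in place. Sketch only: attribute order and other
    edge cases are ignored."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    def fetch_and_rewrite(match):
        url = match.group(1)
        local = out / url.rstrip("/").rsplit("/", 1)[-1]
        local.write_bytes(urllib.request.urlopen(url).read())
        return match.group(0).replace(url, str(local))

    pattern = r'<link[^>]*rel="stylesheet"[^>]*href="(https?://[^"]+)"'
    return re.sub(pattern, fetch_and_rewrite, html)
```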

I am not familiar with the archiving community (I leave this to @tcole3 and @nullhandle), but I have the impression that generating an EPUB from the ReSpec document is an example of what archiving would do (in the realm of EPUB) if we keep to the ReSpec example. And maybe the real goal of the requirements should be to make such steps easy (or easier...).

marcoscaceres commented 7 years ago

Following up the ReSpec example: maybe the issue is that archiving is the equivalent of the "Save to HTML"; after all, storing the W3C documents in /TR is some form of archiving, because it provides a static but stable snapshot at a given time.

Yes, but again, this has led to so many problems... "stable" being a misnomer, WHATWG vs. W3C and the whole living-standards debate, TR as "TRash", etc.

Trade-offs. There be dragons.

I am not familiar with the archiving community (I leave this to @tcole3 and @nullhandle), but I have the impression that generating an EPUB from the ReSpec document is an example of what archiving would do (in the realm of EPUB) if we keep to the ReSpec example. And maybe the real goal of the requirements should be to make such steps easy (or easier...).

The point being, some texts are well suited, and some texts are not well-suited, for archiving. That's the nature of hypertexts.

iherman commented 7 years ago

I think we should leave it to the archiving people to tell us what they would/should consider 'stable'. I certainly do not want to speak in their name.

It is probably true that not all WPs will be suited for archiving, at least not with 100% accuracy. Obviously, the requirements for archiving, say, a legal text (which will probably require 100% accuracy) will be different from those for archiving an ephemeral document that is meant to evolve. But if a generic WP structure gives authors/publishers the possibility to provide hints about what they consider important for archiving, we have made a big step forward. That is what we should aim for, imho.

lrosenthol commented 7 years ago

@marcoscaceres

Parsing everything is the text. Consider ReSpec, the software that generates the document we are discussing: it dynamically inserts the references, the style sheets, etc. ReSpec has had 182 releases (https://github.com/w3c/respec/releases) that are independent of the markup that relies on ReSpec. The resources linked to by ReSpec constantly change (location, upgrades, deprecated and removed resources), but the authors of the document are none the wiser.

Great example.

And that's why ReSpec would have its own "list of components", which would be maintained by its authors. And our document's list would reference the ReSpec list. This is one of the reasons for the "merge manifests" requirement (though I was trying to avoid the 'm' word :)).
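A sketch of how such referencing could work, with hypothetical list formats: the document's authors maintain only their own list, the ReSpec maintainers maintain theirs, and an archiver or spider consumes the merged view:

```python
document_list = {
    "owner": "document authors",
    "resources": ["index.html", "img/cover.png"],
}
respec_list = {
    "owner": "ReSpec maintainers",  # updated with each ReSpec release
    "resources": ["https://www.w3.org/StyleSheets/TR/2016/base.css"],
}

def merge_lists(*lists):
    """Union several component lists into the single view an archiver
    (or spider) would consume; each list stays independently owned."""
    merged = set()
    for lst in lists:
        merged.update(lst["resources"])
    return sorted(merged)

print(merge_lists(document_list, respec_list))
```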

Of course, there are some texts that will be static - and ReSpec too supports this, with "Save to HTML"... but this robs the text of its dynamism.

Agreed - and we don't want that (in some cases) - more on that below...

lrosenthol commented 7 years ago

@iherman

Following up the ReSpec example: maybe the issue is that archiving is the equivalent of the "Save to HTML"; after all, storing the W3C documents in /TR is some form of archiving, because it provides a static but stable snapshot at a given time.

Archiving does not require static content (as distinct from dynamic or interactive). Archiving requires that you have a copy of all known constituent assets/resources/components so that the content cannot change UNPREDICTABLY.

That being said, there is an extra potential insecurity even in the /TR version of a document: it relies on files (namely style sheets and logos) that are W3C-wide, not stored locally with the document, and the only reason they are stable is because W3C has a social pledge not to change those files retroactively.

External references are not a bad thing - even for archiving - when you know that the references themselves are archived. Also, another reason for the list of components is that you can change the list without changing the master document. This then allows the archived version of the document to point to the archived version of the stylesheets.
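A sketch of that last point, using an invented archive host: only the component list changes; the master document is untouched:

```python
# Hypothetical mapping from live URLs to archived copies; the archive
# host is invented for illustration.
ARCHIVED = {
    "https://www.w3.org/StyleSheets/TR/2016/base.css":
        "https://archive.example.org/w3c/2016/base.css",
}

def point_list_at_archive(resources):
    """Rewrite component-list entries to their archived equivalents,
    leaving unmapped entries (e.g., local files) untouched."""
    return [ARCHIVED.get(url, url) for url in resources]

print(point_list_at_archive(
    ["index.html", "https://www.w3.org/StyleSheets/TR/2016/base.css"]))
```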

When I generate an EPUB version from a ReSpec file, I have to ensure that all those references are made local as well, to make the EPUB 100% self-consistent as far as the essential resources for that document are concerned.

For EPUB that is true. For PWP, there is no such requirement. You can have external resources in a PWP (as currently "use cased").

lrosenthol commented 7 years ago


The point being, some texts are well suited, and some texts are not well-suited, for archiving. That's the nature of hypertexts.

That is incorrect.

EVERY text is suitable for archiving - or perhaps the word "snapshotting" would be more appropriate.

nullhandle commented 7 years ago

@marcoscaceres

I hadn't thought the introduction to the archiving section would be controversial. From my own experience working in libraries and archives for a decade, I don't think this is an exaggerated presentation of the different affordances of print and digital formats for preservation. That said, I don't think it's that vital to the draft, so rewrite if you'd like.

Regarding the archiving use cases generally, I'm coming from the perspective of an archiving organization that collects electronic scholarly publications by web harvest and file transfer from commercial publishing platforms. The requirements are informed by the specific challenges we encounter, described at length and with possible solutions in this recent paper.

Regarding Req. 37, it's true that this requirement won't work for works that are not discrete, by virtue of being interactive, programmatically generated, etc. My notion of PWPs is that they're discrete objects; otherwise, they couldn't be described by a manifest. Most of the objects we're interested in archiving are (for the moment) discrete, so we're hoping that PWP can fulfill this role.

Regarding Req. 38, an archiving service may be concerned with recording changes to the publication more granularly than those an author might explicitly choose to advertise. Moreover, it's more likely to be the publisher, rather than the author, that notifies of changes.

Regarding Req. 39, the platform on which the publication is hosted should have some notion of added resources. These could be communicated by a machine-readable feed (e.g., ResourceSync) without having to read the text.
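For illustration, a minimal ResourceSync change list (a sitemap-based XML feed, as defined by the ResourceSync framework) and the few lines needed to consume it; the resource URLs are hypothetical:

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# A tiny change list: the platform advertises what was created or
# deleted, so the archiver never has to read the text itself.
CHANGELIST = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"/>
  <url>
    <loc>http://example.com/pwp/img/figure-1.svg</loc>
    <rs:md change="created"/>
  </url>
  <url>
    <loc>http://example.com/pwp/css/old.css</loc>
    <rs:md change="deleted"/>
  </url>
</urlset>"""

root = ET.fromstring(CHANGELIST)
for url in root.findall("sm:url", NS):
    loc = url.find("sm:loc", NS).text
    change = url.find("rs:md", NS).get("change")
    print(change, loc)
# created http://example.com/pwp/img/figure-1.svg
# deleted http://example.com/pwp/css/old.css
```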

Regarding Req. 40, I don't think I was suggesting that authors shouldn't manage their publication's resources. I just don't want the fidelity of the archival copy to depend on the author's diligence in approving that notifications of changes to the publication be shared with the archiving service.

marcoscaceres commented 7 years ago

@nullhandle, I agree that out-of-band archiving APIs, manifests, etc. (e.g., ResourceSync) are important - and the recommendations in the paper, plus the good linking hygiene it discusses, make total sense.

I'm still worried that this group will attempt to replicate what ResourceSync already provides, however.

I'd be supportive of standardizing new link relationships in HTML, but anything not consumed by the reader should not be standardized at the W3C. The W3C should focus on browser stuff - other standards organizations should focus on formats.

iherman commented 7 years ago

I'd be supportive of standardizing new link relationships in HTML, but anything not consumed by the reader should not be standardized at the W3C. The W3C should focus on browser stuff - other standards organizations should focus on formats.

As I already commented elsewhere: I do not think I agree. The Web is larger than the browsers, and if there is stuff that needs standardization, is intimately related to the Web, and has no other organization dealing with it, then I think it is perfectly legitimate for the W3C to standardize it.

(Note that I said "there is no other organization dealing with it": clearly, any overlap and any sort of unhealthy competition with other organizations should be avoided.)

marcoscaceres commented 7 years ago

As I already commented elsewhere: I do not think I agree. The Web is larger than the browsers, and if there is stuff that needs standardization, is intimately related to the Web, and has no other organization dealing with it, then I think it is perfectly legitimate for the W3C to standardize it.

There is a long-brewing discussion to be had (not here!) about the scope of the W3C and whether the Web is actually larger than browsers. Irrespective, I'd still caution to keep the scope of this work very constrained and evolve it incrementally - especially if this eventually transitions to a WG.

iherman commented 7 years ago

There is a long-brewing discussion to be had (not here!) about the scope of the W3C and whether the Web is actually larger than browsers.

Indeed.

I'd still caution to keep the scope of this work very constrained and evolve it incrementally - especially if this eventually transitions to a WG.

We will have tedious work to do in chartering a WG indeed, and this issue, among many, will be part of the discussion... I hope we can count on you to comment on the draft charter! (I should say 'drafts'; i.e., there will be several iterations, I am sure!)

marcoscaceres commented 7 years ago

Absolutely! Looking forward to it! ❤️ A lot of good things will come out of this work. ❤️

lrosenthol commented 7 years ago

I did a bunch of major surgery on the Archiving section, so that it no longer duplicates material found elsewhere in the document or within itself. It also now links directly to other related pieces, such as manifests and metadata. I also tried to address some of the issues in the thread above.