iipc / openwayback

The OpenWayback Development
http://www.netpreserve.org/openwayback
Apache License 2.0

Handling Truncated Records? #254

Open PsypherPunk opened 9 years ago

PsypherPunk commented 9 years ago

We've just noticed a few timeTrunc errors in our crawl logs and the resulting WARC-Truncated: time headers in our WARC records. Does OpenWayback handle these specifically?

At the moment it just seems to silently fail to render anything. I'm not sure, however, what it should be doing. For instance:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.---.com/images/---.jpg
WARC-Date: 2015-04-13T18:59:40Z
WARC-Payload-Digest: sha1:7VXVD5DY2ORDVSV3L46PFR2LS2BDU4ZN
WARC-IP-Address: ---.---.---.---
WARC-Truncated: time
WARC-Record-ID: <urn:uuid:cae241d0-6839-4bb1-a0c2-9ab65799b557>
Content-Type: application/http; msgtype=response
Content-Length: 1368

HTTP/1.1 200 OK
Date: Mon, 13 Apr 2015 18:59:41 GMT
Server: Apache
Last-Modified: Mon, 16 Apr 2012 09:16:55 GMT
ETag: "3d47296-dd8f-4bdc848972690"
Accept-Ranges: bytes
Content-Length: 56719
Connection: close
Content-Type: image/jpeg

ÿØÿà^@^PJFIF^@^...

Here we have a partial JPEG. Should OpenWayback return the partial, or treat it like a revisit and try to find the nearest record (although the hash would be wrong...)?
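For illustration, the mismatch in the record above can be checked mechanically. The following is a minimal Python sketch, not OpenWayback code: the function names (`parse_headers`, `summarize_truncation`) and the `SAMPLE` text are hypothetical, and it assumes the record is already available as decoded text rather than as raw WARC bytes. It reads the `WARC-Truncated` reason and compares the record's own `Content-Length` with the length the HTTP response declared.

```python
def parse_headers(block):
    """Parse 'Name: value' lines into a dict keyed by lowercased name.

    Lines without a colon (e.g. 'WARC/1.0' or 'HTTP/1.1 200 OK') are skipped.
    """
    headers = {}
    for line in block.splitlines():
        if ":" in line:
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
    return headers


def _to_int(value):
    """Return int(value), or None if the header was missing or malformed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def summarize_truncation(record_text):
    """Summarize truncation hints found in one WARC response record (as text)."""
    # Normalize line endings so blank-line splitting works for CRLF input too.
    text = record_text.replace("\r\n", "\n")
    warc_block, _, http_part = text.partition("\n\n")
    http_block, _, _body = http_part.partition("\n\n")
    warc = parse_headers(warc_block)
    http = parse_headers(http_block)
    return {
        # None when the record carries no WARC-Truncated header.
        "reason": warc.get("warc-truncated"),
        # Size of the stored HTTP message (headers plus captured body).
        "record_length": _to_int(warc.get("content-length")),
        # Size the origin server said the body would be.
        "declared_payload_length": _to_int(http.get("content-length")),
    }


# Hypothetical, abbreviated version of the record quoted above.
SAMPLE = (
    "WARC/1.0\n"
    "WARC-Type: response\n"
    "WARC-Truncated: time\n"
    "Content-Type: application/http; msgtype=response\n"
    "Content-Length: 1368\n"
    "\n"
    "HTTP/1.1 200 OK\n"
    "Content-Length: 56719\n"
    "Content-Type: image/jpeg\n"
    "\n"
    "(partial JPEG bytes)"
)

if __name__ == "__main__":
    print(summarize_truncation(SAMPLE))
```

In the quoted record the stored HTTP message is 1368 bytes while the server declared a 56719-byte body, so the payload is short regardless of how the replay tool decides to present it.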

nlevitt commented 9 years ago

I vote for returning the partial. Could be useful in the case of audio streams, for example. We happened to run into such a capture today, where we got a 2-hour snapshot of http://mozart.wkar.msu.edu/wkar-fm-mp3

kris-sigur commented 9 years ago

Wouldn't it make more sense to display an interstitial page explaining the issue and offering a link to the partial content? If we just present the partial content without any context, users are likely to conclude that there is an issue with the archive service.

anjackson commented 9 years ago

This is another one of those times I'd prefer an iframe approach, as the fact that this is known to be damaged from capture could be relayed around the edge.

If that's too difficult, I'd rather we provided an interstitial, but either way, I think we should be able to return the item as-is rather than hide it.

nlevitt commented 9 years ago

Sure, an interstitial is a good idea, although my guess is that truncated URLs are usually not gonna be HTML pages viewed at the top level. The examples mentioned here are a JPEG and an audio stream, so the interstitial wouldn't come into play in these cases (right?)

kris-sigur commented 9 years ago

No, at least not when viewing them as embedded resources. Only when they are accessed directly.
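The behaviour discussed in this thread could be sketched roughly as follows. This is only an illustration in Python of the proposed policy, not anything OpenWayback implements: `choose_response` and its return labels are hypothetical, and how a replay tool would actually know whether a request is top-level or embedded is left as an assumption (the `is_top_level` flag).

```python
def choose_response(truncated_reason, is_top_level):
    """Pick how to replay a capture under the policy proposed in this thread.

    truncated_reason: value of the WARC-Truncated header, or None if absent.
    is_top_level: True when the user requested the capture directly
                  (assumed to be known by the caller), False when it is
                  loaded as an embedded resource of another page.
    """
    if truncated_reason is None:
        # Complete capture: replay as usual.
        return "serve"
    if is_top_level:
        # Explain the truncation and link to the partial content.
        return "interstitial"
    # Embedded resources (images, audio) get the partial bytes as-is.
    return "serve-partial"


if __name__ == "__main__":
    print(choose_response("time", is_top_level=True))
    print(choose_response("time", is_top_level=False))
    print(choose_response(None, is_top_level=True))
```

Under this sketch the partial content is never hidden: the interstitial only adds context in front of it when the truncated capture is accessed directly.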