ukwa / webarchive-explorer

Tools for exploring the contents of web archive files.

Explorer redirects to public URLs, does not extract versions of resources from WARC files #5

Open mjordan opened 11 years ago

mjordan commented 11 years ago

This is a real n00b question. Sorry if I'm missing something obvious.

I've pointed the Explorer at a set of WARC and ARC files and can get results back from my local Wayback machine query interface. However, the links to the resources returned from a successful query redirect to the resource's original public URL (via Redirect.jsp). I expected the WARC Explorer to serve up the archived version contained in the WARC file.

Is there a way to configure the Explorer to serve up the versions of the resource archived in the WARC and not redirect to the original URL?

anjackson commented 11 years ago

Can you give me some of the URLs you're visiting? Here's an example that's working for me:

http://localhost:18080/wayback/20120118143132/http://en.wikipedia.org/wiki/Main_Page

If you try that URL, you should get a Resource Not In Archive error (unless you also have a copy of this Wikipedia page in your archives).

Note that the service on port 18090 is meant to be used as a web proxy, i.e. you should alter your browser settings to route all traffic through it, and the latest archived version of each resource will then appear at its original URL.
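To make the replay URL pattern above explicit, here is a minimal sketch in Python. It assumes the layout seen in this thread (base path, then a 14-digit timestamp or `*`, then the original URL); the function name `wayback_url` and its defaults are made up for illustration and are not part of the Explorer.

```python
def wayback_url(target, timestamp="*", base="http://localhost:18080/wayback"):
    """Build a Wayback-style URL for an archived resource.

    target    -- the original public URL of the resource
    timestamp -- a capture timestamp (YYYYMMDDhhmmss), or "*" for the
                 query form seen in this thread
    base      -- assumed location of the local Wayback replay service
                 (port 18080 here, not the 18090 proxy port)
    """
    return f"{base.rstrip('/')}/{timestamp}/{target}"


if __name__ == "__main__":
    print(wayback_url("http://en.wikipedia.org/wiki/Main_Page", "20120118143132"))
```

A URL built this way goes to the replay service directly, so no browser proxy settings are involved.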

mjordan commented 11 years ago

OK, clear case of PEBCAK - I was using the proxy port, not 18080. Sorry about that.

However, performing queries at port 18080 (resulting in Wayback URLs like http://localhost:18080/wayback/*/http://drupalib.interoperating.info/taxonomy/term/10), I'm still not getting the WARC version of the resource. Instead, when I click on the link in the resulting date grid, I get the error

Resource Not Available

The Resource you have requested is temporarily unavailable. Please try again later.

The URL I'm requesting has a WARC-Target-URI entry in my WARC file, which is available at https://dl.dropboxusercontent.com/u/1015702/drupalib.interoperating.info.warc.gz if you want to try it yourself.

I did see some log entries in my cmd console that might be relevant:

When I run the Explorer against an unzipped WARC file, it appears to index it without a problem. However, when I request a resource like the one above, the following error is logged:

INFO: Runtime Error
org.archive.wayback.exception.ResourceNotAvailableException: C:\hacking\warcs\warc\drupalib.interoperating.info.warc - java.util.zip.ZipException: Not in GZIP format
        at org.archive.wayback.resourcestore.LocationDBResourceStore.retrieveResource(LocationDBResourceStore.java:96)

Suspecting that the WARCs need to be compressed, I ctrl-c'ed the Explorer and replaced my WARC file with the original .gz created by wget (i.e., the one available for download above). But then, on indexing, the following error is logged:

INFO: Queued drupalib.interoperating.info.warc.gz for indexing.
May 24, 2013 9:09:13 AM org.archive.wayback.resourcestore.indexer.IndexWorker doWork
INFO: Indexing drupalib.interoperating.info.warc.gz from C:\hacking\warcs\warcs\drupalib.interoperating.info.warc.gz
May 24, 2013 9:09:13 AM org.archive.wayback.resourcestore.indexer.IndexWorker doWork
SEVERE: FAILED to index or upload (drupalib.interoperating.info.warc.gz)
java.io.IOException: C:\hacking\warcs\warcs\drupalib.interoperating.info.warc.gz is not a WARC file.

If I'm reading these two errors correctly, they seem to contradict each other: querying expects the WARC to be gzipped, while indexing expects the WARC to be uncompressed (or at least doesn't recognize the .gz compressed version as a WARC file).
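One way to see which of the two errors applies to a given file is to inspect its first bytes. The sketch below is not taken from the Explorer or Wayback; it relies only on two general facts: gzip streams begin with the bytes `1f 8b`, and a WARC record begins with a `WARC/` version line. The helper name `sniff_warc` is hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def sniff_warc(path):
    """Return (is_gzipped, looks_like_warc) for the file at `path`.

    is_gzipped     -- True if the file starts with the gzip magic bytes
    looks_like_warc -- True if the (decompressed) content starts with
                       b"WARC/", the beginning of a WARC version line
    """
    with open(path, "rb") as raw:
        is_gzipped = raw.read(2) == GZIP_MAGIC

    # Read the first few content bytes, decompressing if necessary.
    opener = gzip.open if is_gzipped else open
    with opener(path, "rb") as content:
        looks_like_warc = content.read(5) == b"WARC/"

    return is_gzipped, looks_like_warc
```

For example, `sniff_warc(r"C:\hacking\warcs\warcs\drupalib.interoperating.info.warc.gz")` would report whether that file is really gzip-compressed and whether its first record starts like a WARC record. It does not validate the rest of the file, so a reader can still reject a file that passes this check.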

anjackson commented 11 years ago

Are you caching the index between attempts (via the -i flag)? You'll need to make sure the index DB is deleted between runs; assuming it is, the uncompressed WARC should work fine.

Unfortunately, due to a bug in wget, its compressed WARCs are not compatible with the WARC readers I'm currently using. IIRC, this came up in issue #1.

I'll have a go with your test file when I have time, but that might be next week.

anjackson commented 11 years ago

Actually the GZip issue was in ukwa/warc-discovery#1.

mjordan commented 11 years ago

Hi Andy, I started with a fresh index this morning (specified -i "path\to\folder"). Requests for pages under http://drupalib.interoperating.info/ (e.g., http://localhost:18080/wayback/20130524141558/http://drupalib.interoperating.info/node/256) using the uncompressed WARC file are all coming up with the error "Resource Not Available / The Resource you have requested is temporarily unavailable. Please try again later."

On each request, I also saw the same runtime error as before scroll past in the console: "Not in GZIP format."

anjackson commented 11 years ago

Hm, odd. It seems the system does not support uncompressed WARCs (not sure where I got my false memory from), and the latest version of Wayback (1.8.0-SNAPSHOT) is not able to parse your WARC file. I have reverted to 1.7.1-SNAPSHOT, which seems to play back your original compressed WARC just fine. I've updated the snapshot release, and if you refer to the README you'll find a new link to this latest version.

mjordan commented 11 years ago

Andy, latest version works as intended on my gzipped WARC. Thanks very much for working through this with me.