sul-dlss / web-archiving

placeholder for web archiving work

probs downloading AIT warcs #32

Open ndushay opened 7 years ago

ndushay commented 7 years ago

Part One:

See the Consul page https://consul.stanford.edu/display/WARC/Downloading+WARCs+from+Archive-It. Basically, from was-robos1-prod:/web-archiving-stage/app/ait_download, run the command indicated there.

No files were downloaded and there is nothing in the logs. The console shows:

Retreiving list of possible archive files to download...
log4j:WARN No appenders could be found for logger (com.gargoylesoftware.htmlunit.WebClient).
log4j:WARN Please initialize the log4j system properly.
Done retreiving list of possible archive files to download!
Analyzing list to determine which files have already been downloaded...
Total number of downloadable archive files found: 0
Number of NEW archive files to download: 0
Completing downloading the files. 
Total number of files: 0
Successfully downloaded: 0
Failed to downloaded: 0

Solution (for the log4j warnings and the missing log output): create a log4j.properties file at /web-archiving-stage/app/ait_download/lib/log4j.properties and add

-Dlog4j.configuration=file:///web-archiving-stage/app/ait_download/lib/log4j.properties

to the command line.

I set the logging level to DEBUG and sent the output to the console, so we got a big file (see /web-archiving-stage/app/ait_download/console.log.20170130.txt).
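For reference, a minimal log4j 1.x configuration along these lines might look like the sketch below. The actual file is not shown in this issue, so the appender name `console` and the conversion pattern are assumptions; the pattern was chosen only because it yields lines shaped like the ones in the console log (elapsed milliseconds, thread, level, logger name, message).

```properties
# Sketch only: the real log4j.properties used here is not shown in this issue.
# Root logger at DEBUG, writing to a single console appender (the name "console" is arbitrary).
log4j.rootLogger=DEBUG, console

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
# %r = ms since start, %t = thread, %p = level, %c = logger name, %m = message
log4j.appender.console.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n

# Optional: uncomment to quiet HttpClient's byte-level wire logging if the output is too large.
# log4j.logger.org.apache.http.wire=INFO
```

Since the appender writes to the console, the output would need to be redirected to a file to end up with something like console.log.20170130.txt. Note that DEBUG output from HttpClient's header logging can include request headers, so a log captured this way may contain credentials.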

ndushay commented 7 years ago

Part Two

Looking at the console output (per above) at /web-archiving-stage/app/ait_download/console.log.20170130.txt, we get a ton of stuff. This is probably most pertinent:

536  [main] DEBUG org.apache.http.wire  - http-outgoing-0 << "<p>The document has moved <a href="https://warcs.archive-it.org/cgi-bin/getarcs.pl?coll=5425">here</a>.</p>[\n]"

and a little further down:

689  [main] DEBUG org.apache.http.impl.auth.HttpAuthenticator  - warcs.archive-it.org:443 requested authentication
690  [main] DEBUG org.apache.http.impl.client.TargetAuthenticationStrategy  - Authentication schemes in the order of preference: [negotiate, Kerberos, NTLM, Digest, Basic]
690  [main] DEBUG org.apache.http.impl.client.TargetAuthenticationStrategy  - Challenge for negotiate authentication scheme not available
690  [main] DEBUG org.apache.http.impl.client.TargetAuthenticationStrategy  - Challenge for Kerberos authentication scheme not available
690  [main] DEBUG org.apache.http.impl.client.TargetAuthenticationStrategy  - Challenge for NTLM authentication scheme not available
690  [main] DEBUG org.apache.http.impl.client.TargetAuthenticationStrategy  - Challenge for Digest authentication scheme not available
697  [main] DEBUG org.apache.http.impl.auth.HttpAuthenticator  - Selected authentication options: [BASIC]
699  [main] DEBUG org.apache.http.wire  - http-outgoing-1 << "<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">[\n]"
699  [main] DEBUG org.apache.http.wire  - http-outgoing-1 << "<html><head>[\n]"
699  [main] DEBUG org.apache.http.wire  - http-outgoing-1 << "<title>401 Unauthorized</title>[\n]"
699  [main] DEBUG org.apache.http.wire  - http-outgoing-1 << "</head><body>[\n]"
699  [main] DEBUG org.apache.http.wire  - http-outgoing-1 << "<h1>Unauthorized</h1>[\n]"
699  [main] DEBUG org.apache.http.wire  - http-outgoing-1 << "<p>This server could not verify that you[\n]"
700  [main] DEBUG org.apache.http.wire  - http-outgoing-1 << "are authorized to access the document[\n]"
700  [main] DEBUG org.apache.http.wire  - http-outgoing-1 << "requested.  Either you supplied the wrong[\n]"
700  [main] DEBUG org.apache.http.wire  - http-outgoing-1 << "credentials (e.g., bad password), or your[\n]"
700  [main] DEBUG org.apache.http.wire  - http-outgoing-1 << "browser doesn't understand how to supply[\n]"
700  [main] DEBUG org.apache.http.wire  - http-outgoing-1 << "the credentials required.</p>[\n]"
700  [main] DEBUG org.apache.http.wire  - http-outgoing-1 << "<hr>[\n]"
700  [main] DEBUG org.apache.http.wire  - http-outgoing-1 << "<address>Apache/2.4.7 (Ubuntu) Server at warcs.archive-it.org Port 443</address>[\n]"

So next, we tried updating the URL hardcoded into hrwa_manager (https://github.com/sul-dlss/hrwa_manager/blob/master/src/main/java/edu/columbia/ldpd/hrwa/tasks/DownloadArchiveFilesFromArchivitTask.java#L34-L35). This was done in a branch, warcs-url. I then built the standalone jar using mvn install (it ends up in the target/ directory), scp'ed the new jar to was-downloader, and tried the same thing again.

ndushay commented 7 years ago

Part Three

I ran the same command using a standalone jar generated with the URL warcs.archive-it.org hardcoded in. Got a similar log with essentially the same errors.