Open ndushay opened 7 years ago
Looking at the console output (per above) at /web-archiving-stage/app/ait_download/console.log.20170130.txt, we get a ton of stuff. This is probably most pertinent:
536 [main] DEBUG org.apache.http.wire - http-outgoing-0 << "<p>The document has moved <a href="https://warcs.archive-it.org/cgi-bin/getarcs.pl?coll=5425">here</a>.</p>[\n]"
and a little further down:
689 [main] DEBUG org.apache.http.impl.auth.HttpAuthenticator - warcs.archive-it.org:443 requested authentication
690 [main] DEBUG org.apache.http.impl.client.TargetAuthenticationStrategy - Authentication schemes in the order of preference: [negotiate, Kerberos, NTLM, Digest, Basic]
690 [main] DEBUG org.apache.http.impl.client.TargetAuthenticationStrategy - Challenge for negotiate authentication scheme not available
690 [main] DEBUG org.apache.http.impl.client.TargetAuthenticationStrategy - Challenge for Kerberos authentication scheme not available
690 [main] DEBUG org.apache.http.impl.client.TargetAuthenticationStrategy - Challenge for NTLM authentication scheme not available
690 [main] DEBUG org.apache.http.impl.client.TargetAuthenticationStrategy - Challenge for Digest authentication scheme not available
697 [main] DEBUG org.apache.http.impl.auth.HttpAuthenticator - Selected authentication options: [BASIC]
699 [main] DEBUG org.apache.http.wire - http-outgoing-1 << "<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">[\n]"
699 [main] DEBUG org.apache.http.wire - http-outgoing-1 << "<html><head>[\n]"
699 [main] DEBUG org.apache.http.wire - http-outgoing-1 << "<title>401 Unauthorized</title>[\n]"
699 [main] DEBUG org.apache.http.wire - http-outgoing-1 << "</head><body>[\n]"
699 [main] DEBUG org.apache.http.wire - http-outgoing-1 << "<h1>Unauthorized</h1>[\n]"
699 [main] DEBUG org.apache.http.wire - http-outgoing-1 << "<p>This server could not verify that you[\n]"
700 [main] DEBUG org.apache.http.wire - http-outgoing-1 << "are authorized to access the document[\n]"
700 [main] DEBUG org.apache.http.wire - http-outgoing-1 << "requested. Either you supplied the wrong[\n]"
700 [main] DEBUG org.apache.http.wire - http-outgoing-1 << "credentials (e.g., bad password), or your[\n]"
700 [main] DEBUG org.apache.http.wire - http-outgoing-1 << "browser doesn't understand how to supply[\n]"
700 [main] DEBUG org.apache.http.wire - http-outgoing-1 << "the credentials required.</p>[\n]"
700 [main] DEBUG org.apache.http.wire - http-outgoing-1 << "<hr>[\n]"
700 [main] DEBUG org.apache.http.wire - http-outgoing-1 << "<address>Apache/2.4.7 (Ubuntu) Server at warcs.archive-it.org Port 443</address>[\n]"
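The excerpt above only shows incoming (<<) lines, so it does not tell us what credentials, if any, were actually sent. One way to check is to look for an Authorization header in the outgoing (>>) lines of the full log and compare it against the expected value. A minimal sketch of how a Basic header value is built, assuming nothing about hrwa_manager itself; the class name and the sample username/password are hypothetical placeholders, not real Archive-It credentials:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of hrwa_manager.
class BasicAuthHeader {

    // Builds the value of an HTTP "Authorization" header for the Basic scheme:
    // "Basic " followed by base64("username:password").
    static String basicAuthHeader(String user, String password) {
        String pair = user + ":" + password;
        byte[] bytes = pair.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        // Placeholder credentials (the example pair from the Basic auth RFC).
        System.out.println(basicAuthHeader("Aladdin", "open sesame"));
    }
}
```

If the header in the log decodes to something other than the expected username:password, the problem is on our side rather than with the URL.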
So next, we tried updating the URL hardcoded into hrwa_manager (https://github.com/sul-dlss/hrwa_manager/blob/master/src/main/java/edu/columbia/ldpd/hrwa/tasks/DownloadArchiveFilesFromArchivitTask.java#L34-L35). This was done in a branch, warcs-url. I then created the standalone jar using mvn install (it ends up in the target/ directory), scp'ed the new jar to was-downloader, and tried the same thing again.
I ran the same command using a standalone jar generated with the URL warcs.archive-it.org hardcoded in. Got a similar log with essentially the same errors.
Part One:
See Consul page https://consul.stanford.edu/display/WARC/Downloading+WARCs+from+Archive-It. Basically, from was-robos1-prod:/web-archiving-stage/app/ait_download, run the indicated command. No output, nothing in logs. Console shows:
Solution: create a log4j.properties file, put it in /web-archiving-stage/app/ait_download/lib/log4j.properties, and add -Dlog4j.configuration=file:///web-archiving-stage/app/ait_download/lib/log4j.properties to the command line.
I set the logging level to DEBUG and sent output to console, so we got a big file (see /web-archiving-stage/app/ait_download/console.log.20170130.txt).
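For reference, a minimal sketch of a log4j 1.x properties file that logs at DEBUG to the console. This is not the exact file on the server; the appender name "console" and the conversion pattern are assumptions, chosen to roughly match the line format seen in the console log:

```properties
# Root logger at DEBUG, writing to a single appender named "console" (name is arbitrary).
log4j.rootLogger=DEBUG, console

# Send log events to standard output.
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout

# Elapsed milliseconds, thread, level, logger name, message (assumed pattern).
log4j.appender.console.layout.ConversionPattern=%r [%t] %p %c - %m%n
```

DEBUG on the root logger turns on the org.apache.http.wire output, which is why the resulting file is so large.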