iipc / webarchive-commons

Common web archive utility code.
Apache License 2.0

HTTPS via a Proxy #64

Open PsypherPunk opened 7 years ago

PsypherPunk commented 7 years ago

I've been trying to crawl an HTTPS site through a Squid proxy and keep seeing errors like this:

java.io.IOException: RIS already open for ToeThread #12: https://XXX/robots.txt
   at org.archive.io.RecordingInputStream.open(RecordingInputStream.java:84)
   at org.archive.util.Recorder.inputWrap(Recorder.java:185)
   at org.archive.modules.fetcher.FetchHTTPRequest$RecordingHttpClientConnection.getSocketInputStream(FetchHTTPRequest.java:649)
   at org.apache.http.impl.BHttpConnectionBase.ensureOpen(BHttpConnectionBase.java:131)

HTTP sites are fine, but HTTPS just doesn't seem to work. The problem seems to come down to RecordingInputStream and RecordingOutputStream, both of which throw an IOException if their underlying stream is already set (i.e. != null).

If, however, I comment out those checks, the HTTPS crawl works perfectly (as far as I can tell...). I'm not sure whether this is the webarchive-commons library being overly cautious or heritrix3 failing to do something for HTTPS sites.
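To make the check described above concrete, here is a minimal sketch of an "already open" guard. It is not the webarchive-commons source: GuardedRecordingStream, GuardDemo and secondOpenFails are hypothetical names, and the only behaviour taken from this report is that opening while the underlying stream is non-null throws an IOException.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for a recording stream; NOT the webarchive-commons class.
class GuardedRecordingStream {
    private InputStream in; // null means "not currently open"

    // Refuses to wrap a new stream while a previous one is still attached.
    void open(InputStream wrapped) throws IOException {
        if (in != null) {
            throw new IOException("RIS already open for "
                    + Thread.currentThread().getName());
        }
        in = wrapped;
    }

    // Detaches the wrapped stream so the object can be reused.
    void close() throws IOException {
        if (in != null) {
            in.close();
            in = null;
        }
    }

    boolean isOpen() {
        return in != null;
    }
}

public class GuardDemo {
    // Returns true if a second open() without an intervening close() is rejected.
    static boolean secondOpenFails() throws IOException {
        GuardedRecordingStream ris = new GuardedRecordingStream();
        ris.open(new ByteArrayInputStream(new byte[0]));
        try {
            ris.open(new ByteArrayInputStream(new byte[0]));
            return false;
        } catch (IOException expected) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("second open rejected: " + secondOpenFails());
    }
}
```

With a guard like this, any code path that opens the stream twice for one connection, or never closes it after the previous fetch, fails in the way the stack trace shows. Commenting out the guard hides the symptom but would also hide a genuinely missing close().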

kris-sigur commented 7 years ago

My first thought is that, when crawling HTTPS via a proxy, Heritrix fails to properly close the RecordingInputStream (these are thread-local).
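A rough illustration of why a missed close would matter for a thread-local recorder: the sketch below is an assumption-laden toy, not Heritrix code. ThreadLocalRecorderDemo, Recorder, leakyFetch and safeFetch are hypothetical names, and the "early return" merely stands in for whatever path might skip the close in the proxy case.

```java
import java.io.IOException;

public class ThreadLocalRecorderDemo {
    // Hypothetical per-thread recorder: tracks only whether it is open.
    static class Recorder {
        private boolean open;

        void open(String uri) throws IOException {
            if (open) {
                throw new IOException("already open: " + uri);
            }
            open = true;
        }

        void close() {
            open = false;
        }
    }

    // One recorder per worker thread, reused across fetches.
    static final ThreadLocal<Recorder> RECORDER =
            ThreadLocal.withInitial(Recorder::new);

    // Leaky fetch: an early return skips close().
    static void leakyFetch(String uri, boolean exitEarly) throws IOException {
        Recorder r = RECORDER.get();
        r.open(uri);
        if (exitEarly) {
            return; // recorder stays open for the next URI on this thread
        }
        r.close();
    }

    // Safe fetch: close() runs on every path once open() has succeeded.
    static void safeFetch(String uri, boolean exitEarly) throws IOException {
        Recorder r = RECORDER.get();
        r.open(uri);
        try {
            if (exitEarly) {
                return;
            }
        } finally {
            r.close();
        }
    }

    // True if the next fetch on this thread hits the "already open" guard.
    static boolean nextFetchFails(boolean leaky) throws IOException {
        RECORDER.remove(); // start from a fresh recorder
        if (leaky) {
            leakyFetch("https://example.invalid/a", true);
        } else {
            safeFetch("https://example.invalid/a", true);
        }
        try {
            safeFetch("https://example.invalid/b", false);
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("leaky: next fetch fails = " + nextFetchFails(true));
        System.out.println("safe:  next fetch fails = " + nextFetchFails(false));
    }
}
```

Because the recorder lives as long as the thread, a single skipped close() affects every later fetch on that thread, which would fit an error that names the ToeThread rather than a specific connection.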