internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

"RIS already open for ToeThread..." exception during https pages crawl over proxy #191

Closed WI-IT closed 2 years ago

WI-IT commented 7 years ago

When I try to crawl HTTPS pages through a proxy with Heritrix 3, I get the following exception:

java.io.IOException: RIS already open for ToeThread #5: https://www.XXX/robots.txt
    at org.archive.io.RecordingInputStream.open(RecordingInputStream.java:84)
    at org.archive.util.Recorder.inputWrap(Recorder.java:185)
    at org.archive.modules.fetcher.FetchHTTPRequest$RecordingHttpClientConnection.getSocketInputStream(FetchHTTPRequest.java:648)
    at org.apache.http.impl.BHttpConnectionBase.ensureOpen(BHttpConnectionBase.java:131)
    at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestHeader(DefaultBHttpClientConnection.java:140)
    at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:203)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:254)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:195)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
    at org.archive.modules.fetcher.FetchHTTPRequest.execute(FetchHTTPRequest.java:751)
    at org.archive.modules.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:658)
    at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
    at org.archive.modules.Processor.process(Processor.java:142)
    at org.archive.modules.ProcessorChain.process(ProcessorChain.java:138)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)

marhop commented 5 years ago

I can confirm this. The exception is thrown only for HTTPS hosts; plain HTTP works fine with a proxy. What's worse, as soon as Heritrix encounters an HTTPS URL it runs into a -404 "Empty HTTP response interpreted as a 404" error. (This may be a coincidence, but the correlation looks suspicious enough to me.)

This could be related to iipc/webarchive-commons#64 where @kris-sigur hinted at a possible cause:

First thought is that when crawling HTTPS via proxy, Heritrix fails to properly close the RecordingInputStream

Looking at the source code, I have to admit that I have no idea where this happens (or whether it is in fact the cause of this behaviour), so I cannot offer a bugfix ... It would be great if someone else could! :smile:

Thanks, Martin

danielbicho commented 5 years ago

Any update on this? I am facing the same problem.

I notice several problems here:

CONNECT command problem

I noticed that Heritrix/HttpClient sends a malformed CONNECT request that some proxies do not accept. I tried Warcprox and Charles proxy, and both complain about it. Heritrix sends something like CONNECT sobre.arquivo.pt HTTP/1.0, which is wrong because the request should specify the port number: CONNECT sobre.arquivo.pt:443 HTTP/1.0. (Can someone confirm this? It is my interpretation of the specification.)

Changing the ROUTE_PLANNER in FetchHTTPRequest to specify the HttpHost port instead of passing -1 solves this problem: the CONNECT request is then sent correctly.
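To illustrate the expected request line, here is a minimal sketch in plain Java. It is not Heritrix or HttpClient code: the class ConnectTarget and its methods are hypothetical names, and the fallback to port 443 for https (80 otherwise) is an assumption about how an unspecified port (-1) would be resolved.

```java
import java.net.URI;

/** Hypothetical helper, for illustration only; not part of Heritrix or HttpClient. */
final class ConnectTarget {

    /** Returns the explicit port of the URI, or the scheme default when none is given. */
    static int effectivePort(URI uri) {
        if (uri.getPort() != -1) {
            return uri.getPort();
        }
        // Assumption: https maps to 443, anything else to 80.
        return "https".equalsIgnoreCase(uri.getScheme()) ? 443 : 80;
    }

    /** Builds a CONNECT request line whose target is host:port. */
    static String requestLine(URI uri) {
        return "CONNECT " + uri.getHost() + ":" + effectivePort(uri) + " HTTP/1.0";
    }

    public static void main(String[] args) {
        // prints "CONNECT sobre.arquivo.pt:443 HTTP/1.0"
        System.out.println(requestLine(URI.create("https://sobre.arquivo.pt/")));
    }
}
```

The point of the sketch is only that the port is resolved before the request line is built, so the proxy never sees a bare host name.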

The RIS already open problem.

What I concluded is that while opening a tunnel for HTTPS, HttpClient calls getSocketInputStream() twice, first wrapping a java.net.SocketInputStream and then wrapping a sun.security.ssl.AppInputStream. Heritrix has no way of knowing about this behaviour, since it delegates the connection operations to HttpClient.

Also, if I try to properly close the java.net.SocketInputStream before wrapping the sun.security.ssl.AppInputStream, it then complains that the socket is closed when it tries to write. The solution proposed in iipc/webarchive-commons#64 seems sufficient, and I agree that there is no need to throw an exception if the RecordingInputStream is already wrapping a stream.
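As a rough illustration of that idea, the sketch below shows a recording stream whose open() replaces an already wrapped stream instead of throwing. TolerantRecordingStream is a hypothetical stand-in written for this comment, not the actual RecordingInputStream from webarchive-commons and not the patch from that issue. Keeping the bytes recorded before the re-wrap is also just an assumption of the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a recording stream; not the webarchive-commons class. */
final class TolerantRecordingStream extends InputStream {
    private InputStream in; // currently wrapped stream, null when not open
    private final ByteArrayOutputStream recorded = new ByteArrayOutputStream();

    /** Wraps a stream. If one is already wrapped, it is replaced rather than rejected. */
    void open(InputStream wrapped) {
        // The previous stream is deliberately NOT closed: closing a socket's
        // input stream closes the socket, which the second wrapper still needs.
        this.in = wrapped;
    }

    @Override
    public int read() throws IOException {
        if (in == null) {
            throw new IOException("stream not open");
        }
        int b = in.read();
        if (b != -1) {
            recorded.write(b); // keep a copy of every byte that was read
        }
        return b;
    }

    @Override
    public void close() throws IOException {
        if (in != null) {
            in.close();
            in = null;
        }
    }

    String recordedText() {
        return new String(recorded.toByteArray(), StandardCharsets.US_ASCII);
    }

    /** Opens twice without closing in between, mimicking the double wrap. */
    static String demo() {
        try {
            TolerantRecordingStream ris = new TolerantRecordingStream();
            ris.open(new ByteArrayInputStream("AB".getBytes(StandardCharsets.US_ASCII)));
            ris.read(); // reads 'A' from the first stream
            ris.open(new ByteArrayInputStream("CD".getBytes(StandardCharsets.US_ASCII)));
            while (ris.read() != -1) {
                // drain the second stream
            }
            return ris.recordedText();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "ACD"
    }
}
```

Whether the real class should keep or discard what was recorded before the second wrap is a separate question that this sketch does not settle.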