internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

Heritrix not working behind proxy #316

Closed ArtHoff closed 1 year ago

ArtHoff commented 4 years ago

Hello,

I'm attempting to archive some of our agency sites and have run into this issue. The agency sites themselves do not go through our proxy, and their pages are archived fine. However, there is content that needs to be pulled in from other sites. These sites go through our proxy and are whitelisted. Fetching content from them works fine from a bash shell using wget, but in a Heritrix crawl these sites can't be reached: the requests time out. I then added our proxy details to crawler-beans.cxml. Now nothing is indexed, not even the agency sites. I then asked for our department sites to be made available through the proxy too, but still nothing gets indexed. Our network team tells me that all connections on the proxy are successful, yet Heritrix still times out. Is this a bug or user error? What do I need to do to make this work behind the proxy?

Thank you for any pointers you can give me.

ato commented 4 years ago

If the sites you are trying to crawl cannot be resolved through (local) DNS, then Heritrix is currently unable to archive them. See issue #211 for a discussion of the reason for this limitation, an outline of the changes that would need to be implemented for Heritrix to work in this situation, and a possible workaround.

Unfortunately, it sounds like in your case the sites are on some sort of private intranet without public DNS records. If so, the dns-over-https workaround suggested in #211 will likely not help you. If you do not have access to a working DNS server for these sites, I guess one workaround that might work in your situation is to configure Heritrix to run against a local DNS server with dummy records (e.g. DNS wildcards).
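
A quick way to see whether the DNS limitation described above applies is to check if the crawl machine can resolve the affected hostnames on its own, without going through the proxy. The following is a minimal sketch, not part of Heritrix: the class name `LocalDnsCheck` and the method `resolves` are made up for illustration, and it uses the JDK's default resolver, which may not behave exactly like the resolver Heritrix uses internally. Treat the result as a hint only.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper (not a Heritrix class): reports whether each hostname
// given on the command line can be resolved by the JVM's default resolver
// on this machine.
public class LocalDnsCheck {

    // Returns true if the host resolves locally, false if the lookup fails.
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String host : args) {
            String status = resolves(host) ? "resolves locally" : "NOT resolvable locally";
            System.out.println(host + " -> " + status);
        }
    }
}
```

On Java 11 or later this can be run directly as `java LocalDnsCheck.java <hostname> ...` on the machine that runs the crawl. Hostnames reported as not resolvable locally are the ones that would need a working local DNS server or one of the workarounds discussed in #211.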

ato commented 4 years ago

Actually re-reading this - the sites you're having problems with are public internet sites? Then the dns-over-https workaround might actually work for you.

anjackson commented 4 years ago

These aren’t HTTPS URLs are they? You might be hitting #191

ArtHoff commented 4 years ago

Hi there, thank you both for getting back to me so quickly. Yes, all these URLs are https. Most, if not all, of the content from the tools/frameworks we use is delivered via https nowadays. I suspect that http will be very rare in the near future with the way browser security is progressing.

I saw #191, but on our installation I don't see the error message that it mentions.

So I guess that means the workaround in #211 won't work for us and there is no other way to archive our sites using Heritrix in our environment at this time? I wonder what other options are available?

Thank you.