internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

Do not require DNS when using a web proxy #211

Closed marhop closed 2 years ago

marhop commented 6 years ago

Hi,

I am trying to crawl a web domain from behind a web proxy in a corporate network using Heritrix build 3.3.0-20180727.011238-114, without success. Heritrix just hangs right after the crawl is started. I suppose this is caused by Heritrix taking a rather unusual approach to DNS queries when using a web proxy. Let me explain:

The DNS servers in our corporate network only resolve host names from our local network. (And I cannot use external DNS servers because of firewall rules.) That's OK because all external requests are routed through a web proxy anyway. A client tells the web proxy the (external) URL it wishes to access, and the web proxy takes care of everything, including DNS resolution (which is done by forwarding the request to a parent proxy that does the actual DNS resolution using other, "more knowledgeable" DNS servers).

Long story short, when using a web proxy, it's not necessary to query the DNS. Most tools do it this way, and "just work". (Example: curl makes DNS requests only when not using a web proxy.)
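To illustrate why a proxied client can skip DNS, here is a minimal Python sketch of the request head a client sends to a web proxy. It is not Heritrix code, and `proxy_request_head` is a hypothetical helper name chosen for this example. The point is that the target host name travels inside the request, so the proxy, not the client, is the one that has to resolve it.

```python
from urllib.parse import urlsplit


def proxy_request_head(url: str) -> str:
    """Return the request head a client would send to a web proxy for `url`.

    Hypothetical helper for illustration only. The client never looks up the
    target host; it only needs to be able to reach the proxy itself.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if parts.scheme == "https":
        port = parts.port or 443
        # HTTPS: ask the proxy to open a tunnel to host:port.
        # The proxy resolves the host name on the client's behalf.
        return f"CONNECT {host}:{port} HTTP/1.1\r\nHost: {host}:{port}\r\n\r\n"
    # Plain HTTP: send the absolute URL in the request line.
    # Again, the proxy is the one that resolves the host name.
    return f"GET {url} HTTP/1.1\r\nHost: {parts.netloc}\r\n\r\n"


if __name__ == "__main__":
    print(proxy_request_head("http://example.org/index.html"))
    print(proxy_request_head("https://example.org/index.html"))
```

In both cases the only address the client has to know is that of the proxy, which in a corporate network is typically resolvable by the internal DNS servers.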

However, Heritrix always makes DNS requests, regardless of its proxy configuration. If a DNS request fails, it does not even go on asking the proxy for the URL it tries to crawl, although the proxy could easily fetch the content. Instead, it just hangs.

So may I suggest that Heritrix should not query the DNS when it is configured to use a web proxy? Or if it does, that at least it should continue asking the web proxy even if the DNS request fails?

Probably related: #198

Thanks, Martin

ato commented 6 years ago

Hi Martin,

Heritrix does its own DNS lookups because it writes the DNS records and IP addresses to the WARC files. Other features, like geolocation and IP address decide rules, also depend on knowing the IP addresses. Ignoring hostnames that do not resolve, while perhaps not essential, likely also helps keep a certain amount of garbage URLs out of the queues early.

If you don't need those features and would like to try to modify Heritrix to work without DNS, a quick and dirty workaround might be a new implementation of org.archive.modules.net.ServerCache which returns a dummy IP address for every lookup. A proper solution would need an option to disable the DNS precondition check in PreconditionEnforcer and to modify the WARC and ARC writers to work without IP addresses. Right now they assume IP addresses are always available.

Cheers,

Alex

marhop commented 6 years ago

Hi Alex,

thanks for the quick and thorough explanation, greatly appreciated! Should I read your second paragraph more like "we're looking at it" or "we definitely won't do it ourselves, but wouldn't reject a decent pull request either"?

Thanks, Martin

ato commented 6 years ago

"we definitely won't do it ourselves, but wouldn't reject a decent pull request either"

I can't speak for all Heritrix contributors but I suspect the answer for most would be this one. ;-)

marhop commented 6 years ago

OK, good to know. Thanks again!

marhop commented 5 years ago

I tried to make some modifications but eventually gave up because I cannot judge the global implications of DNS queries and IP addresses in Heritrix without digging really deep into the codebase.

Here's a possible workaround, however, for anyone with a similar problem: Use DNS over HTTPS to query an external DNS server via your web (HTTPS) proxy. I achieved good results with dnss, which works well with a web proxy provided you use a recent version. The configuration details are specific to your environment, but the general idea is this: A tool like dnss can run as a service on your local machine, listening on localhost:53. If you configure your network settings to use 127.0.0.1 as your DNS server, all DNS queries, and particularly those made by Heritrix, will go to localhost:53, from where dnss forwards them via HTTPS (and thus via your web proxy, whose address it is able to pick up from environment variables) to an external DNS server like 1.1.1.1 (Cloudflare) or 8.8.8.8 (Google). That way it is possible to use an external DNS server regardless of firewall constraints blocking port 53.
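For readers who want to see what such a DNS over HTTPS request looks like, here is a minimal Python sketch of the RFC 8484 GET format. It is not how dnss is implemented. The function names are hypothetical, and the endpoint `https://cloudflare-dns.com/dns-query` is an assumption that was not part of the setup described above. `fetch_doh` relies on `urllib` picking up the proxy from environment variables such as `https_proxy`, and it needs network access, so it is only defined here, not called.

```python
import base64
import struct
import urllib.request

# Assumed DoH endpoint for this sketch; substitute whatever resolver you use.
DOH_ENDPOINT = "https://cloudflare-dns.com/dns-query"


def build_dns_query(hostname: str, qtype: int = 1) -> bytes:
    """Build a DNS wire-format query for `hostname` (qtype 1 = A record)."""
    # Header: ID 0 (as RFC 8484 suggests), flags 0x0100 (recursion desired),
    # one question, no answer/authority/additional records.
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    labels = hostname.rstrip(".").encode("ascii").split(b".")
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    # QTYPE followed by QCLASS 1 (IN).
    return header + qname + struct.pack("!HH", qtype, 1)


def doh_get_url(hostname: str, endpoint: str = DOH_ENDPOINT) -> str:
    """Return the DoH GET URL: base64url-encoded query without padding."""
    wire = build_dns_query(hostname)
    param = base64.urlsafe_b64encode(wire).rstrip(b"=").decode("ascii")
    return f"{endpoint}?dns={param}"


def fetch_doh(hostname: str) -> bytes:
    """Send the query and return the raw DNS response (requires network).

    urllib honours proxy environment variables such as https_proxy, so the
    request goes through the web proxy if one is configured there.
    """
    request = urllib.request.Request(
        doh_get_url(hostname),
        headers={"Accept": "application/dns-message"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()


if __name__ == "__main__":
    print(doh_get_url("www.example.com"))
```

Because the query is an ordinary HTTPS request, it passes through the web proxy like any other web traffic, which is why this approach works even when port 53 is blocked.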

ClemensRobbenhaar commented 2 years ago

I am not sure if it is a "decent" one, but here is a PR to add DoH support: https://github.com/internetarchive/heritrix3/pull/476