internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

Downloading the surrounding web pages #380

Closed krojtous closed 2 years ago

krojtous commented 3 years ago

I want to download one website and all the web pages linked from that website.

I have only one seed, "avcr.cz", and the settings are as follows:

maxHops: 5
maxTransHops: 1
maxSpeculativeHops: 1
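For reference, these limits correspond to properties on the TooManyHopsDecideRule and TransclusionDecideRule beans in my crawler-beans.cxml, roughly like this (a trimmed sketch of the scope section, not the full configuration):

<bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
  <property name="maxHops" value="5" />
</bean>
<bean class="org.archive.modules.deciderules.TransclusionDecideRule">
  <property name="maxTransHops" value="1" />
  <property name="maxSpeculativeHops" value="1" />
</bean>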

In this case, however, the crawler only downloads the surrounding web pages from a few sites, particularly YouTube and Spotify. Almost all web pages from ordinary websites are missing.

For example, in the "scope" log, there is this entry:

2021-05-04T11:35:57.143Z 0 RejectDecideRule REJECT http://academia.cz/

The academia.cz page is linked directly from the homepage of avcr.cz (which is where the crawler ends up after two redirects). So based on the maxTransHops setting, I would expect the crawler to accept it.

It looks to me like the crawler is only downloading embedded resources, but I don't know that for sure.

Is there any setting for the crawler to download all web pages linked from the downloaded website?

ato commented 3 years ago

My understanding is that the "trans" in maxTransHops stands for transclusion (which people often call 'embeds'), so it applies to stylesheets, images, etc., not to normal links.
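For example (a rough illustration, not taken from your logs): Heritrix records each URI's discovery path as a string of hop letters such as L (link), E (embed), R (redirect) and X (speculative embed). A hop like

  avcr.cz homepage --E--> some stylesheet or image

is a transclusion hop and is what maxTransHops governs, while

  avcr.cz homepage --L--> http://academia.cz/

is a normal link hop and is only limited by maxHops, so raising maxTransHops won't pull in academia.cz.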

Is there any setting for the crawler to download all web pages linked from the downloaded website?

I believe setting the alsoCheckVia property to true on the acceptSurts (SurtPrefixedDecideRule) bean does this. The documentation describes it as:

Whether to also make the configured decision if a URI's 'via' URI (the URI from which it was discovered) in SURT form begins with any of the established prefixes. For example, can be used to ACCEPT URIs that are 'one hop off' URIs fitting the SURT prefixes. Default is false.
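In the standard profile that would look roughly like this (a sketch assuming the default bean id acceptSurts in crawler-beans.cxml; adjust to your own configuration):

<bean id="acceptSurts" class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
  <!-- also ACCEPT a URI when the page it was discovered from matches the seed SURT prefixes -->
  <property name="alsoCheckVia" value="true" />
</bean>

That should make the crawler accept pages that are one link away from avcr.cz, such as http://academia.cz/, without pulling in anything further from those sites.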

krojtous commented 3 years ago

Thanks a lot!