internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

Questions about TransclusionDecideRule #496

Closed: cgr71ii closed this 2 years ago

cgr71ii commented 2 years ago

Hi! I have some questions about TransclusionDecideRule which I'd appreciate if someone could answer:

The problem, I think, is that I don't understand the "trans" and "speculative" hops very well, even though I've read the wiki page about them.

Thank you!

ato commented 2 years ago

The first important thing to understand is that TransclusionDecideRule, as used in the default config, is an ACCEPT rule, not a REJECT rule. This means it allows URIs that would otherwise be rejected to be accepted. In other words, it strictly only widens the scope. If a URI is already accepted by another rule, such as by being in SURT scope, it will have no effect on it.

For the purposes of the maxTransHops setting a transclusion hop is any hop that is not a regular navigation link ('L'), a form submission ('S') or a site-map ('M') link.

A speculative hop (X) is where Heritrix finds something that looks like a URL in JavaScript source. Heritrix is not able to understand JavaScript code, so it's speculating that it might be a URL based on some simple heuristics.
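
To make that concrete, here's a small illustrative sketch of the classification (a paraphrase of the definitions above, not code from Heritrix; 'E' for an ordinary embed and 'R' for a redirect are two of the other hop letters that come up later in this thread):

    // Illustrative only: a paraphrase of the hop categories described above,
    // not code taken from Heritrix itself.
    //   'L' = navigation link, 'S' = form submission, 'M' = site-map link,
    //   'E' = embedded resource, 'X' = speculative (guessed from JavaScript),
    //   'R' = redirect.
    static boolean isNavigationHop(char hop) {
        // Regular navigation hops are the ones that do NOT count as transclusion.
        return hop == 'L' || hop == 'S' || hop == 'M';
    }

    static boolean isSpeculativeHop(char hop) {
        // Speculative hops count towards maxSpeculativeHops (and, being
        // non-navigation hops, towards maxTransHops as well).
        return hop == 'X';
    }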

Does the option maxTransHops with the value 0 mean that only text or similar will be downloaded? If so, I understand that a value of 0 might be counterproductive since we might be losing some URIs, is that right?

In the default configuration, if maxTransHops is set to 0 then any URIs outside the SURT scope will be excluded.

For example imagine a page at http://site1/pages/home.html containing:

<img src=/pages/1.jpg>
<img src=/images/2.jpg>
<img src=http://site2/3.jpg>
<a href=/pages/4.html>
<a href=/other/5.html>
<a href=http://site2/6.html>

If maxTransHops is at its default value of 2 then the following URIs will be visited:

http://site1/pages/1.jpg
http://site1/images/2.jpg
http://site2/3.jpg
http://site1/pages/4.html

If maxTransHops is 0 then the following URIs will be visited:

http://site1/pages/1.jpg
http://site1/pages/4.html

Does the option maxSpeculativeHops with the value 0 mean that only URIs from the same authority will be downloaded (ignoring, for the purpose of the question, the rest of the decide rules)?

Setting maxSpeculativeHops to 0 will mean the TransclusionDecideRule will have no effect on URIs with X (speculative JavaScript) hops anywhere in their hop path (as defined above). In the example I showed above it will have no impact at all as there's no JavaScript in the page.

The option maxSpeculativeHops with a value greater than 0 will get URIs if the maximum number of hops doesn't exceed the provided value, but will the URIs from the downloaded documents be included in the crawl?

I don't really understand this question.

The purpose of the TransclusionDecideRule in the default config is to capture transcluded content, such as the images and stylesheets needed to render an HTML page that's in scope, even if the transcluded content itself would otherwise be out of scope.

My assumption is that the purpose of the maxSpeculativeHops setting is to prevent it from doing this when too much speculative JavaScript extraction is involved, as speculatively extracted URIs can often themselves be HTML pages (error pages etc.), which can result in the inclusion of a whole chain of irrelevant junk. But I'm just guessing. I don't think I would have designed it like that myself.

Would the options maxTransHops and maxSpeculativeHops with the value 0 lead to a crawl where only text, or similar but not rich-media content, would be downloaded from pages of the same authority?

Setting maxTransHops to 0 is effectively the same as disabling the TransclusionDecideRule entirely. It means scope will be determined strictly by the other decide rules which in the default config means only the acceptSurts rule and PrerequisiteAcceptDecideRule (robots.txt and DNS fetches).

The problem, I think, is that I don't understand the "trans" and "speculative" hops very well, even though I've read the wiki page about them.

I'm not surprised, as that wiki page seems wrong or at least very misleading when it comes to speculative hops and the maxSpeculativeHops setting. (I know it has my name on it, but that's because I migrated it from an older wiki. I didn't write most of the wiki pages.)

ato commented 2 years ago

It's hard to explain the relationship between the two settings in prose. If you can read Java code, I recommend looking at it directly:

        // too many speculative hops disqualify from transclusion
        if (specCount > getMaxSpeculativeHops()) {
            return false;
        }

        // transclusion applies as long as non-ref hops less than max
        return nonrefCount <= getMaxTransHops();

maxSpeculativeHops is not independent of maxTransHops. If maxTransHops is 0 the rule will never match even if maxSpeculativeHops is positive.
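
For illustration, here is a rough reconstruction of the whole check, based on the excerpt above and the behaviour described elsewhere in this thread (a sketch, not the verbatim Heritrix source):

    // Reconstructed sketch of TransclusionDecideRule's test, pieced together
    // from the excerpt above and the behaviour discussed in this thread, not
    // copied from the Heritrix source.
    // hopsPath is the URI's hop path ("path from seed"), e.g. "LE", "LXE", "LR".
    static boolean looksLikeTransclusion(String hopsPath,
            int maxTransHops, int maxSpeculativeHops) {
        int allCount = 0;     // all trailing non-navigation hops
        int nonrefCount = 0;  // trailing non-navigation hops that are not redirects
        int specCount = 0;    // trailing speculative ('X') hops
        for (int i = hopsPath.length() - 1; i >= 0; i--) {
            char c = hopsPath.charAt(i);
            if (c == 'L' || c == 'S' || c == 'M') {
                break;            // a regular navigation hop ends the counting
            }
            allCount++;
            if (c != 'R') {
                nonrefCount++;    // redirect hops don't count (hence "nonref")
            }
            if (c == 'X') {
                specCount++;
            }
        }
        if (allCount == 0) {
            return false;         // path ends in a plain navigation hop: not a transclusion
        }
        // too many speculative hops disqualify from transclusion
        if (specCount > maxSpeculativeHops) {
            return false;
        }
        // transclusion applies as long as non-ref hops less than max
        return nonrefCount <= maxTransHops;
    }

Taking maxTransHops = 2 and maxSpeculativeHops = 1 as example values: a hop path like LEE matches (nonrefCount = 2), LEEE does not (3 > 2), and LXX does not (specCount = 2 > 1) regardless of maxTransHops.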

cgr71ii commented 2 years ago

First things first: thank you for the detailed explanation! :)

I've been analyzing the code, and I think I didn't understand the example you proposed very well. If I'm not mistaken, if we set "http://site1/pages/home.html" as the seed, its hop path would be "L", the links which are images would be "LE" (link + embed), and the plain links would be "LL" (link + link). Then TransclusionDecideRule would be applied sequentially to every URI discovered from our seed.

I understand why you said that maxTransHops and maxSpeculativeHops are not independent, but I don't think it's true that setting maxTransHops to 0 means the rule is never applied: it can still match URIs that came from redirects, because of c != Hop.REFER.getHopChar(). Not important, just an observation.

It is very likely that I didn't understand the explanation even though it was very good! Now I understand the 2 different types of hops which are used in this module. Thank you! Now I understand that some of the questions I asked made no sense at all. Sorry for that. I thought this decide module was intended for something else, but now I get that it's intended to try to get URIs from embedded elements or similar, and the speculative limit is just there to avoid speculating too far.

cgr71ii commented 2 years ago

I've noticed that the default configuration contains:

    <!-- ...but REJECT those more than a configured link-hop-count from start... -->
    <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
     <!-- <property name="maxHops" value="20" /> -->
    </bean>
    <!-- ...but ACCEPT those more than a configured link-hop-count from start... -->
    <bean class="org.archive.modules.deciderules.TransclusionDecideRule">

Since the crawl scope rules are applied sequentially, can't it happen that TooManyHopsDecideRule rejects a URI and TransclusionDecideRule accepts it again? I know this behaviour is intended by the rules, but shouldn't the order be changed, since TooManyHopsDecideRule is more "aggressive" and should prevail over TransclusionDecideRule?

ato commented 2 years ago

My apologies. I neglected to mention the SURT scope in my example. If the seed URL was http://site1/pages/home.html and the SURT rules were generated automatically from the seeds (the default) then the automatic SURT scope would be http://(site1,)/pages/. The URIs would thus be accepted by the acceptSurts rule not the TransclusionDecideRule.
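
In case the automatic prefix derivation is unfamiliar, a hypothetical little helper (not the real SurtPrefixedDecideRule code, which handles many more cases) that produces the same kind of seed-derived prefix might look like this:

    import java.net.URI;

    // Hypothetical illustration of a seed-derived SURT prefix (not the real
    // SurtPrefixedDecideRule implementation): the host labels are reversed and
    // comma-joined, and the seed's path is truncated after its last '/'.
    class SurtPrefixSketch {
        static String seedToSurtPrefix(String seedUrl) {
            URI uri = URI.create(seedUrl);
            String[] labels = uri.getHost().split("\\.");
            StringBuilder host = new StringBuilder();
            for (int i = labels.length - 1; i >= 0; i--) {
                host.append(labels[i]).append(',');
            }
            String path = uri.getPath();
            String pathPrefix = path.substring(0, path.lastIndexOf('/') + 1);
            return uri.getScheme() + "://(" + host + ")" + pathPrefix;
        }

        public static void main(String[] args) {
            // Prints: http://(site1,)/pages/
            System.out.println(seedToSurtPrefix("http://site1/pages/home.html"));
        }
    }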

I don't think it's true that setting maxTransHops to 0 means the rule is never applied: it can still match URIs that came from redirects, because of c != Hop.REFER.getHopChar(). Not important, just an observation.

Huh. That's a good point. So you're suggesting a hop path like LR would result in allCount = 1 and nonrefCount = 0, which would match even if maxTransHops is 0. So in order to prevent fetching of offsite redirects you'd actually need to set maxTransHops to -1 or disable the rule. That seems quite subtle and surprising.

Since the crawl scope rules are applied sequentially, can't it happen that TooManyHopsDecideRule rejects a URI and TransclusionDecideRule accepts it again? I know this behaviour is intended by the rules, but shouldn't the order be changed, since TooManyHopsDecideRule is more "aggressive" and should prevail over TransclusionDecideRule?

My guess is that this is intentional. It's not uncommon to do shallow crawls by setting maxHops to a small value like 1 or 2, and in the typical web archiving use case Heritrix was designed for, you wouldn't want to capture HTML pages without the embedded resources needed to render them.
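
For anyone wondering how a later rule can override an earlier one at all: the scope rules run in order, and the last rule to give a decisive answer wins, while rules that don't match leave the previous decision alone. A minimal sketch of that mechanism (an illustration of the general idea, not the actual DecideRuleSequence source):

    // Minimal sketch of "rules run in order, last decisive answer wins".
    // An illustration of the general mechanism, not the actual
    // DecideRuleSequence source. NONE means "no opinion, keep the previous decision".
    enum Decision { ACCEPT, REJECT, NONE }

    interface ScopeRule {
        Decision decide(String uri, String hopsPath);
    }

    static Decision decideSequence(java.util.List<ScopeRule> rules,
                                   String uri, String hopsPath) {
        Decision result = Decision.REJECT;   // start pessimistic; the stock config's first rule rejects everything
        for (ScopeRule rule : rules) {
            Decision d = rule.decide(uri, hopsPath);
            if (d != Decision.NONE) {
                result = d;                  // a later decisive rule overrides
            }
        }
        return result;
    }

    // So TooManyHopsDecideRule may REJECT a deep URI, and the later
    // TransclusionDecideRule may still ACCEPT it again, which gives the
    // "keep the embeds of in-scope pages" behaviour described above.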

cgr71ii commented 2 years ago

Ohh, ok, I think I got it! So SurtPrefixedDecideRule is a rule that, because seedsAsSurtPrefixes = true, processes all the seeds and accepts all URIs which share the same prefix; the seeds are processed to derive that prefix, which in this case would be http://(site1,)/pages/ in SURT form from our seed http://site1/pages/home.html. So:

  1. <img src=/pages/1.jpg> -> http://(site1,)/pages/1.jpg -> the URI matches our seed prefix, so it is accepted by SurtPrefixedDecideRule.
  2. <img src=/images/2.jpg> -> http://(site1,)/images/2.jpg -> the URI doesn't match our seed prefix, but it is still accepted by TransclusionDecideRule if maxTransHops >= 1.
  3. <img src=http://site2/3.jpg> -> http://(site2,)/3.jpg -> the URI doesn't match our seed prefix, but it is accepted by TransclusionDecideRule if maxTransHops >= 1.
  4. <a href=/pages/4.html> -> http://(site1,)/pages/4.html -> the URI matches our seed prefix, so it is accepted by SurtPrefixedDecideRule (and a plain navigation link wouldn't be accepted by TransclusionDecideRule anyway).
  5. <a href=/other/5.html> -> http://(site1,)/other/5.html -> the URI doesn't match our seed prefix, and TransclusionDecideRule doesn't accept it either because the breadcrumb is "LL".
  6. <a href=http://site2/6.html> -> http://(site2,)/6.html -> the URI doesn't match our seed prefix, and TransclusionDecideRule doesn't accept it either because the breadcrumb is "LL".

Now everything makes sense!

My guess is that this is intentional. It's not uncommon to do shallow crawls by setting maxHops to a small value like 1 or 2, and in the typical web archiving use case Heritrix was designed for, you wouldn't want to capture HTML pages without the embedded resources needed to render them.

Oh, ok, I hadn't thought of it that way. Since the intention of the Internet Archive is to render the crawled content, it makes complete sense for the rules to be in the exact order they are. I was thinking of the case where all you need is text, which is my use case, and where I don't think I need to download the extra content.

Thank you for all the support you've provided! It's helped me a lot to understand a little better how URIs are accepted or rejected in Heritrix!