ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.
Add "Scope+N Hops" scoping support #14

Open anjackson opened 6 years ago

anjackson commented 6 years ago

It would be handy to have a scoping mechanism that lets the crawl run N hops from the scope, ignoring redirects etc.

So, imagine we modify the outgoing links in a post-processor module, using a scope+L annotation to track how far off the original scope we are. Ideally, we could:

  1. Remove (but remember) any scope+? annotation from the outlink.
  2. Run the outlink through the scope, and see if it would get accepted.
  3. If it would not get accepted, append the current hop to the scope+ annotation.
  4. If we have a useful scope annotation, e.g. scope+L, add it to the outlink.
  5. Handle the outlink as normal.
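
As a rough illustration of steps 1 to 4, here is a language-neutral sketch in Python (not Heritrix code). The `Outlink` class, the `in_scope` callback and the `annotate_outlink` function are hypothetical names made up for this example. It also assumes that an outlink the scope accepts simply loses its annotation, which the steps above do not spell out.

```python
from dataclasses import dataclass, field
from typing import Callable, Set

PREFIX = "scope+"


@dataclass
class Outlink:
    """Hypothetical stand-in for a discovered link (not a Heritrix class)."""
    url: str
    hop: str  # single-character hop type, e.g. "L" for a plain link
    annotations: Set[str] = field(default_factory=set)


def annotate_outlink(outlink: Outlink, in_scope: Callable[[Outlink], bool]) -> Outlink:
    # Step 1: remove (but remember) any existing scope+? annotation.
    remembered = ""
    for annotation in list(outlink.annotations):
        if annotation.startswith(PREFIX):
            outlink.annotations.discard(annotation)
            remembered = annotation[len(PREFIX):]

    # Step 2: run the outlink through the scope.
    if in_scope(outlink):
        # Assumption: back in scope means zero hops off scope, so no annotation.
        return outlink

    # Step 3: not accepted, so append the current hop to the remembered path.
    off_scope_path = remembered + outlink.hop

    # Step 4: attach the updated annotation, e.g. "scope+L".
    outlink.annotations.add(PREFIX + off_scope_path)
    return outlink
```

Step 5 is then whatever the crawler already does with outlinks. For example, a link that was already one hop off scope and leads to another out-of-scope page ends up annotated `scope+LL`.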

This makes it possible to track how far off scope we are, but only works if we ALSO add a new decide rule that uses the scope+? annotation and ACCEPTS outlinks into the frontier if they are within a configurable range.
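
A minimal sketch of what such a decide rule could look like, again in Python rather than as a real Heritrix module. The `Decision` enum, `decide_scope_plus_n` and its parameters are hypothetical. The sketch assumes Heritrix-style single-character hop codes where `R` marks a redirect, and it treats "ignoring redirects" as not counting `R` hops towards the limit.

```python
from enum import Enum
from typing import Iterable

PREFIX = "scope+"


class Decision(Enum):
    """Hypothetical three-way result: accept, or express no opinion."""
    ACCEPT = "accept"
    NONE = "none"


def decide_scope_plus_n(annotations: Iterable[str], max_hops: int,
                        ignored_hops: str = "R") -> Decision:
    for annotation in annotations:
        if annotation.startswith(PREFIX):
            path = annotation[len(PREFIX):]
            # Count only the hops that are not ignored (redirects by default).
            counted = sum(1 for hop in path if hop not in ignored_hops)
            if counted <= max_hops:
                return Decision.ACCEPT
            return Decision.NONE
    # No scope+ annotation: leave the decision to the other rules.
    return Decision.NONE
```

With `max_hops=1`, an outlink annotated `scope+RL` would be accepted (one counted hop), while `scope+LRL` would not (two counted hops). Returning "no opinion" rather than rejecting is a design choice in this sketch, so that the rule only ever widens the scope.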

If re-running the scope is problematic (especially when distributed crawling means separate crawl engines -- see #13 for a related example of this problem), we can use a simpler alternative. If the outlink URL is not on the same host as the Source (i.e. the seed), we add/append the scope+? annotation. This hardcodes the scoping as host + N hops but is probably acceptable in practice.
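
The host-based alternative could be sketched roughly like this, in Python and independent of any Heritrix API. `host_based_annotation` and its parameters are hypothetical names. The comparison is on the exact hostname, so in this sketch `www.example.org` and `example.org` count as different hosts.

```python
from typing import Optional
from urllib.parse import urlsplit

PREFIX = "scope+"


def host_based_annotation(outlink_url: str, source_url: str, hop: str,
                          parent_annotation: Optional[str] = None) -> Optional[str]:
    """Return the scope+? annotation for an outlink, or None if it is on the seed host."""
    # urlsplit(...).hostname is lower-cased, so the comparison is case-insensitive.
    same_host = urlsplit(outlink_url).hostname == urlsplit(source_url).hostname
    if same_host:
        return None  # still on the seed host: nothing to track

    # Off the seed host: add a new annotation or append to the inherited one.
    path = parent_annotation[len(PREFIX):] if parent_annotation else ""
    return PREFIX + path + hop
```

For example, a first link off the seed host gets `scope+L`, and a further off-host link found from that page gets `scope+LL`.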