It would be handy to have a scoping mechanism that let the crawl run N hopes from the scope, ignoring redirects etc.
So, imagine we modify the outgoing links in a post-processor module, using a scope+L annotation to track how far off the original scope we are. Ideally, we could:
Remove (but remember) any scope+? annotation from the outlink.
Run the outline through the scope, and see if it would get accepted.
If it would not get accepted, append the current hop to the scope+ annotation.
If we have a useful scope annotation, e.g. scope+L, add it to the outline.
Handle the outlink as normal.
This makes it possible to track how far off scope we are, but only works if we ALSO add a new decide rule that uses the scope+? annotation and ACCEPTS outlines into the frontier if they are within a configurable range.
If re-running the scope is problematic (especially when distributed crawling means separate crawl engines -- see #13 for a related example of this problem), we can use a simpler alternative. If the outline URL is not on same host as the Source (i.e. the seed), we add/append the scope+? annotation. This hardcodes the scoping as host + N hops but is probably acceptable in practice.
It would be handy to have a scoping mechanism that let the crawl run
N
hopes from the scope, ignoring redirects etc.So, imagine we modify the outgoing links in a post-processor module, using a
scope+L
annotation to track how far off the original scope we are. Ideally, we could:scope+?
annotation from the outlink.scope+
annotation.scope+L
, add it to the outline.This makes it possible to track how far off scope we are, but only works if we ALSO add a new decide rule that uses the
scope+?
annotation andACCEPTS
outlines into the frontier if they are within a configurable range.If re-running the scope is problematic (especially when distributed crawling means separate crawl engines -- see #13 for a related example of this problem), we can use a simpler alternative. If the outline URL is not on same host as the Source (i.e. the seed), we add/append the
scope+?
annotation. This hardcodes the scoping ashost + N hops
but is probably acceptable in practice.