ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Create a url-frontier Frontier implementation #80

Open anjackson opened 2 years ago

anjackson commented 2 years ago

Building on the experience with the Redis-based frontier, it should be possible to build a frontier based on url-frontier. The rough outline of the approach is in this discussion: https://github.com/crawler-commons/url-frontier/discussions/12#discussioncomment-1229076

The main problem is that H3 relies on queue prioritization to make sure pre-requisites are crawled, in contrast to many other crawlers that handle things like DNS or robots.txt outside of the crawl frontier. When H3 finds a pre-requisite, it pushes the current URL back into the queue and enqueues the pre-requisite so that it will be dequeued first. This can be done with url-frontier, although I think it takes advantage of a grey area in the API spec, so it's not clear whether the behaviour would be immediately portable to other implementations.
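To make the pre-requisite pattern concrete, here is a minimal in-memory sketch of one host queue. It is not the url-frontier API and not Heritrix code: the class `PrerequisiteQueueSketch`, the method `deferFor`, and the two precedence constants are hypothetical names chosen for illustration, and the tie-breaking rule (FIFO within the same precedence) is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a single host queue; NOT the url-frontier API. */
class PrerequisiteQueueSketch {

    // Assumed precedence values: a lower number is dequeued first.
    static final int PREREQUISITE = 0;
    static final int NORMAL = 1;

    /** A queued URL with its precedence and insertion sequence number. */
    static final class Entry {
        final String url;
        final int precedence;
        final long seq;

        Entry(String url, int precedence, long seq) {
            this.url = url;
            this.precedence = precedence;
            this.seq = seq;
        }
    }

    // Order by precedence, then by insertion order (FIFO within a precedence).
    private final PriorityQueue<Entry> queue = new PriorityQueue<>(
            Comparator.comparingInt((Entry e) -> e.precedence)
                      .thenComparingLong(e -> e.seq));
    private long counter = 0;

    void enqueue(String url, int precedence) {
        queue.add(new Entry(url, precedence, counter++));
    }

    /** Returns the next URL, or null when the queue is empty. */
    String dequeue() {
        Entry e = queue.poll();
        return e == null ? null : e.url;
    }

    /** Push the current URL back and schedule its pre-requisite ahead of it. */
    void deferFor(String currentUrl, String prerequisiteUrl) {
        enqueue(currentUrl, NORMAL);
        enqueue(prerequisiteUrl, PREREQUISITE);
    }

    /** Runs the scenario and returns the order in which URLs come back out. */
    static List<String> demo() {
        PrerequisiteQueueSketch q = new PrerequisiteQueueSketch();
        q.enqueue("http://example.org/a", NORMAL);
        q.enqueue("http://example.org/b", NORMAL);

        // The crawler picks up /a and discovers it needs robots.txt first.
        String current = q.dequeue();
        q.deferFor(current, "http://example.org/robots.txt");

        List<String> order = new ArrayList<>();
        String next;
        while ((next = q.dequeue()) != null) {
            order.add(next);
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the pre-requisite always comes out first because of its precedence. Where the deferred URL lands relative to other pending URLs depends on the tie-breaking rule, which is the kind of detail that may differ between frontier implementations.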

Notes:

See also https://github.com/crawler-commons/url-frontier/discussions/45

jnioche commented 2 years ago

@anjackson https://github.com/crawler-commons/url-frontier/issues/42 has been fixed, version 1.1 of the URLFrontier service is now available as a Maven dependency.