The main problem is that H3 relies on queue prioritization to make sure pre-requisites are crawled, in contrast to many other crawlers that handle things like DNS or robots.txt outside of the crawl frontier. When H3 finds a pre-requisite it pushes the current URL back into the queue and enqueues the pre-requisite so that it will be dequeued first. This can be done with url-frontier, although I think it's taking advantage of a grey area in the API spec, and so it's not clear if the behaviour would be immediately portable to other implementations.
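To make the push-back pattern concrete, here is a minimal sketch using a toy in-memory queue. It does not use the url-frontier API or any Heritrix code: `ToyQueue`, `prerequisite_for` and `crawl` are hypothetical names, the numeric priority is an assumption standing in for whatever ordering the real frontier offers, and robots.txt is the only pre-requisite modelled.

```python
import heapq
import itertools
from urllib.parse import urlsplit


class ToyQueue:
    """In-memory stand-in for one frontier queue (NOT the url-frontier API).

    Items are ordered by (priority, insertion order); a lower priority
    number is dequeued first.
    """

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def put(self, url, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def get(self):
        priority, _, url = heapq.heappop(self._heap)
        return url, priority

    def __len__(self):
        return len(self._heap)


def prerequisite_for(url, done):
    """Hypothetical check: a URL needs its host's robots.txt fetched first."""
    parts = urlsplit(url)
    robots = f"{parts.scheme}://{parts.netloc}/robots.txt"
    if url == robots or robots in done:
        return None
    return robots


def crawl(seeds):
    """Return URLs in the order they would be fetched."""
    queue = ToyQueue()
    for seed in seeds:
        queue.put(seed, priority=10)

    fetched, done = [], set()
    while len(queue):
        url, priority = queue.get()
        prereq = prerequisite_for(url, done)
        if prereq is not None:
            # Push the current URL back, then enqueue the pre-requisite
            # with a better priority so that it is dequeued first.
            queue.put(url, priority)
            queue.put(prereq, priority - 1)
            continue
        fetched.append(url)  # stand-in for the actual fetch
        done.add(url)
    return fetched
```

The whole scheme depends on the frontier honouring the relative order of the two `put` calls, which is exactly the part that may not carry over to other implementations of the API.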
Notes:
I think it should be possible to code this so that the URL Frontier can be either embedded directly or accessed over gRPC (at least once https://github.com/crawler-commons/url-frontier/issues/42 is implemented). Having a fully local option might aid uptake for institutions that don't like running this kind of thing as a service suite.
The new Crawl-ID field could be used to share the instance with multiple jobs, and even allow clients to shift URLs between crawlers.
As with the Redis implementation, to fully and transparently integrate into Heritrix as-is, it is necessary to store the (e.g. Kryo) serialised CrawlURI in its entirety. This is pretty horrible and not really in the spirit of using an external frontier, but it is likely an unavoidable arrangement, at least for now.
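As a rough illustration of the "store the whole serialised object" arrangement, the sketch below round-trips a crawl item through an opaque text blob carried alongside the URL. Everything here is an assumption made for illustration: `ToyCrawlURI` is a tiny stand-in for Heritrix's CrawlURI, JSON plus base64 stands in for Kryo, and the `{"url": ..., "metadata": ...}` dict and the `serialised_curi` key are hypothetical rather than the url-frontier message format.

```python
import base64
import json
from dataclasses import asdict, dataclass, field

# Hypothetical metadata key under which the opaque blob is stored.
BLOB_KEY = "serialised_curi"


@dataclass
class ToyCrawlURI:
    """Tiny stand-in for a CrawlURI; the real class carries far more state."""
    url: str
    hop_path: str = ""
    via: str = ""
    extra: dict = field(default_factory=dict)


def to_frontier_item(curi):
    """Pack the whole object into a text blob stored next to the URL."""
    payload = json.dumps(asdict(curi)).encode("utf-8")
    blob = base64.b64encode(payload).decode("ascii")
    return {"url": curi.url, "metadata": {BLOB_KEY: [blob]}}


def from_frontier_item(item):
    """Rebuild the object from the blob; the frontier never inspects it."""
    payload = base64.b64decode(item["metadata"][BLOB_KEY][0])
    return ToyCrawlURI(**json.loads(payload))
```

The frontier only ever sees the URL and an opaque string, which is why this integrates transparently, and also why it is not really in the spirit of an external frontier: no other client can make sense of the stored state.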
Building on the experience with the Redis-based frontier, it should be possible to build a frontier based on url-frontier. The rough outline of the approach is in this discussion: https://github.com/crawler-commons/url-frontier/discussions/12#discussioncomment-1229076
See also https://github.com/crawler-commons/url-frontier/discussions/45