ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Pass DOM from WrenderProcessor along to the extractor(s) #28

Open anjackson opened 5 years ago

anjackson commented 5 years ago

Rather than skipping the rest of the chain, a successful WrenderProcessor should skip only FetchHTTP and let the rest of the chain run, especially the extractors, so that we have a chance of picking up things we would otherwise miss, such as srcset URLs.

Unfortunately, this likely means modifying or subclassing FetchHTTP itself so it only runs if there's no status code set (or otherwise infers it need not run).

It would also mean populating the CrawlURI properly, using the decoded renderedContent (compare this part of FetchHTTP):

https://github.com/internetarchive/heritrix3/blob/05811705ed996122bea1f4e034c1ed5f7a07240f/modules/src/main/java/org/archive/modules/fetcher/FetchHTTP.java#L999-L1016
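To make the proposed control flow concrete, here is a minimal, self-contained sketch of a chain in which the fetch step runs only when no status code has been set yet. Every class here (`Uri`, `Step`, `Renderer`, `Fetcher`, `Extractor`) is a hypothetical stand-in written for this illustration, not a Heritrix3 class; the only idea borrowed from Heritrix3 is that a processor decides per URI whether it should run. In this sketch the renderer always succeeds, which a real WrenderProcessor would not.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: hypothetical stand-ins, not the Heritrix3 API. */
class SkipFetchSketch {

    /** Stand-in for a CrawlURI: a URI, a fetch status, content, and a log of steps that ran. */
    static class Uri {
        final String uri;
        int fetchStatus = 0;              // 0 means "nothing has fetched this yet"
        String content = null;
        final List<String> ran = new ArrayList<>();
        Uri(String uri) { this.uri = uri; }
    }

    /** Stand-in for a processor: runs innerProcess only if shouldProcess says so. */
    static abstract class Step {
        boolean shouldProcess(Uri u) { return true; }
        abstract void innerProcess(Uri u);
        final void process(Uri u) {
            if (shouldProcess(u)) {
                innerProcess(u);
                u.ran.add(getClass().getSimpleName());
            }
        }
    }

    /** Pretends to render the page and records the rendered DOM and a status code. */
    static class Renderer extends Step {
        void innerProcess(Uri u) {
            u.fetchStatus = 200;
            u.content = "<img srcset=\"a.png 1x, b.png 2x\">";
        }
    }

    /** The fetch step: skipped when an earlier step has already set a status code. */
    static class Fetcher extends Step {
        @Override boolean shouldProcess(Uri u) { return u.fetchStatus == 0; }
        void innerProcess(Uri u) {
            u.fetchStatus = 200;
            u.content = "<html>fetched</html>";
        }
    }

    /** Extraction runs whenever there is content, whichever step produced it. */
    static class Extractor extends Step {
        @Override boolean shouldProcess(Uri u) { return u.content != null; }
        void innerProcess(Uri u) { /* link extraction would go here */ }
    }

    /** Runs the chain and returns the names of the steps that actually ran. */
    static List<String> run(Uri u, boolean render) {
        List<Step> chain = new ArrayList<>();
        if (render) chain.add(new Renderer());
        chain.add(new Fetcher());
        chain.add(new Extractor());
        for (Step s : chain) s.process(u);
        return u.ran;
    }

    public static void main(String[] args) {
        System.out.println(run(new Uri("http://example.org/"), true));   // [Renderer, Extractor]
        System.out.println(run(new Uri("http://example.org/"), false));  // [Fetcher, Extractor]
    }
}
```

In Heritrix3 itself the equivalent check would presumably live in a FetchHTTP subclass, and the extractors would still need the content exposed in the form they expect, which this sketch does not address.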

anjackson commented 5 years ago

Hm, it's not that easy. The extractors (and other processors) rely on accessing the Recorder, e.g.:

            ReplayCharSequence cs = curi.getRecorder().getContentReplayCharSequence();

Also, when doing this, we may need to take steps to stop Heritrix3 (H3) from writing the rendered content to the WARCs.

anjackson commented 5 years ago

Perhaps the post-crawl render-patching idea is a better one.