Open anjackson opened 5 years ago
Hm, not that easy. The Extractors etc. rely on accessing the Recorder.
ReplayCharSequence cs = curi.getRecorder().getContentReplayCharSequence();
Also, when doing this, may need to take steps to avoid H3 writing it to the WARCs.
Perhaps the post-crawl render-patching idea is a better one.
Rather than skipping the rest of the chain, a successful WrenderProcessor should skip only FetchHTTP and let the rest of the chain run, especially the extractors, so we have a chance of getting things we missed, like srcset URLs.
Unfortunately, this likely means modifying or subclassing FetchHTTP itself so it only runs if there's no status code set (or otherwise infers it need not run).
It would also mean populating the CrawlURI properly (https://github.com/internetarchive/heritrix3/blob/05811705ed996122bea1f4e034c1ed5f7a07240f/modules/src/main/java/org/archive/modules/fetcher/FetchHTTP.java#L999-L1016) using the decoded renderedContent.
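To make the intended control flow concrete, here is a rough, self-contained sketch of the "skip only the fetch" idea. It deliberately does not use the Heritrix API: Uri, RenderStep, FetchStep and ExtractStep are hypothetical stand-ins for CrawlURI, WrenderProcessor, FetchHTTP and the extractors, and treating a status of 0 as "not attempted yet" is an assumption of the sketch. A real change would also need to populate the Recorder so the extractors can read the rendered content.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipFetchSketch {
    /** Hypothetical stand-in for CrawlURI; holds only what this sketch needs. */
    static class Uri {
        int fetchStatus = 0;   // assumption: 0 means "not attempted yet"
        String content = null; // stand-in for the recorded response body
        final List<String> log = new ArrayList<>();
    }

    interface Step { void process(Uri uri); }

    /** Stand-in for WrenderProcessor: on success, sets status and rendered content. */
    static class RenderStep implements Step {
        private final boolean succeeds;
        RenderStep(boolean succeeds) { this.succeeds = succeeds; }
        public void process(Uri uri) {
            if (!succeeds) return; // leave the URI untouched so the fetch still runs
            uri.fetchStatus = 200;
            uri.content = "<img srcset=\"a.png 1x, b.png 2x\">";
            uri.log.add("render");
        }
    }

    /** Stand-in for a FetchHTTP variant that only runs when no status is set. */
    static class FetchStep implements Step {
        static boolean shouldProcess(Uri uri) { return uri.fetchStatus == 0; }
        public void process(Uri uri) {
            if (!shouldProcess(uri)) return; // already rendered: skip only this step
            uri.fetchStatus = 200;
            uri.content = "<html></html>";
            uri.log.add("fetch");
        }
    }

    /** Stand-in for the extractors: runs whenever there is content to read. */
    static class ExtractStep implements Step {
        public void process(Uri uri) {
            if (uri.content != null) uri.log.add("extract");
        }
    }

    /** Runs the three steps in order and returns which ones actually did work. */
    static List<String> run(boolean renderSucceeds) {
        Uri uri = new Uri();
        List<Step> chain = Arrays.asList(
                new RenderStep(renderSucceeds), new FetchStep(), new ExtractStep());
        for (Step step : chain) step.process(uri);
        return uri.log;
    }

    public static void main(String[] args) {
        System.out.println(run(true));  // [render, extract]
        System.out.println(run(false)); // [fetch, extract]
    }
}
```

The point of the sketch is that the guard lives in the fetch step itself, so nothing else in the chain has to know whether the content came from the renderer or from a normal HTTP fetch.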