[Question] Where should I set the content obtained from an HTTP request?

internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

I am extending Heritrix's org.archive.modules.fetcher.FetchHTTP module and overriding its innerProcess method so that a headless browser fetches the content instead of Heritrix's built-in HTTP request:


    @Override
    protected void innerProcess(CrawlURI curi) throws InterruptedException { }

I read through the source of the FetchHTTP module, but I can't figure out where it actually sets the content obtained from the request.

    protected void addResponseContent(HttpResponse response, CrawlURI curi) {
        curi.setFetchStatus(response.getStatusLine().getStatusCode());
        Header ct = response.getLastHeader("content-type");
        curi.setContentType(ct == null ? null : ct.getValue());

        for (Header h: response.getAllHeaders()) {
            curi.putHttpResponseHeader(h.getName(), h.getValue());
        }
    }

The above method is called when the HTTP request succeeds, but I couldn't find any setter in it for the content fetched from a URL (for example, an HTML page).

How can I set the HTML content so that Heritrix can proceed to extract the links from it?

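Assuming your content is supplied by an InputStream called stream, something like this will probably work:

    // Use the Recorder already attached to this CrawlURI.
    Recorder recorder = curi.getRecorder();
    // Mark where the response body begins.
    recorder.markContentBegin();
    // Wrap the stream so that bytes read through it are recorded.
    recorder.inputWrap(stream);
    // Read the wrapped stream to the end, then close the recorders.
    recorder.getRecordedInput().readFully();
    recorder.closeRecorders();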

handleCapturedRequest() in ExtractorChrome may be a relevant example of integrating Heritrix with a headless browser. Keep in mind, though, that it records subrequests on a background thread and so has to jump through a lot more hoops. Since you're writing a Fetch processor, you don't have to set up your own recorder: you can use the one already supplied by the ToeThread, and you don't need to call the extractors yourself either.
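
If it helps to see why wrapping the stream and then reading it to the end is enough to capture the page, here is a minimal stand-alone sketch of the same record-while-reading idea using only the JDK. RecorderSketch, TeeInputStream and recordFully are hypothetical names made up for illustration; they are not Heritrix classes, and the real Recorder does considerably more than this.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class RecorderSketch {

    /** Hypothetical stand-in for a recording stream: keeps a copy of every byte read. */
    static class TeeInputStream extends FilterInputStream {
        private final ByteArrayOutputStream recorded = new ByteArrayOutputStream();

        TeeInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b != -1) {
                recorded.write(b);
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                recorded.write(buf, off, n);
            }
            return n;
        }

        byte[] recordedBytes() {
            return recorded.toByteArray();
        }
    }

    /** Wraps the stream, drains it, and returns what was recorded along the way. */
    static byte[] recordFully(InputStream stream) {
        try (TeeInputStream tee = new TeeInputStream(stream)) {
            byte[] scratch = new byte[4096];
            while (tee.read(scratch, 0, scratch.length) != -1) {
                // Nothing to do with the bytes here; the tee already kept a copy.
            }
            return tee.recordedBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pretend this HTML came back from the headless browser.
        String html = "<html><a href=\"/next\">next</a></html>";
        byte[] captured = recordFully(
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
        System.out.println(new String(captured, StandardCharsets.UTF_8));
    }
}
```

In this sketch the wrapping step plays the role of inputWrap(stream) and the drain loop plays the role of readFully(): nothing is stored until the wrapped stream is actually read, which is why the read-to-end call matters.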

[Question] Where should i set the content obtained from http request ? #438