anjackson opened this issue 3 years ago
Thanks for making this list! It looks like a few of these are either already in place, or within reach.
Some comments below.
> - Adding an Express server mode that supports the API, returning URLs marked as embeds or links. See `WrenderProcessor.processHar`.
Browsertrix-Crawler already does link extraction; could this perhaps be expanded from that? This could just be a JSON response, per page?
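For example (a hypothetical shape, not an existing Browsertrix-Crawler response format), such a per-page JSON response might separate embedded resources from navigational outlinks:

```javascript
// Hypothetical sketch of a per-page JSON response; the field names and
// the `kind` tagging are illustrative assumptions, not an actual API.
function pageLinksResponse(pageUrl, resources) {
  return {
    url: pageUrl,
    // Resources fetched as part of rendering the page (images, CSS, JS, ...)
    embeds: resources.filter(r => r.kind === 'embed').map(r => r.url),
    // Outlinks discovered in the DOM, for the crawler to enqueue
    links: resources.filter(r => r.kind === 'link').map(r => r.url),
  };
}

// Example usage:
const resp = pageLinksResponse('https://example.org/', [
  { url: 'https://example.org/style.css', kind: 'embed' },
  { url: 'https://example.org/about', kind: 'link' },
]);
console.log(JSON.stringify(resp, null, 2));
```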
> - Supporting screenshots (as per [Support Screenshot Creation webrecorder/browsertrix-crawler#11](https://github.com/webrecorder/browsertrix-crawler/issues/11)), but also pdf, thumbnail, imagemap, onreadydom, and har outputs.
There is a PR that adds this functionality, but I wanted to confirm the spec / align with what you're looking for: webrecorder/browsertrix-crawler#40. Maybe you can add the exact requirements to the issue so we can track them better.
> - User Agent override.
Supported.
> - WARC file rotation, as per Post-process smaller WARCs to larger WARC with warcinfo (webrecorder/browsertrix-crawler#14).
Supported.
Not yet supported, but would be easy to add as extra config options.
> - clickKnownModals to get rid of things like cookie popups before scrolling down.
This would be a great addition to the autoscroll behavior in browsertrix-behaviors and/or for profile creation! I guess a big complication would be non-English modals...
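As a rough illustration of the idea (the selectors and the `page` API shape below are assumptions, not actual browsertrix-behaviors code), a clickKnownModals step could try a fixed list of selectors known to belong to cookie/consent popups before autoscrolling starts:

```javascript
// Illustrative sketch, not actual browsertrix-behaviors code.
// The selector list would grow over time as new consent widgets are met.
const KNOWN_MODAL_SELECTORS = [
  '#onetrust-accept-btn-handler',        // OneTrust consent banner
  'button[aria-label="Accept cookies"]', // generic ARIA-labelled button
  '.cc-dismiss',                         // cookieconsent library
];

// `page` is assumed to expose a Puppeteer-like `$(selector)` returning an
// element handle (with a `click()` method) or null when nothing matches.
async function clickKnownModals(page) {
  const clicked = [];
  for (const selector of KNOWN_MODAL_SELECTORS) {
    const el = await page.$(selector);
    if (el) {
      await el.click();
      clicked.push(selector);
    }
  }
  return clicked; // report which modals were dismissed, e.g. for the crawl log
}
```

Matching on selectors rather than button text is language-neutral, which sidesteps part of the non-English-modal problem, though coverage would always be best-effort.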
> - Ability to post the crawl log to OutbackCDX, and to emit a crawl log e.g. via Kafka or Redis Streams (currently done via warcprox plugins).
What are you looking for in a crawl log? Started this issue to discuss there, I imagine it could be useful more broadly: webrecorder/browsertrix-crawler#74
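To make the discussion concrete, here is a hedged sketch of what a per-URL crawl-log event might carry, loosely modelled on the fields Heritrix's crawl.log records; the actual schema is exactly what webrecorder/browsertrix-crawler#74 should settle:

```javascript
// A sketch of a per-URL crawl-log event, loosely modelled on Heritrix's
// crawl.log fields. Field names here are assumptions for discussion.
function crawlLogEntry({ url, status, size, mime, referrer, digest }) {
  return {
    timestamp: new Date().toISOString(),
    url,
    status,    // HTTP status (or a crawler-specific error code)
    size,      // bytes fetched
    mime,      // served content type
    referrer,  // page the URL was discovered on
    digest,    // payload hash, useful for dedup / OutbackCDX integration
  };
}

// Entries like this could be serialised as JSON lines and pushed to a
// Kafka topic or Redis Stream by whatever transport the crawler prefers.
console.log(JSON.stringify(crawlLogEntry({
  url: 'https://example.org/', status: 200, size: 1256,
  mime: 'text/html', referrer: null, digest: 'sha1:DA39A3EE...',
})));
```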
> - The current implementation uses the `Warcprox-Meta: { 'warc-prefix': ... }` header to match up the WARC prefix with the crawl job. But we can perhaps live without that (as it's brittle/difficult to ensure all requests get the header anyway).
I believe this is configurable via the pywb config.yaml; it can also be added as a config option, or as an addition to --combineWARC. Actually, combineWARC already sets the filename prefix to the collection name, so this would be easy to add.
> Additional things we don't have yet but would like:
> - Patch mode, including device switching option.
About half-way there: we have device emulation settings, and dedup via Redis.
> - Playback mode (but screenshot not WARCed, just returned for use). Supporting the Memento-Datetime header.
Supported (just running browsertrix-crawler with a different command).
> - Virus scanning and redirecting matches to a different WARC (maybe settle for scanning and logging actually, given false positives).
Need more info on this one :)
When I originally looked at this, the UKWA implementation was so far from how Browsertrix-Crawler works that I couldn't really make sense of how to proceed. So, in the meantime, I started adding the API service I needed to UKWA's webrender-puppeteer codebase. This is currently working okay, i.e. it integrates with Heritrix, but it has not been put into production yet.
The server is launched like this:

```
node server.js
```

where `server.js` is the script in that codebase. As you can see, it partially duplicates things like the puppeteer-cluster code in your implementation.
It then calls my `renderer.js:render_page`, which is kinda like your behaviour scripts, and includes viewport fiddling, screenshots, and other formats (I dropped thumbnails as I think it's better to generate those at access time, and I didn't want to add an image manipulation dependency, but YMMV).
It adds WARC records for each of these rendered outputs.
The code then wraps the HAR with some useful information, and this is passed back to H3 as JSON.
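As an illustrative sketch only (the field names are assumptions, not the actual webrender-puppeteer response format), such a wrapper might look like:

```javascript
// Hypothetical sketch of wrapping a HAR with extra rendering metadata
// before returning it to Heritrix (H3). Field names are illustrative.
function wrapHar(har, { url, finalUrl, warcFiles }) {
  return {
    url,       // the URL Heritrix asked us to render
    finalUrl,  // where rendering actually ended up (after redirects)
    renderedAt: new Date().toISOString(),
    warcFiles, // WARCs the renderer wrote, so the caller can track them
    har,       // the raw HAR, from which H3 can extract outlinks
  };
}

// e.g. in an Express-style handler: res.json(wrapHar(har, { url, finalUrl, warcFiles }))
```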
So, doing this has made the alignment a little clearer, but I'm still not sure how best to bludgeon/neatly-integrate this all into place.
ADDITIONAL: A general point here is that for my purposes this needs to be a long running API service and so rendering options would be passed in as API parameters rather than startup options.
Rather than our own `webrender-api`, consider switching to https://github.com/webrecorder/browsertrix-crawler. The integration pattern is somewhat different to Browsertrix's primary use case, but it could work very well as a page rendering service, integrated into our crawl system, i.e. Heritrix calls a Browsertrix API and gets the links back for enqueuing. That service emits WARCs and crawl log events that allow us to integrate with Heritrix's crawl state and the crawl-time access service.
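From the Heritrix side, that call-and-enqueue pattern could be sketched like this (the endpoint path and response fields are assumptions, not an actual Browsertrix API):

```javascript
// Sketch of the crawler-side integration: ask a (hypothetical) rendering
// endpoint to process a page, then return the discovered links so the
// caller can enqueue them. Uses Node >= 18's global fetch.
async function renderAndGetOutlinks(renderServiceBase, pageUrl) {
  const endpoint = `${renderServiceBase}/render?url=${encodeURIComponent(pageUrl)}`;
  const res = await fetch(endpoint);
  if (!res.ok) throw new Error(`render failed: ${res.status}`);
  const body = await res.json();
  // The service writes WARCs and emits crawl-log events as side effects;
  // the caller only needs the outlinks for enqueuing.
  return body.links || [];
}
```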
To match functionality, this would require the features discussed above: the Express/API server mode, screenshots and other render outputs, User Agent override, WARC rotation and prefix handling, and crawl log integration.