ukwa / webrender-api

A RESTful API for rendering web pages

Replace with browsertrix-crawler #9

Open anjackson opened 3 years ago

anjackson commented 3 years ago

Rather than our own webrender-api, consider switching to https://github.com/webrecorder/browsertrix-crawler

The integration pattern is somewhat different to Browsertrix's primary use case, but it could work very well as a page rendering service integrated into our crawl system, i.e. Heritrix calls a Browsertrix API and gets the links back for enqueuing. That service emits WARCs and crawl log events that allow us to integrate with Heritrix's crawl state and the crawl-time access service.

To match functionality, this would require:

Additional things we don't have yet but would like:

ikreymer commented 2 years ago

Thanks for making this list! It looks like a few of these are either already in place, or within reach.

Some comments below.

BC already does link extraction, could this be expanded from that perhaps? This could just be a json response, per page?

There is a PR that adds this functionality, but I wanted to confirm the spec and align it with what you're looking for: webrecorder/browsertrix-crawler#40. Maybe you can add the exact requirements to that issue so we can track them better.

  • User Agent override.

Supported.

Supported.

Not yet supported, but would be easy to add as extra config options.

  • clickKnownModals to get rid of things like cookie popups before scrolling down.

This would be a great addition to the autoscroll behavior in browsertrix-behaviors and/or for profile creation! I guess a big complication would be non-English modals...

  • Ability to post crawl log to OutbackCDX, and to emit a crawl log e.g. via Kafka or Redis Streams (currently done via warcprox plugins).

What are you looking for in a crawl log? Started this issue to discuss there, I imagine it could be useful more broadly: webrecorder/browsertrix-crawler#74

  • The current implementation uses the Warcprox-Meta: { 'warc-prefix': to match up the WARC prefix with the crawl job. But we can perhaps live without that (as it's brittle/difficult to ensure all requests get the header anyway).

I believe this is configurable via the pywb config.yaml, can also be added as a config option, or as an addition to --combineWARC. Actually, combineWARC already sets the filename prefix to the collection name, so would be easy to add.

Additional things we don't have yet but would like:

About half-way there, we have device emulation settings and dedup via redis.

  • Playback mode (but screenshot not WARCd, just returned for use). Supporting Memento Datetime header.

Supported (just running browsertrix-crawler with different command).
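For reference, a minimal sketch of the request side of this, assuming "Memento Datetime" here means asking the playback service for a capture near a given time via the RFC 7089 `Accept-Datetime` request header. The helper name `acceptDatetimeHeader` is hypothetical and is not part of browsertrix-crawler or webrender-puppeteer.

```javascript
// Sketch only: builds the RFC 7089 Accept-Datetime request header.
// `acceptDatetimeHeader` is a hypothetical helper name for illustration.
function acceptDatetimeHeader(date) {
  if (!(date instanceof Date) || Number.isNaN(date.getTime())) {
    throw new TypeError('Expected a valid Date');
  }
  // Date.prototype.toUTCString() yields the HTTP-date form that
  // Accept-Datetime expects, e.g. "Tue, 01 Jan 2019 00:00:00 GMT".
  return { 'Accept-Datetime': date.toUTCString() };
}
```

With Puppeteer, the returned object could presumably be handed to `page.setExtraHTTPHeaders()` before navigating to the playback URL.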

  • Virus scanning and redirecting matches to a different WARC (maybe settle for scanning and logging actually, given false positives).

Need more info on this one :)

anjackson commented 2 years ago

When I originally looked at this, the UKWA implementation was so far from how Browsertrix-Crawler works that I couldn't really make sense of how to proceed. So, in the meantime, I started adding the API service I needed to UKWA's webrender-puppeteer codebase. This is currently working okay, i.e. it integrates with Heritrix, but it has not been put into production yet.

The server is launched like this:

node server.js

Where this is server.js. As you can see, this partially duplicates things like the puppeteer-cluster code in your implementation.

It then calls my renderer.js:render_page, which is kinda like your behaviour scripts, and includes viewport fiddling, screenshots, and other formats (I dropped thumbnails as I think it's better to generate those at access time, and I didn't want to add an image manipulation dependency, but YMMV).

It adds WARC records for:

The code then wraps the HAR with some useful information and this is passed back to H3 as JSON:

https://github.com/ukwa/webrender-puppeteer/blob/1aed0423b2abd05daf41036e761a7b26da240fd1/renderer.js#L340-L344
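To illustrate the general shape of that step (not the linked code itself), here is a rough sketch of wrapping a HAR in a small JSON envelope. The wrapper field names (`requestedUrl`, `renderedAt`, `requestUrls`) and the function name `wrapHar` are assumptions made up for this example; only the `log.entries[].request.url` path comes from the HAR 1.2 format.

```javascript
// Sketch only: the envelope fields below are hypothetical and are NOT
// the fields that webrender-puppeteer actually returns to Heritrix.
function wrapHar(har, requestedUrl, renderedAt) {
  // HAR 1.2 records each network request under log.entries[].request.url.
  const entries = (har && har.log && har.log.entries) || [];
  // De-duplicate the request URLs while keeping first-seen order.
  const requestUrls = [...new Set(entries.map((entry) => entry.request.url))];
  return JSON.stringify({
    requestedUrl,
    renderedAt: renderedAt.toISOString(),
    requestUrls,
    har,
  });
}
```

A caller on the crawler side could then parse the response body and read whichever envelope fields it needs without walking the full HAR.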

So, doing this has made the alignment a little clearer, but I'm still not sure how best to bludgeon/neatly-integrate this all into place.

ADDITIONAL: A general point here is that for my purposes this needs to be a long running API service and so rendering options would be passed in as API parameters rather than startup options.
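As a rough sketch of what "rendering options as API parameters" could look like, the function below reads per-request options from a request URL's query string and falls back to defaults. The parameter names (`url`, `userAgent`, `viewportWidth`, `viewportHeight`), the default values, and the function name are all assumptions for illustration, not the existing webrender-puppeteer or browsertrix-crawler interface.

```javascript
// Sketch only: parameter names and defaults are hypothetical.
const DEFAULT_OPTIONS = Object.freeze({
  userAgent: null,
  viewportWidth: 1366,
  viewportHeight: 768,
});

function parseRenderOptions(requestUrl) {
  // The base is only needed so relative request paths parse correctly.
  const params = new URL(requestUrl, 'http://localhost').searchParams;

  const target = params.get('url');
  if (!target) {
    throw new Error('Missing required "url" parameter');
  }

  // Read a positive integer parameter, or use the default when absent.
  const toInt = (name, fallback) => {
    const raw = params.get(name);
    if (raw === null) return fallback;
    const value = Number.parseInt(raw, 10);
    if (!Number.isInteger(value) || value <= 0) {
      throw new Error(`Invalid value for "${name}": ${raw}`);
    }
    return value;
  };

  return {
    url: target,
    userAgent: params.get('userAgent') ?? DEFAULT_OPTIONS.userAgent,
    viewportWidth: toInt('viewportWidth', DEFAULT_OPTIONS.viewportWidth),
    viewportHeight: toInt('viewportHeight', DEFAULT_OPTIONS.viewportHeight),
  };
}
```

Because the options are resolved per request, one long-running service could render different pages with different settings without being restarted.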