Optional headless browser scrape

tomcardoso commented 4 years ago

In my newsroom (in Canada), we're currently scraping a few COVID-19 provincial pages. One of them uses Angular to render the page, so Klaxon as it currently stands isn't going to work: the content only appears after JavaScript runs, which requires a headless browser.

We have 100+ pages being watched for all sorts of reasons right now, so I don't think it makes sense to just rip out the current Net::HTTP-based scraper and replace it with something like webdrivers. Launching a browser for every page would make the scrape far too slow.

Instead, I'm thinking of making some tweaks that would allow for an optional headless browser scrape on a per-page level. All it would take is:

- A change to the Pages model to add a use_headless_browser boolean field
- A checkbox in the view for pages that sets that boolean field
- A change to the html method in the Pages model to optionally use a headless browser if that field is true
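
For anyone curious what that third change could look like, here's a rough sketch. It is not the code from our instance or from any PR: the class name PageSketch and the plain_fetcher/headless_fetcher arguments are made up for illustration, and using the selenium-webdriver gem with headless Chrome is just one assumed way to do the browser fetch. The field itself would be added with an ordinary Rails migration (something like add_column :pages, :use_headless_browser, :boolean, default: false).

```ruby
require "net/http"
require "uri"

# Hypothetical stand-in for the page model; NOT Klaxon's actual code.
class PageSketch
  # Default fetch: a plain HTTP GET, which is fast but runs no JavaScript.
  PLAIN_FETCH = ->(url) { Net::HTTP.get(URI(url)) }

  # Assumed headless fetch via the selenium-webdriver gem and Chrome.
  # The gem is required lazily so pages that never use it don't need it.
  HEADLESS_FETCH = lambda do |url|
    require "selenium-webdriver"
    options = Selenium::WebDriver::Chrome::Options.new
    options.add_argument("--headless")
    driver = Selenium::WebDriver.for(:chrome, options: options)
    begin
      driver.get(url)
      driver.page_source # HTML after the browser has loaded the page
    ensure
      driver.quit # always shut the browser down, even on errors
    end
  end

  attr_reader :url, :use_headless_browser

  def initialize(url:, use_headless_browser: false,
                 plain_fetcher: PLAIN_FETCH, headless_fetcher: HEADLESS_FETCH)
    @url = url
    @use_headless_browser = use_headless_browser
    @plain_fetcher = plain_fetcher
    @headless_fetcher = headless_fetcher
  end

  # Only pages flagged with use_headless_browser pay the browser cost;
  # everything else keeps using the plain HTTP request.
  def html
    fetcher = use_headless_browser ? @headless_fetcher : @plain_fetcher
    fetcher.call(url)
  end
end
```

Making the fetchers swappable is only there so the branching can be tested without a network or a browser. A page that loads its data asynchronously might also need an explicit wait before page_source is read.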

I can probably write this over the weekend. We can dogfood it, and if it works I can submit a PR. Would that be of interest?

tomcardoso commented 4 years ago

@GabeIsman @tommeagher I've now built this out. Turned out to be pretty straightforward. It's working on our own instance, too. Happy to share the code or prep a PR of some sort if you'd like… let me know.

tommeagher commented 4 years ago

@tomcardoso our motto at The Marshall Project is "PRs gladly accepted." If you have this running, we'd be happy to take a look and, if all is in order, merge it. This project really benefits from contributions from the community. If this approach is working for you, we'd love to see it. Sounds like it'd be helpful to a lot of others too.

tomcardoso commented 4 years ago

@tommeagher Pull request is now available at https://github.com/themarshallproject/klaxon/pull/318.

tommeagher commented 8 months ago

Hi all. With the recent release of Klaxon Cloud, we're going back and revisiting old issues that we're not going to pursue or support as we consider any future development of the original standalone Klaxon. This one (almost 4 years old now) falls in that bucket. Thanks for the contributions and discussions on this, but we'll close it as WONTFIX.

themarshallproject / klaxon

Optional headless browser scrape #317