themarshallproject / klaxon

Klaxon enables reporters and editors to monitor scores of sites on the web for newsworthy changes.
https://newsklaxon.org
MIT License

Optional headless browser scrape #317

tomcardoso closed 8 months ago

tomcardoso commented 4 years ago

In my newsroom (in Canada), we're currently scraping a few COVID-19 provincial pages. One of them uses Angular to render the page client-side, so Klaxon in its current form can't see the rendered content. Scraping that page will require a headless browser.

We have 100+ pages being watched for all sorts of reasons right now, so I don't think it makes sense to just rip out the current Net::HTTP-based scraper and replace it with something like webdrivers. The scrape would become far too slow.

Instead, I'm thinking of making some tweaks that would allow for an optional headless browser scrape on a per-page level. All it would take is:

I can probably write this over the weekend, we can dogfood it, and if it works I can submit a PR. Would that be of interest?
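
A minimal sketch of the per-page opt-in described above, to make the idea concrete. It is not the code from the eventual PR: the `WatchedPage` struct, the `use_headless` flag, and the `fetch_html` helper are all hypothetical names, and the headless fetcher is left as a placeholder to be wired to whatever browser driver is chosen. Only `Net::HTTP.get` is a real library call here.

```ruby
require "net/http"
require "uri"

# Hypothetical stand-in for a watched page. `use_headless` is an assumed
# per-page boolean, not a confirmed Klaxon column name.
WatchedPage = Struct.new(:url, :use_headless)

# Fast default path: a plain HTTP GET, matching the Net::HTTP-based scraper
# mentioned in the thread.
HTTP_FETCHER = ->(url) { Net::HTTP.get(URI.parse(url)) }

# Slow path: placeholder for a headless-browser fetch. Deliberately left
# unimplemented; plug in a real driver here.
HEADLESS_FETCHER = lambda do |url|
  raise NotImplementedError, "plug in a headless browser to fetch #{url}"
end

# Choose the fetcher per page so that only flagged pages pay the cost of
# starting a browser; everything else keeps using the plain HTTP request.
def fetch_html(page, http: HTTP_FETCHER, headless: HEADLESS_FETCHER)
  fetcher = page.use_headless ? headless : http
  fetcher.call(page.url)
end
```

Passing the fetchers as keyword arguments keeps the dispatch testable without a network connection or a browser, and leaves the 100+ existing pages on the fast path by default.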

tomcardoso commented 4 years ago

@GabeIsman @tommeagher I've now built this out. Turned out to be pretty straightforward. It's working on our own instance, too. Happy to share the code or prep a PR of some sort if you'd like… let me know.

tommeagher commented 4 years ago

@tomcardoso our motto at The Marshall Project is "PRs gladly accepted." If you have this running, we'd be happy to take a look and, if all is in order, merge it. This project really benefits from contributions from the community. If this approach is working for you, we'd love to see it. Sounds like it'd be helpful to a lot of others too.

tomcardoso commented 4 years ago

@tommeagher Pull request is now available at https://github.com/themarshallproject/klaxon/pull/318.

tommeagher commented 8 months ago

Hi all. With the recent release of Klaxon Cloud, we're going back and revisiting old issues that we're not going to pursue or support as we consider any future development of the original standalone Klaxon. This one (almost 4 years old now) falls in that bucket. Thanks for the contributions and discussions on this, but we'll close it as WONTFIX.