tomcardoso closed 8 months ago
@GabeIsman @tommeagher I've now built this out. Turned out to be pretty straightforward. It's working on our own instance, too. Happy to share the code or prep a PR of some sort if you'd like… let me know.
@tomcardoso our motto at The Marshall Project is "PRs gladly accepted." If you have this running, we'd be happy to take a look and, if all is in order, merge it. This project really benefits from contributions from the community. If this approach is working for you, we'd love to see it. Sounds like it'd be helpful to a lot of others too.
@tommeagher Pull request is now available at https://github.com/themarshallproject/klaxon/pull/318.
Hi all. With the recent release of Klaxon Cloud, we're going back and revisiting old issues that we're not going to pursue or support as we consider any future development of the original standalone Klaxon. This one (almost 4 years old now) falls in that bucket. Thanks for the contributions and discussions on this, but we'll close it as WONTFIX.
In my newsroom (in Canada), we're currently scraping a few COVID-19 provincial pages. One of them uses Angular to render the page, so Klaxon as it is currently isn't going to work. This will require a headless browser.
We have 100+ pages being watched for all sorts of reasons right now, so I don't think it makes sense to just rip out the current `Net::HTTP`-based scraper and replace it with something like `webdrivers`. The scrape would become far too slow.

Instead, I'm thinking of making some tweaks that would allow for an optional headless browser scrape on a per-page level. All it would take is:

- A `use_headless_browser` logical field
- A change to the `html` method in the Pages model to optionally use a headless browser if that field is true

I can probably write this this weekend, we can dogfood it, and if it works I can submit a PR. Would that be of interest?
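For anyone skimming, here is a rough sketch of what the per-page switch could look like. It is not the code from the PR. The `Page` class below is a plain-Ruby stand-in for the real ActiveRecord model, the two fetcher lambdas and their names are hypothetical, and the headless path assumes the `selenium-webdriver` gem plus a local Chrome install.

```ruby
require 'net/http'
require 'uri'

# Hypothetical stand-in for Klaxon's Page model (the real one is ActiveRecord-backed).
class Page
  # Plain HTTP fetch, standing in for the existing Net::HTTP-based scrape.
  PLAIN_FETCHER = ->(url) { Net::HTTP.get(URI(url)) }

  # Headless fetch. Assumption: selenium-webdriver and Chrome are available.
  HEADLESS_FETCHER = lambda do |url|
    require 'selenium-webdriver' # loaded lazily so plain pages never pay for it
    options = Selenium::WebDriver::Chrome::Options.new
    options.add_argument('--headless')
    driver = Selenium::WebDriver.for(:chrome, options: options)
    begin
      driver.get(url)
      driver.page_source # HTML after the browser has run the page's JavaScript
    ensure
      driver.quit
    end
  end

  attr_reader :url, :use_headless_browser

  # The fetchers are injectable so the switch can be exercised without a network or browser.
  def initialize(url:, use_headless_browser: false,
                 plain_fetcher: PLAIN_FETCHER, headless_fetcher: HEADLESS_FETCHER)
    @url = url
    @use_headless_browser = use_headless_browser
    @plain_fetcher = plain_fetcher
    @headless_fetcher = headless_fetcher
  end

  # Only pages flagged with use_headless_browser take the slower browser path.
  def html
    fetcher = use_headless_browser ? @headless_fetcher : @plain_fetcher
    fetcher.call(url)
  end
end
```

With this shape, the 100+ existing pages keep the fast path by default and only flagged pages start a browser. One caveat: `page_source` is read right after the page loads, so a page that renders asynchronously may need an explicit wait before the HTML is captured.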