salimk / Rcrawler

An R web crawler and scraper
http://www.sciencedirect.com/science/article/pii/S2352711017300110

Feature: Render javascript using splashr #42

Closed by KnutJaegersberg 6 years ago

KnutJaegersberg commented 6 years ago

I found that by changing a single line in linkextractor.R, replacing read_html with render_html from the splashr package, one can apparently crawl sites that require JavaScript, too.
The combination with this Docker image is especially interesting, since it also makes crawling through Tor an option: https://github.com/TeamHG-Memex/aquarium
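
For anyone who wants to try the idea outside the package, here is a minimal sketch of that swap. It is not Rcrawler's actual code: `fetch_page()` and `extract_links()` are hypothetical helpers, and the host, port, and wait time are assumptions (a Splash instance listening on localhost:8050, for example one started from the Docker image above).

```r
# Sketch only: switch between a static fetch and a Splash-rendered fetch.
# Assumes the xml2 package is installed; splashr and a running Splash
# instance are only needed when render_js = TRUE.
library(xml2)

# Hypothetical helper, not part of Rcrawler.
fetch_page <- function(url, render_js = FALSE,
                       splash_host = "localhost", splash_port = 8050L) {
  if (render_js) {
    # Connect to the (assumed) local Splash instance
    sp <- splashr::splash(host = splash_host, port = splash_port)
    # render_html() returns an xml2 document built after JavaScript has run
    splashr::render_html(sp, url = url, wait = 2)
  } else {
    # Plain fetch: no JavaScript is executed
    xml2::read_html(url)
  }
}

# Hypothetical helper: both branches above return xml2 documents,
# so link extraction does not need to know which one was used.
extract_links <- function(doc) {
  anchors <- xml2::xml_find_all(doc, "//a[@href]")
  unique(xml2::xml_attr(anchors, "href"))
}
```

Calling `extract_links(fetch_page(url, render_js = TRUE))` would then return the links present after rendering, while the default `render_js = FALSE` keeps the original static behaviour.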

It would be nice to see this as an option in a future version. Even better would be combining the framework with the interactivity options provided by RSelenium, but I guess that would mean larger changes. Anyway, as far as I can see this is the most advanced Scrapy competitor in the R language, and it would be nice to see it grow. Much better than rharvest.

salimk commented 6 years ago

Actually, we are working on this feature, but we have implemented it with the R webdriver package (PhantomJS WebDriver). You will be notified when it is released. Thank you.
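
To illustrate the approach mentioned in this reply, here is a rough sketch of rendering a page with the webdriver package and PhantomJS. It is not the Rcrawler implementation, which was unreleased at the time: `render_with_phantomjs()` is a hypothetical helper, and it assumes that webdriver and xml2 are installed and that PhantomJS is available (for example via `webdriver::install_phantomjs()`).

```r
# Sketch only: fetch JavaScript-rendered HTML through PhantomJS.
# Hypothetical helper, not part of Rcrawler or webdriver.
render_with_phantomjs <- function(url) {
  # Start a PhantomJS process; the result holds the process and its port
  pjs <- webdriver::run_phantomjs()
  on.exit(pjs$process$kill(), add = TRUE)

  # Open a browser session against that process
  ses <- webdriver::Session$new(port = pjs$port)
  on.exit(ses$delete(), add = TRUE, after = FALSE)

  # Navigate, then read the page source as the browser currently sees it
  ses$go(url)
  html <- ses$getSource()

  # Parse into an xml2 document so the usual XPath helpers apply
  xml2::read_html(html)
}
```

The returned document can be queried with `xml2::xml_find_all()` just like the result of a plain `read_html()` call.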

salimk commented 5 years ago

Rcrawler v0.1.9 has been released with a lot of new features. Subscribe to our mailing list to stay updated: http://eepurl.com/dMv_7s