hrbrmstr / decapitated

Headless 'Chrome' Orchestration in R

About this endeavour #1

Open hrbrmstr opened 7 years ago

hrbrmstr commented 7 years ago

With proper "headless Chrome" being "a thing" now (https://developers.google.com/web/updates/2017/04/headless-chrome), Chrome 59+ on anyone's system can be instrumented either from the command line or via the DevTools protocol. Note that:

At the moment, Phantom also provides a higher level API than the DevTools Protocol.

is on the linked web page, so I'm expecting the Chrome team to provide direct "webdriver" support or a higher-level JS API like PhantomJS has.
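For the command-line route, a minimal sketch of what a one-shot headless run looks like (Python here purely for illustration, not the eventual R interface). The flags are the ones described on the linked page; the helper name `headless_chrome_args` and the default binary name `google-chrome` are assumptions and will differ per system.

```python
def headless_chrome_args(url, chrome_bin="google-chrome", mode="dump-dom", debug_port=None):
    """Build the argv list for a headless Chrome run (hypothetical helper).

    chrome_bin is an assumed binary name; adjust it for the local install.
    """
    args = [chrome_bin, "--headless", "--disable-gpu"]
    if debug_port is not None:
        # Keep Chrome running and expose the DevTools protocol on this port.
        args.append(f"--remote-debugging-port={debug_port}")
    elif mode == "dump-dom":
        args.append("--dump-dom")       # print the rendered DOM to stdout
    elif mode == "screenshot":
        args.append("--screenshot")     # write a screenshot file
    elif mode == "pdf":
        args.append("--print-to-pdf")   # write a PDF file
    else:
        raise ValueError(f"unknown mode: {mode}")
    args.append(url)
    return args
```

Passing the resulting list to something like `subprocess.run(args, capture_output=True, text=True)` would return the rendered DOM on stdout in `dump-dom` mode, provided a suitable Chrome is installed under that name.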

Enabling individual R users to "just use" their own instance of Chrome removes obstacles like Docker (tho this is a good image: https://github.com/ebidel/lighthouse-ci/blob/master/builder/Dockerfile) or virtual machines from the equation, so I'm unlikely to go down the Docker/VM route. I'm also not keen on building a version of Chrome with "R" in it or R hooks in it, since that means One More Thing to download.

Once/if webdriver support is added, this pkg might be moot. There's no guarantee for webdriver support tho.

Shorter-term goals are:

Longer-term goal is:

Depending on how much time I have (or if others want to pile on!) getting the Chrome DevTools protocol working for instrumentation is a goal. It looks event-oriented and may mean dealing with C[++] or C-wrapped R callbacks OR making an R orchestration DSL that translates into DevTools protocol "commands" and then just getting the result.
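To make the "orchestration DSL that translates into DevTools protocol commands" idea concrete, here is a rough sketch (in Python for illustration; an R version would look different). The class `DevToolsScript` and its verbs `go` and `evaluate` are hypothetical names, not anything the package provides. The method names `Page.navigate` and `Runtime.evaluate`, and their parameters, come from the DevTools protocol.

```python
import itertools
import json


class DevToolsScript:
    """Collect high-level steps and render them as DevTools protocol messages."""

    def __init__(self):
        self._ids = itertools.count(1)  # each command needs a unique id
        self.messages = []

    def _add(self, method, **params):
        self.messages.append(
            {"id": next(self._ids), "method": method, "params": params}
        )
        return self  # return self so steps can be chained

    def go(self, url):
        # Hypothetical DSL verb mapped onto the Page.navigate command.
        return self._add("Page.navigate", url=url)

    def evaluate(self, expression):
        # Hypothetical DSL verb mapped onto Runtime.evaluate;
        # returnByValue asks for the result as plain JSON.
        return self._add("Runtime.evaluate", expression=expression, returnByValue=True)

    def to_json(self):
        """Return one JSON string per command, ready to send over the websocket."""
        return [json.dumps(m) for m in self.messages]
```

Something like `DevToolsScript().go("https://example.com").evaluate("document.title").to_json()` would then yield two JSON strings to send in order. The sketch deliberately ignores events: it only covers the "send a command, get the result" half of the problem.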

I personally only care about getting content back out, so unless someone who cares more about detailed instrumentation for creating, say, a test framework for htmlwidgets jumps on, I'm solely focused on enabling easier JS-based web-scraping (like I did with the splashr pkg).

hrbrmstr commented 7 years ago

(keeping running notes here)

Did a bit more investigation and I think the R DSL makes the most sense.

Am likely going to wrap https://github.com/dhbaird/easywsclient and see if I can't bang out something half-usable in short order.

Basic tests with wscat show it's super-easy to create the proper JSON DevTools websocket function calls that return immediate responses/JSON values. For the core "data gathering" tasks that would be the primary purpose of this pkg in R, such functionality is pretty straightforward.
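As a sketch of the "call returns an immediate JSON value" pattern (Python for illustration only): a reply carries the same `id` as the command that triggered it, and a `Runtime.evaluate` reply made with `returnByValue` nests the value under `result.result.value`. The helper name `extract_value` is hypothetical, and the reply string in the usage note is hand-written to show the shape, not captured output.

```python
import json


def extract_value(request_id, raw_reply):
    """Return the value from a Runtime.evaluate reply, or None if not ours.

    Raises RuntimeError if the reply reports a protocol error.
    """
    reply = json.loads(raw_reply)
    if reply.get("id") != request_id:
        # Either an event (no id) or a reply to a different command.
        return None
    if "error" in reply:
        raise RuntimeError(reply["error"].get("message", "unknown DevTools error"))
    return reply["result"]["result"].get("value")
```

For instance, given the illustrative reply `{"id": 2, "result": {"result": {"type": "string", "value": "Example Domain"}}}`, calling `extract_value(2, reply)` would hand back the page title string, while an event message such as `{"method": "Page.loadEventFired", "params": {...}}` would be skipped.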