omegahat / XML

The XML package for R

pull in html data with extra java reveal? #8

Open hardin47 opened 8 years ago

hardin47 commented 8 years ago

I want to pull in data from 538, but I want the full data which is arrived at by clicking on "Show more polls"... Is there any way for the function to access the additional lines of the table?

http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/

The code for pulling in the top level data is:

require(XML)
polls.html <- htmlTreeParse("http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/", useInternalNodes = TRUE)
parsedDoc <- readHTMLTable(polls.html, stringsAsFactors=FALSE)
pollData <- data.frame(parsedDoc[4])

duncantl commented 8 years ago

Hi Jo

I believe the data for all 530 polls is not directly in a <table> in the HTML, so you won't find it that way. Instead, that content is dynamically constructed using the data that is contained in a script node. The following is a specific way of doing it that could be generalized if necessary.

library(XML)

url = "http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/"
doc <- htmlParse(url)

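# Find the script node whose text mentions 'race.model'.
sc = getNodeSet(doc, "//script[contains(., 'race.model')]")
# getNodeSet() returns a list of nodes; take the text of the first match.
js = xmlValue(sc[[1]])

# Keep only the object literal assigned to race.stateData.
jsobj = gsub(".*race.stateData = (.*);race.pathPrefix.*", "\\1", js)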

library(RJSONIO)

data = fromJSON(jsobj)
names(data)
length(data$polls) # The 534 we want.
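
To see what the gsub() step above is doing without hitting the live page, here is a minimal offline sketch. The js string is a made-up stand-in for the script text, and its field names (state, pollster, n) are invented for illustration; the real page's structure may well differ. The JSON parsing is guarded so the sketch still runs if RJSONIO is not installed.

```r
# Toy stand-in for the text of the script node. The contents are hypothetical;
# the real script is much longer and its fields are not known here.
js <- 'var race = {};race.model = "polls";race.stateData = {"state":"US","polls":[{"pollster":"A","n":900},{"pollster":"B","n":1200}]};race.pathPrefix = "/x/";'

# Same pattern as above: capture whatever sits between the race.stateData
# assignment and the following race.pathPrefix statement.
jsobj <- gsub(".*race.stateData = (.*);race.pathPrefix.*", "\\1", js)
print(jsobj)

# Parse the captured text as JSON only if RJSONIO is available.
if (requireNamespace("RJSONIO", quietly = TRUE)) {
  data <- RJSONIO::fromJSON(jsobj)
  print(names(data))
  print(length(data$polls))
}
```

Note that the pattern depends on race.pathPrefix coming immediately after the race.stateData assignment, so it would need adjusting if the page's script is reordered.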