rstudio / webinars

Code and slides for RStudio webinars
https://resources.rstudio.com/webinars

zero list #53

Open ehsannu opened 6 years ago

ehsannu commented 6 years ago

I want to download all the links/titles of papers from the web using rvest. I used the following script, but the resulting list has length zero. Any suggestions?

library(rvest)

# Download the HTML and turn it into an XML document with read_html()
Papers <- read_html("https://papers.ssrn.com/sol3/JELJOUR_Results.cfm?npage=1&form_name=journalBrowse&journal_id=1475407&Network=no&lim=false")

# Extract specific nodes with html_nodes()
Titles <- html_nodes(Papers, "span.optClickTitle")

gueyenono commented 6 years ago
library(rvest)

webpage <- read_html("https://papers.ssrn.com/sol3/JELJOUR_Results.cfm?npage=1&form_name=journalBrowse&journal_id=1475407&Network=no&lim=false")

title <- webpage %>%
    html_nodes(".optClickTitle") %>%
    html_text()

links <- webpage %>%
    html_nodes(".optClickTitle") %>%
    html_attr("href")

info <- data.frame(title, links)
info
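
To confirm the result is no longer a zero-length list, a quick sanity check (base R only):

# Count the rows and preview the first few titles and links
nrow(info)
head(info)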
ehsannu commented 6 years ago

Thanks a lot! It works, but it only scrapes the records from the first page. Any suggestions?

gueyenono commented 6 years ago

You want everything for all 219 pages?

ehsannu commented 6 years ago

Yes!

gueyenono commented 6 years ago

Be aware that the code is going to run for quite a while. I recommend exporting the resulting data frame right away as a CSV file (or whatever format you prefer); see the example after the code below.

library(rvest)
library(purrr)

scrape_paper_info <- function(link){

    # Parse one results page and keep the nodes that hold both title and link
    node_of_interest <- read_html(link) %>%
        html_nodes(".optClickTitle")

    data.frame(
        title = html_text(node_of_interest),
        link = html_attr(node_of_interest, "href"),
        stringsAsFactors = FALSE  # keep titles/links as character vectors
    )
}

links <- paste0("https://papers.ssrn.com/sol3/JELJOUR_Results.cfm?npage=",
                1:219,
                "&form_name=journalBrowse&journal_id=1475407&Network=no&lim=false")

paper_info <- map_dfr(links, scrape_paper_info)
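
As recommended above, export the data frame as soon as the loop finishes; a minimal sketch using base R's write.csv() (the file name is just a placeholder):

# Save the scraped data immediately so a failure later doesn't cost you the run
write.csv(paper_info, "ssrn_paper_info.csv", row.names = FALSE)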