nrennie / shakespeare

CSV files of the works of William Shakespeare.

Looping through plays to scrape all at once #6

Open havishak opened 1 month ago

havishak commented 1 month ago

Hi @nrennie,

This is a cool dataset, good work! As I was reviewing the code, I wondered whether you could loop through the plays and scrape them all at once. Something like this:

library(polite)
library(rvest)
library(tidyverse)

# start scraping session 
website_link <- "https://shakespeare.mit.edu/"
start_session <- bow(website_link)

all_works <- scrape(start_session) %>%
    html_table() %>% 
    .[[2]] %>% #it's the second table, and making tidy
    rename(
        "comedy" = "X1", 
        "history" = "X2",
        "tragedy" = "X3",
        "poetry" = "X4"
    ) %>%
    filter(comedy != "Comedy") %>%
    pivot_longer(
        cols = everything(),
        names_to = "genre",
        values_to = "works"
    ) %>%
    separate_longer_delim(cols = works, delim = "\n") %>%
    filter(!works %in% c("", "The")) %>%
    mutate(
        works = ifelse(works == "Merry Wives of Windsor", "The Merry Wives of Windsor", works),
        # now, get the links for the works from the same table
        work_link = scrape(start_session) %>%
            html_elements("table") %>%
            .[[2]] %>%
            html_elements("a") %>%
            html_attr("href"),
        # prepend the website address to get full links
        work_link = paste0(website_link, work_link)
    )

# full play scripts exist for everything except 'Poetry'

plays_to_scrape <- all_works %>%
    filter(genre != "poetry") %>%
    # point each link at the full-play page instead of the index page
    # (fixed = TRUE matches the text literally rather than as a regex)
    mutate(work_link = gsub("/index.html", "/full.html", work_link, fixed = TRUE))

# loop through plays_to_scrape
# (extract_data() is assumed here to be the repo's existing per-play scraping function)

map(plays_to_scrape$work_link, extract_data)
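
If one play fails to scrape, a plain map() call stops the whole loop. A minimal sketch of one way around that is to wrap the function with purrr::possibly(). Note that extract_data_stub() below is a hypothetical stand-in for extract_data(), not the repo's function, and the links (including the "broken" one) are illustrative only. Nothing here touches the network.

```r
library(purrr)

# Hypothetical stand-in for extract_data(): returns a small data frame per
# link and deliberately fails on one link to show the error handling.
extract_data_stub <- function(link) {
  if (grepl("broken", link, fixed = TRUE)) {
    stop("could not scrape: ", link)
  }
  data.frame(work_link = link, stringsAsFactors = FALSE)
}

# possibly() returns `otherwise` on error instead of aborting the loop
safe_extract <- possibly(extract_data_stub, otherwise = NULL)

# Illustrative links; in practice this would be plays_to_scrape$work_link
links <- c(
  "https://shakespeare.mit.edu/hamlet/full.html",
  "https://shakespeare.mit.edu/broken/full.html",
  "https://shakespeare.mit.edu/macbeth/full.html"
)

# Name each result by its link so failures can be traced back
results <- map(set_names(links), safe_extract)

failed  <- names(keep(results, is.null))  # links that errored
scraped <- compact(results)               # successful results only
```

The failed vector can then be retried or inspected separately without re-scraping the plays that already worked.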

nrennie commented 1 month ago

Hi @havishak

Thanks for this! 🎉 Yes, you definitely could do something like this (and it is on my list of things to do), but there were a few reasons I haven't done it yet:

Thanks for the suggestion and code - I'll likely come back to this issue when I'm ready to run it on everything so I'll leave it open 🎉

A couple of notes:

havishak commented 1 month ago

Thanks for the notes!