hrbrmstr / splashr

:sweat_drops: Tools to Work with the 'Splash' JavaScript Rendering Service in R

Can't pull multiple pages #4

Closed DavisBrian closed 7 years ago

DavisBrian commented 7 years ago

I'm trying to update a scraper I had for a site that switched to JavaScript rendering (games.crossfit.com). I've successfully used splashr to scrape a single page. The problem, most likely user error, is that I can't reliably scrape multiple "pages" at the site.

MWE

library(tidyverse)
library(lubridate)
library(stringr)
library(rvest)
library(splashr)

# set up the splashr docker container (pulls the Splash image the first time)
install_splash()
splash_svr <- start_splash()
pond <- splash("localhost")

# test to see if the server is active:
pond %>% splash_active()

base_url <- "https://games.crossfit.com/leaderboard?competition=1&year=2017&division=1&scaled=0&sort=0&fittest=1&fittest1=0&occupation=0&page="

url1 <- paste0(base_url, 1)
url2 <- paste0(base_url, 2)

page1 <- render_html(pond, url1) 
page2 <- render_html(pond, url2) 

# get the athletes names as test
# sometimes both work, sometimes only one works, and sometimes neither does
head(html_text(html_nodes(page1, css = "td .full-name")))
head(html_text(html_nodes(page2, css = "td .full-name")))

I'm new to web scraping, so more than likely I'm simply going about this the wrong way.

hrbrmstr commented 7 years ago

I'll take a look at that in a minute (literally :-)

In the meantime, that site is backed by an API. As you deduced, it builds the table with JavaScript, but it gets the data for the table via an XHR request that is pretty easily wrapped:

library(httr)
library(jsonlite)
library(tidyverse)

get_leaderboard <- function(year=2017L, division=1L, page=1L) {

  httr::GET(url = "https://games.crossfit.com",
            path = "/competitions/api/v1/competitions/open/2017/leaderboards",
            query = list(competition=1L, year=year, division=division, scaled=0,
                         sort=0, fittest=1, fittest1=0, occupation=0, page=page)) -> res

  res <- httr::stop_for_status(res)

  res <- httr::content(res, as="text", encoding="UTF-8")

  res <- jsonlite::fromJSON(res, simplifyDataFrame = TRUE, flatten = TRUE)

  res_cols <- cols(
    name = col_character(), userid = col_character(),
    overallrank = col_integer(), overallscore = col_integer(),
    regionid = col_integer(), region = col_character(),
    affiliateid = col_character(), affiliate = col_character(),
    height = col_character(), weight = col_character(),
    profilepic = col_character()
  )

  res$athletes <- readr::type_convert(res$athletes, col_types = res_cols)
  res$athletes <- tibble::as_tibble(res$athletes)

  res

}

It's a "paged" API, and here's how to call that ^^ to get the total pages (and to see what it returns):

first <- get_leaderboard()

first$currentpage
## [1] 1

first$totalpages
## [1] 3877

glimpse(first$athletes)
## Observations: 50
## Variables: 15
## $ name         <chr> "Matt Mcleod", "Jonathan Gibson", "Travis William...
## $ userid       <chr> "660204", "105058", "14960", "42717", "6190", "83...
## $ overallrank  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ overallscore <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ regionid     <int> 3, 5, 14, 5, 16, 10, 3, 13, 16, 3, 3, 12, 17, 16,...
## $ region       <chr> "Australia", "Canada West", "South Central", "Can...
## $ affiliateid  <chr> "2218", "0", "3092", "12721", "11838", "3040", "8...
## $ affiliate    <chr> "CrossFit Crossaxed", "Unaffiliated", "U Can Cros...
## $ age          <int> 24, 27, 25, 31, 30, 28, 24, 29, 34, 27, 26, 31, 2...
## $ highlight    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ height       <chr> "168 cm", "6'2\"", "5'9\"", "6'0\"", "5'5\"", "6'...
## $ weight       <chr> "80 kg", "220 lb", "202 lb", "190 lb", "175 lb", ...
## $ profilepic   <chr> "https://profilepicsbucket.crossfit.com/pukie.png...
## $ division     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ scores       <list> [<c("1", "--", "--", "--", "--"), c("--", "--", ...

Now, it's just a matter of retrieving all of them and binding the data frames together:

n <- 3 # set this to the value from totalpages (see the notes below this code block)

pb <- progress_estimated(n)
map_df(1:n, function(pg) {
  pb$tick()$print() # progress meter when used interactively
  res <- get_leaderboard(page = pg)
  res <- res$athletes
  Sys.sleep(sample(3:7, 1)) # pause 3-7 seconds; pretend you're a human so you don't get IP banned
  res
}) -> athletes

glimpse(athletes)
## Observations: 150
## Variables: 15
## $ name         <chr> "Matt Mcleod", "Jonathan Gibson", "Travis William...
## $ userid       <chr> "660204", "105058", "14960", "42717", "6190", "83...
## $ overallrank  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ overallscore <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ regionid     <int> 3, 5, 14, 5, 16, 10, 3, 13, 16, 3, 3, 12, 17, 16,...
## $ region       <chr> "Australia", "Canada West", "South Central", "Can...
## $ affiliateid  <chr> "2218", "0", "3092", "12721", "11838", "3040", "8...
## $ affiliate    <chr> "CrossFit Crossaxed", "Unaffiliated", "U Can Cros...
## $ age          <int> 24, 27, 25, 31, 30, 28, 24, 29, 34, 27, 26, 31, 2...
## $ highlight    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ height       <chr> "168 cm", "6'2\"", "5'9\"", "6'0\"", "5'5\"", "6'...
## $ weight       <chr> "80 kg", "220 lb", "202 lb", "190 lb", "175 lb", ...
## $ profilepic   <chr> "https://profilepicsbucket.crossfit.com/pukie.png...
## $ division     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ scores       <list> [<c("1", "--", "--", "--", "--"), c("--", "--", ...

NOTE that the scores column is a list column. It appears to hold the scores shown in those 17.1, 17.2 (etc.) table fields on the site. You can likely tidyr::unnest() it.
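A minimal sketch of that, assuming each scores element is a character vector of per-workout scores (inspect the structure first, since the nesting may differ):

library(tidyverse)

athletes %>%
  select(name, userid, scores) %>%
  # wrap each score vector in a small data frame so the workout number
  # survives the unnest
  mutate(scores = map(scores, ~tibble(workout = seq_along(.x), score = .x))) %>%
  tidyr::unnest(scores)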

I really wouldn't change the Sys.sleep(…) line, as the site is likely checking for non-human use and you're getting this data for free, so suffering a bit of time delay seems like a fair trade.

The Terms of Service link was not clickable for me in Chrome on macOS (none of the bottom links were, strangely enough). I point that out since, if it is clickable for you, you should definitely read it to make sure "scraping" or accessing this data via the API is legal/permissible/ethical.

Even though I suggested changing n to the API-derived max pages, I'd likely do this in chunks, save each chunk off to an RDS file, then put it all back together. ~4K pages is quite a bit, and internet scraping is full of hiccups; designing the scraping component in a friendly way will save you time and them bandwidth/CPU. You're likely to trigger an automated-access warning even with the Sys.sleep(…), so this is also a safer way to go in the event your IP does get time-banned.
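A rough sketch of that chunked approach, reusing get_leaderboard() from above (the chunk size and file names are just placeholders):

library(tidyverse)

pages      <- seq_len(first$totalpages)
chunk_size <- 100  # arbitrary; pick whatever feels manageable
chunks     <- split(pages, ceiling(seq_along(pages) / chunk_size))

walk(seq_along(chunks), function(i) {
  fil <- sprintf("leaderboard_chunk_%03d.rds", i)
  if (file.exists(fil)) return(invisible(NULL))  # resume-friendly: skip completed chunks
  map_df(chunks[[i]], function(pg) {
    res <- get_leaderboard(page = pg)
    Sys.sleep(sample(3:7, 1))  # keep being polite
    res$athletes
  }) -> chunk_df
  saveRDS(chunk_df, fil)
})

# put it all back together
athletes <- map_df(sort(list.files(pattern = "^leaderboard_chunk_.*\\.rds$")), readRDS)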

With ^^ in mind, I'd also likely wrap GET with purrr::safely() and deal with the modified return value accordingly, to handle cases where API calls fail.
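A minimal sketch of that, again assuming the get_leaderboard() defined above:

library(tidyverse)

safe_leaderboard <- purrr::safely(get_leaderboard)

res <- safe_leaderboard(page = 1)

if (is.null(res$error)) {
  athletes_page <- res$result$athletes
} else {
  # log the failure and retry later rather than aborting the whole run
  message("page 1 failed: ", res$error$message)
}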

Extending the above API function wrapper to handle all of the parameters available in the leaderboard's filter controls (screenshot omitted) [w|sh]ould not be too hard.
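A sketch of that extension, which just promotes the currently hard-coded query values to arguments (the parameter names come from the query string used above; note the path also hard-codes 2017, so changing year may mean changing the path too):

library(httr)
library(jsonlite)

get_leaderboard2 <- function(year = 2017L, division = 1L, scaled = 0L,
                             sort = 0L, fittest = 1L, fittest1 = 0L,
                             occupation = 0L, page = 1L) {

  httr::GET(url = "https://games.crossfit.com",
            path = "/competitions/api/v1/competitions/open/2017/leaderboards",
            query = list(competition = 1L, year = year, division = division,
                         scaled = scaled, sort = sort, fittest = fittest,
                         fittest1 = fittest1, occupation = occupation,
                         page = page)) -> res

  httr::stop_for_status(res)

  # same response handling as the original version above
  res <- httr::content(res, as = "text", encoding = "UTF-8")
  jsonlite::fromJSON(res, simplifyDataFrame = TRUE, flatten = TRUE)
}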

hrbrmstr commented 7 years ago

W/r/t the actual problem you had: the issue was not giving the page time to render. The XHR request takes a tiny bit of time, and they use a JavaScript library that is a bit slow on the rendering side, so you have to pause before taking the HTML snapshot:

library(rvest)
library(splashr)
library(tidyverse)

n <- 3 # it's 3877 as shown in the previous reply block but I'm not waiting around for that :-)

url_template <- "https://games.crossfit.com/leaderboard?competition=1&year=2017&division=1&scaled=0&sort=0&fittest=1&fittest1=0&occupation=0&page=%s"

url_vec <- sprintf(url_template, 1:n)

splash_active()
## [1] TRUE

pb <- progress_estimated(n)
map_df(url_vec, function(cf_url) {

  pb$tick()$print()

  # NOTE: it's likely render_html(cf_url, wait=1.5) would also work here 
  #       but I used this as an opportunity to exercise the new DSL functions

  splash_local %>%
    splash_go(cf_url) %>%
    splash_wait(1.5) %>% # give the page time to load the XHR data and render the table
    splash_html() -> pg

  html_nodes(pg, "table.athletes") %>%
    html_table(header = TRUE) %>%
    .[[1]] %>%
    set_names(c("pos", "name", "total_pts", "x1", "x2", "x3", "x4", "x5")) %>%
    as_tibble()

}) -> scraped_df

glimpse(scraped_df)
## Observations: 150
## Variables: 8
## $ pos       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ name      <chr> "Matt McleodAustralia24168 cm80 kgView Profile", "Jo...
## $ total_pts <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ x1        <chr> "1(10:28)225 repsJudged by Oliver Wilsonat CrossFit ...
## $ x2        <chr> "--(--)", "--(--)", "--(--)", "--(--)", "--(--)", "-...
## $ x3        <chr> "--(--)", "--(--)", "--(--)", "--(--)", "--(--)", "-...
## $ x4        <chr> "--(--)", "--(--)", "--(--)", "--(--)", "--(--)", "-...
## $ x5        <chr> "--(--)", "--(--)", "--(--)", "--(--)", "--(--)", "-...

You'd likely not be able to use the html_table(…) output directly, given how badly name and the various x columns are formatted.

You also don't have foreknowledge about the number of pages with this method, so you'd first have to scrape the last page number from the table's navigation/paging footer (screenshot omitted), along the lines of the sketch below.
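A hedged sketch of that; ".pagination a" is a hypothetical selector you'd need to verify against the real footer markup:

library(rvest)
library(splashr)

splash_local %>%
  splash_go(sprintf(url_template, 1)) %>%
  splash_wait(1.5) %>%
  splash_html() -> pg

# inspect the actual footer markup in Developer Tools and adjust the selector
html_nodes(pg, ".pagination a") %>%
  html_text() %>%
  readr::parse_number() %>%  # non-numeric links ("Next", etc.) become NA
  max(na.rm = TRUE) -> n_pages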

I'd go the API route, especially since it's much friendlier to their servers (it only has to return the API content, not all the images, etc.) 😀

hrbrmstr commented 7 years ago

Forgot to add that you can find the XHR requests via Developer Tools in your browser, or you can use splashr to find them:

library(splashr)
library(tidyverse)

url_template <- "https://games.crossfit.com/leaderboard?competition=1&year=2017&division=1&scaled=0&sort=0&fittest=1&fittest1=0&occupation=0&page=%s"
cf_url <- sprintf(url_template, 1)

splash_active()
## [1] TRUE

splash_local %>%
  splash_response_body(TRUE) %>%
  splash_go(cf_url) %>%
  splash_wait(1.5) %>%
  splash_har() -> cf_har

cf_har %>%
  har_entries() %>%
  keep(is_xhr) %>%
  keep(is_json) %>%
  map_chr(c("request", "url"))
## [1] "https://games.crossfit.com/cf/global-menu?login"                                                                                                                               
## [2] "https://games.crossfit.com/cf/global-menu?exercise"                                                                                                                            
## [3] "https://games.crossfit.com/cf/global-menu?games"                                                                                                                               
## [4] "https://games.crossfit.com/cf/global-menu?journal"                                                                                                                             
## [5] "https://games.crossfit.com/cf/global-menu?affiliates"                                                                                                                          
## [6] "https://games.crossfit.com/competitions/api/v1/competitions/open/2017?expand[]=controls&expand[]=workouts"                                                                     
## [7] "https://games.crossfit.com/competitions/api/v1/competitions/open/2017/leaderboards?competition=1&year=2017&division=1&scaled=0&sort=0&fittest=1&fittest1=0&occupation=0&page=1"
## [8] "https://games.crossfit.com/competitions/api/v1/competitions/open/2017/leaderboards?competition=1&year=2017&division=1&scaled=0&sort=0&fittest=1&fittest1=0&occupation=0&page=1"

hrbrmstr commented 7 years ago

One more thing!

I figured out what was causing the "Terms & Conditions" link to not be clickable. The HTML that creates the "shell" of the site includes a <div> that intercepts mouse movement and clicks.

This is the T&C URL: https://games.crossfit.com/cf/terms-and-conditions

I won't delete this issue, since I did try to get to the T&C before helping and couldn't, but it was nagging at me that they would have those link-ish things in the footer and not make them clickable.

Unfortunately (for you) it contains:

No Site content may be modified, distributed, framed, copied, reproduced, republished, downloaded, scraped, displayed, posted, transmitted, licensed, bartered, leased or sold in any form or by any means, in whole or in part, other than as expressly permitted in these Terms of Use or as expressly authorized in writing by CrossFit.

So, I'll just ask that you not use my code or packages in performing an — at worst — illegal and — at least — unethical action.

DavisBrian commented 7 years ago

Thanks for the extensive response. I wasn't aware of the T&C. This is really just a smallish hobby project (mainly for the people at our gym), and I wasn't intending to scrape everything. I was hoping it fell under fair use, but I'll need to think about it.