tidyverse / rvest

Simple web scraping for R
https://rvest.tidyverse.org

read_html_live() practical implementation #397

Closed rcepka closed 7 months ago

rcepka commented 8 months ago

Hello, thank you for this excellent package and for its newest addition, read_html_live(). This was a much-needed feature for scraping JavaScript-based websites. I don't fully understand how this new function works, and I am trying to figure out how to fit it into my scraping workflow. So if it isn't outside the scope of your regular user support, I would appreciate your advice on this topic.
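For context, a minimal read_html_live() workflow (a sketch, assuming a recent rvest with the chromote package installed; the URL and the `.title` selector are placeholders, not from the original post) might look like this:

```r
library(rvest)

# read_html_live() starts a headless Chrome session via chromote and
# returns a live page object that renders JavaScript before you query it
sess <- read_html_live("https://example.com/js-rendered-page")

# The usual rvest verbs work on the live session
titles <- sess |>
  html_elements(".title") |>   # placeholder selector
  html_text2()

# The object also exposes interaction methods for dynamic pages, e.g.:
# sess$scroll_into_view(".load-more")
# sess$click(".load-more")
```

The key difference from read_html() is that the returned object holds an open browser session, so the page can continue to change after the initial load.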

What I currently expect from my web scraping solution is mainly this:

Below is my simplified code, the way I am doing it now:


scrape_page <- function(link, usr_agent, scraping_repeat, ...) {

    # Random pause before each request (bounds defined globally)
    sleep_time <- runif(1, sys_sleep_time_from, sys_sleep_time_to)
    Sys.sleep(sleep_time)

    # Set initial values; response_code must be initialised before
    # the while condition reads it on the first iteration
    response <- NULL
    response_code <- 0L
    attempts <- 1

    #
    # Main loop
    #

    while (response_code != 200 && attempts <= scraping_repeat) {

      # Rotate proxy and user agent before each "GET"
      proxy_number <- get_proxy_number(proxies_list = proxies_list, proxy_selection = proxy_selection)
      usr_agent <- sample(user_agents_list, 1)

      tryCatch({
        response <- GET(
          link,
          user_agent(usr_agent),
          use_proxy(
            url = proxies_list$address[proxy_number],
            port = as.numeric(proxies_list$port[proxy_number]),
            username = proxies_list$username[proxy_number],
            password = proxies_list$pass[proxy_number]
          )
        )

        response_code <- response$status_code

      },
      # Error handling: log and fall through to the retry logic below
      error = function(e) {
        logger::log_error("Fun scrape_page:  The page could not be scraped, link: {link}")
      }
      )

      # Back off and retry if needed; the wait time grows linearly
      # with the number of attempts
      if (response_code != 200) {
        attempts <- attempts + 1
        wait_time <- scraping_repeat_wait_time * attempts
        Sys.sleep(wait_time)
      }

    #
    # End of main loop
    #
    }

    return(response)

  }
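A hypothetical call to the function above might look as follows (a sketch only: `proxies_list`, `user_agents_list`, the sleep bounds, `proxy_selection`, and `scraping_repeat_wait_time` are all assumed to exist in the calling environment, as in the original code, and the URL is a placeholder):

```r
# Hypothetical usage of scrape_page(); all globals it reads are assumed
# to be defined elsewhere in the scraping project
response <- scrape_page(
  link            = "https://example.com/page-1",
  usr_agent       = sample(user_agents_list, 1),
  scraping_repeat = 3
)

# scrape_page() returns NULL (or a non-200 response) after exhausting
# its retries, so guard before parsing
if (!is.null(response) && httr::status_code(response) == 200) {
  page <- rvest::read_html(response)
}
```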

My questions:

Many thanks in advance for any advice, hints or opinions...

hadley commented 8 months ago
  1. Ability to change user agents is tracked in #388
  2. It looks like using a proxy requires setting some command line flags. That's going to require quite a lot of plumbing, so is unlikely to be something I tackle until a few people have requested it.
  3. I'm currently not sure how we'll expose browser errors to R.
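For readers who need a proxy with a live session today, one possible workaround is to pass Chrome's own command-line flag when creating the chromote session. This is an untested sketch under two assumptions: that chromote's `Chrome$new()` accepts extra `args`, and that rvest picks up the default chromote object; the proxy address is a placeholder.

```r
library(chromote)

# Assumption: launch Chrome with Chrome's --proxy-server flag and make
# this session the default, so read_html_live() may reuse it
chrome <- Chrome$new(args = c("--proxy-server=http://myproxy.example:8080"))
set_default_chromote_object(Chromote$new(browser = chrome))

sess <- rvest::read_html_live("https://example.com")
```

Note that authenticated proxies are harder, since Chrome prompts for credentials rather than accepting them in a flag; that is part of the extra plumbing mentioned above.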