justinchuntingho / LIHKGr

R Scraper for LIHKG, the Hong Kong version of Reddit.
GNU General Public License v3.0

Page iteration #8

Open elgarteo opened 4 years ago

elgarteo commented 4 years ago

Thanks for your work again. I’ve been using this package extensively over the past few months and have even customized the code to fit my use case. One of the changes I made is to the page iteration. I found that skipping pages based on detecting the next-page button isn’t very reliable, so I modified it to detect the last page number from the pagination menu and iterate up to that. The “Empty last page” error no longer seems necessary: I haven’t hit it after scraping more than 400k posts with the modified code.
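
To show the idea in isolation, here is a minimal sketch of reading the last page number from a pagination drop-down. The markup, the `.pager` class and the `get_last_page()` helper are made up for illustration and are not LIHKG's real markup or part of the package. The sketch also assumes that a thread without a pagination menu should count as a single page.

```r
library(rvest)

# Hypothetical pagination markup; the real site uses obfuscated class names.
page <- xml2::read_html('
  <select class="pager">
    <option value="1">Page 1</option>
    <option value="2">Page 2</option>
    <option value="3">Page 3</option>
  </select>')

# Hypothetical helper: read the value of the last <option> in the menu.
# html_node() returns a "missing" node when nothing matches, so the value
# becomes NA; that case is treated as a single page here (an assumption).
get_last_page <- function(html, selector = ".pager option:last-child") {
  value <- html %>% html_node(selector) %>% html_attr("value")
  last_page <- suppressWarnings(as.integer(value))
  if (is.na(last_page)) 1L else last_page
}

get_last_page(page)
```

Knowing the page count up front means the loop can be a plain `for (i in 2:last_page)` rather than probing for a next-page button on every page.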

My customized code varies substantially from master so I’m applying the new method on your code and posting it here instead of making a pull request. Please test and feel free to adopt it if you think it’s useful.

.scrape_post <- function(postid, remote_driver, verbose) {
  # Page 1
  attempt <- 1
  notdone <- TRUE
  while (notdone && attempt <= 4) { # Auto restart when fails
    .print_v(paste0("Attempt: ", attempt), verbose = verbose)
    attempt <- attempt + 1
    try({
      html <- .crack_it(paste0("https://lihkg.com/thread/", postid, "/page/1"), remote_driver)
      titlewords <- html %>% rvest::html_nodes("._2k_IfadJWjcLJlSKkz_R2- span") %>% rvest::html_text() %>% length()
      if (titlewords == 1) {
        notdone <- FALSE
        warning <- tibble::tibble(number = "ERROR", date = "ERROR", uid = "ERROR", probation = "ERROR", text = "ERROR", upvote = "ERROR", downvote = "ERROR", postid = postid, title = "Deleted Post", board = "ERROR", collection_time = Sys.time())
        .print_v("Empty Post, Skipping",  verbose = verbose)
        return(warning)
      }
      .print_v("Crawling page 1", verbose = verbose)
      post <- .scrape_page(html, postid)
      notdone <- FALSE
      .lay_low()
    })
  } # End of While Loop
  if (notdone && attempt > 4) {
    stop("Error, Stopping")
  }
  # Check total number of pages
  last_page <- html %>% rvest::html_node("._1H7LRkyaZfWThykmNIYwpH option:last-child") %>%
    rvest::html_attr("value") %>%
    as.numeric()
  # Page 1 only (a missing pagination menu yields NA and is treated the same way)
  if (is.na(last_page) || last_page == 1) {
    .print_v("Finished crawling page 1 (last page)", verbose = verbose)
    return(post)
  }
  # Page 2+
  posts <- post
  .print_v("Finished crawling page 1 (to be continued)", verbose = verbose)
  for (i in 2:last_page) {
    attempt <- 1
    notdone <- TRUE
    while (notdone && attempt <= 4) { # Auto restart when fails
      .print_v(paste0("Attempt: ", attempt), verbose = verbose)
      attempt <- attempt + 1
      try({
        html <- .crack_it(paste0("https://lihkg.com/thread/", postid, "/page/", i), remote_driver)
        .print_v(paste0("Crawling page ", i, " of ", last_page), verbose = verbose)
        post <- .scrape_page(html, postid)
        posts <- dplyr::bind_rows(posts, post)
        notdone <- FALSE
        .lay_low()
      })
    } # End of While Loop
    if (notdone && attempt > 4) {
      stop("Error, Stopping")
    }
    if (i == last_page) {
      .print_v(paste0("Finished crawling page ", i, " (last page)"), verbose = verbose)
    } else {
      .print_v(paste0("Finished crawling page ", i, " (to be continued)"), verbose = verbose)
    }
  }
  posts
}
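
The retry loop appears twice in the function above, so it could be factored out. The following is only a sketch: `with_retry()` is a hypothetical helper, not something that exists in the package. It returns the result of the first successful call, so the caller can append rows only after a page has been scraped successfully. That would avoid adding the same page twice if a later step inside the `try()` block fails and triggers another attempt.

```r
# Hypothetical helper: call `fn` until it succeeds, at most `max_attempts` times.
with_retry <- function(fn, max_attempts = 4) {
  for (attempt in seq_len(max_attempts)) {
    result <- try(fn(), silent = TRUE)
    if (!inherits(result, "try-error")) {
      return(result)  # first successful result wins
    }
  }
  stop("All ", max_attempts, " attempts failed")
}

# Usage with a deliberately flaky function that fails twice, then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  "ok"
}
out <- with_retry(flaky)
```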

elgarteo commented 4 years ago

Here are two other changes from my customized code that might be useful: 1) detecting whether a post contains any member-only content; 2) fetching the innermost level of the quoted text. The following lines go into .scrape_page():

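##get_member_only?
# TRUE when the member-only marker is present in a reply, FALSE otherwise
private <- html %>% rvest::html_nodes("._36ZEkSvpdj_igmog0nluzh") %>%
    rvest::html_node("div div div ._2cNsJna0_hV8tdMj3X6_gJ") %>%
    rvest::html_node("._2yeBKooY3VAK8NLhM4Esov") %>%
    rvest::html_text() %>%
    is.na() %>%
    magrittr::not() # not() comes from magrittr, so it is namespaced here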
##get_quote
quote <- html %>% rvest::html_nodes("._36ZEkSvpdj_igmog0nluzh") %>%
    rvest::html_node("div div div > ._31B9lsqlMMdzv-FSYUkXeV > *:last-child") %>%
    rvest::html_text()
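
Both snippets rely on the same pattern: `html_nodes()` selects every reply, then `html_node()` looks inside each reply and returns exactly one result per reply, with `NA` where nothing matches. That keeps the output aligned with the other per-reply columns. A minimal sketch with made-up markup (the `.reply` and `.quote` classes are hypothetical stand-ins for the obfuscated class names):

```r
library(rvest)

# Two hypothetical replies: the first quotes two levels, the second quotes nothing.
page <- xml2::read_html('
  <div class="reply">
    <div class="quote"><span>outer</span><span>inner</span></div>
    <p>my answer</p>
  </div>
  <div class="reply"><p>no quote here</p></div>')

replies <- page %>% html_nodes(".reply")

# One value per reply: the last child of the quote block, or NA if there is none.
last_quote <- replies %>%
  html_node(".quote > *:last-child") %>%
  html_text()

# Presence flag derived from the NA pattern, as in the member-only check.
has_quote <- !is.na(last_quote)
```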