Open elgarteo opened 4 years ago
Two other changes from my customized code might also be useful: 1) detecting whether the post contains any member-only content; 2) fetching the last level of the quoted text. The following lines go into .scrape_page():
## get_member_only: TRUE if the post contains a member-only block
private <- html %>% rvest::html_nodes("._36ZEkSvpdj_igmog0nluzh") %>%
  rvest::html_node("div div div ._2cNsJna0_hV8tdMj3X6_gJ") %>%
  rvest::html_node("._2yeBKooY3VAK8NLhM4Esov") %>%
  rvest::html_text() %>%
  is.na() %>%
  magrittr::not()  # namespaced so it works without attaching magrittr
## get_quote: text of the last (innermost) quote level, NA if the post has no quote
quote <- html %>% rvest::html_nodes("._36ZEkSvpdj_igmog0nluzh") %>%
  rvest::html_node("div div div > ._31B9lsqlMMdzv-FSYUkXeV > *:last-child") %>%
  rvest::html_text()
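To illustrate why the lines above call html_nodes() first and html_node() second, here is a minimal sketch on a toy page. The `.post` and `.member-only` classes are placeholders I made up for the example, not the forum's real class names. Calling html_node() on a set of nodes should return one result per post, with a missing value where nothing matches, so the output stays aligned with the posts.

```r
library(rvest)

# Toy page: two posts, only the second has a member-only block.
# ".post" and ".member-only" are placeholder classes (assumptions).
demo <- rvest::read_html('
  <div class="post"><p>public text</p></div>
  <div class="post"><p class="member-only">hidden</p></div>
')

# One entry per post; posts without a match yield NA.
block_text <- demo %>%
  rvest::html_nodes(".post") %>%
  rvest::html_node(".member-only") %>%
  rvest::html_text()

# TRUE where a member-only block was found.
private <- !is.na(block_text)
```

Using html_nodes() for the inner selector instead would drop the posts with no match, and the result would no longer line up with the other per-post columns.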
Thanks again for your work. I’ve been using this package extensively over the past few months, and I even customized the code to fit my use case. One of the changes I made is to the page iteration. I found that skipping pages based on detecting the next-page button doesn’t seem very reliable, so I modified it to detect the last page from the pagination menu and iterate up to that number. The “Empty last page” error no longer seems necessary: I haven’t hit it once while scraping over 400k posts with the modified code.
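The idea of reading the last page from the pagination menu could be sketched roughly as follows. This is only an illustration under assumptions: `get_last_page()` is a hypothetical helper, and the `.pagination a` selector and the toy HTML are placeholders rather than the forum's real markup or the package's actual code.

```r
library(rvest)

# Hypothetical helper: return the highest page number shown in a pagination menu.
# The default selector is a placeholder, not the forum's real class.
get_last_page <- function(html, selector = ".pagination a") {
  labels <- html %>%
    rvest::html_nodes(selector) %>%
    rvest::html_text(trim = TRUE)
  # Non-numeric labels such as "..." or "Next" become NA and are dropped.
  pages <- suppressWarnings(as.integer(labels))
  pages <- pages[!is.na(pages)]
  # A thread with no pagination menu is treated as a single page.
  if (length(pages) == 0) 1L else max(pages)
}

# Toy pagination menu standing in for the first page of a thread.
demo <- rvest::read_html(
  '<div class="pagination"><a>1</a><a>2</a><a>...</a><a>17</a><a>Next</a></div>'
)
last_page <- get_last_page(demo)

for (page in seq_len(last_page)) {
  # Fetch and scrape page number `page` here.
}
```

Because the loop bound is known up front, the scraper never requests a page past the end, which is presumably why the empty-last-page check stops triggering.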
My customized code differs substantially from master, so I’m applying the new method to your code and posting it here instead of making a pull request. Please test it and feel free to adopt it if you think it’s useful.