tidyverse / rvest

Simple web scraping for R
https://rvest.tidyverse.org

`read_html_live` needs some time after returning its result to allow `html_elements` to work properly #428

Open Feat-FeAR opened 3 weeks ago

Feat-FeAR commented 3 weeks ago

Thank you, first of all, for developing this useful package. Today I ran into strange behavior with read_html_live(): if I run my script line by line from RStudio, slowly, html_elements() retrieves the elements from the HTML page correctly. If I source the script (or run the lines individually, but quickly!), html_elements() just returns NAs, as if the contents of the object returned by read_html_live() were not yet available, even though the variable is already stored in the global environment.

Here is my minimal reproducible example, which retrieves the best percentile of 'F1000Research' from the Scopus website. (I need to scrape because this information is not provided by the API.)

This just returns NAs:

library(rvest)

journal_url <- "https://www.scopus.com/sourceid/21100258853"
page <- read_html_live(journal_url)
page |> html_elements("td:nth-child(1) div") |> html_text() -> category
best_category <- category[2]
page |> html_elements("td:nth-child(3) div div") |> html_text() -> percent
best_percentile <- percent[3]
cat("Category:", best_category, "\nPercentile:", best_percentile)

However this works (even when sourcing the entire script):

library(rvest)

journal_url <- "https://www.scopus.com/sourceid/21100258853"
page <- read_html_live(journal_url)

Sys.sleep(1) # <----- just give the page some time to render

page |> html_elements("td:nth-child(1) div") |> html_text() -> category
best_category <- category[2]
page |> html_elements("td:nth-child(3) div div") |> html_text() -> percent
best_percentile <- percent[3]
cat("Category:", best_category, "\nPercentile:", best_percentile)
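A fixed Sys.sleep(1) may be too short on a slow connection and longer than needed on a fast one. As a rough sketch of an alternative, the extraction can be retried until it yields enough elements. Note that indexing past the end of a vector in R gives NA, which would be consistent with the NAs above if the selectors simply matched nothing yet. The helper name `poll_until_ready` and its parameters (`min_length`, `timeout`, `interval`) are hypothetical and not part of rvest; this assumes the content eventually appears in the live page.

```r
# Hypothetical helper (not part of rvest): call `extract()` repeatedly until it
# returns at least `min_length` non-NA values, or stop with an error once
# `timeout` seconds have elapsed.
poll_until_ready <- function(extract, min_length = 1, timeout = 10, interval = 0.25) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- extract()
    if (length(result) >= min_length && !anyNA(result)) {
      return(result)
    }
    if (Sys.time() >= deadline) {
      stop("Timed out after ", timeout, " seconds waiting for page content.")
    }
    Sys.sleep(interval)
  }
}

# Possible usage with the example above (needs rvest and network access):
# page <- read_html_live(journal_url)
# category <- poll_until_ready(
#   function() page |> html_elements("td:nth-child(1) div") |> html_text(),
#   min_length = 2
# )
# best_category <- category[2]
```

Setting `min_length` to the highest index used later (2 for `category`, 3 for `percent`) avoids returning a partially rendered table, though it cannot guarantee that the page has finished loading.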

¯(°_o)/¯

My sessionInfo:

> sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default