tidyverse / rvest

Simple web scraping for R
https://rvest.tidyverse.org

rvest fails to parse HTML page from google scholar; returns `xml_nodeset (0)` #383

Closed Rohit-Satyam closed 8 months ago

Rohit-Satyam commented 8 months ago

Hi @hadley

rvest v1.0.3 was working just fine a few hours ago and then suddenly stopped working. I was trying to scrape some paper titles from Google Scholar using a gene ID, and it was doing a good job, but now I get `xml_nodeset (0)`.

    library(rvest)  # attached so that the %>% pipe is available

    # Build the search URL for the gene ID
    url <- paste0("https://scholar.google.com/scholar?q=", "PF3D7_0420300")

    # Fetch the HTML content of the search results page
    page <- rvest::read_html(url)

    # Extract the paper titles
    titles <- page %>%
      rvest::html_nodes(".gs_rt a") %>%
      rvest::html_text() %>%
      trimws()

Even the code chunk given here returns empty results:

[image]

Rohit-Satyam commented 8 months ago

I tried rvest on my laptop and it's working there. This is weird. Is it related to multiple queries coming from a single IP?

hadley commented 8 months ago

It's almost certainly because some automated system is trying to block you from scraping.
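
One way to check whether an empty `xml_nodeset (0)` comes from a block page rather than from a wrong selector is to look at the text of the page that was actually returned. The following is only a minimal sketch: `extract_titles()` is a hypothetical helper, not part of rvest, and the two HTML snippets are made-up stand-ins, not real Google Scholar markup.

```r
library(rvest)

# Hypothetical helper (not an rvest function): return the matched titles,
# or stop with a preview of the page text when the selector matches nothing.
extract_titles <- function(page, selector = ".gs_rt a") {
  nodes <- html_elements(page, selector)
  if (length(nodes) == 0) {
    # The start of the page text usually makes a block/CAPTCHA page easy to spot
    preview <- substr(trimws(html_text(page)), 1, 200)
    stop("No nodes matched '", selector, "'. Page text starts with: ", preview)
  }
  trimws(html_text(nodes))
}

# Offline illustration with made-up HTML (assumed markup, for demonstration only)
ok_page <- minimal_html('<h3 class="gs_rt"><a href="#">Example paper title</a></h3>')
blocked_page <- minimal_html("<p>Please show you are not a robot</p>")

titles <- extract_titles(ok_page)

# Capture the diagnostic message instead of aborting the script
res <- tryCatch(
  extract_titles(blocked_page),
  error = function(e) conditionMessage(e)
)
```

With a live page, the same helper could be called on the result of `read_html(url)`. If the preview shows a robot check or similar notice instead of search results, the selector is probably fine and the request itself was blocked.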