tidyverse / rvest

Simple web scraping for R
https://rvest.tidyverse.org

Can't extract element from specific website #285

Closed The-Janitor closed 3 years ago

The-Janitor commented 4 years ago

I've encountered an issue I have never seen before when trying to scrape a specific element from a website.

Website: https://voxeu.org/article/design-choices-central-bank-digital-currency

The structure of the element I'm trying to extract:

 <h1 class="article-title">
    <p>
    Beyond imports: The supply chain effects of trade protection on export growth   </p>
    </h1>

Reprex

library(rvest)

sub_page <- read_html("http://voxeu.org/article/design-choices-central-bank-digital-currency")

title <- sub_page %>%
  html_nodes(".article-title") %>%
  html_text()

which returns [1] "\n "

If I want to extract, e.g., .article-content, there's no problem at all. It doesn't look like any JS is involved.

Best regards Michael

xwhitelight commented 4 years ago

I can confirm.

read_html(" <h1 class=\"article-title\">
    <p>
    Beyond imports: The supply chain effects of trade protection on export growth   </p>
    </h1>") %>% html_node(".article-title") %>% html_text()

gives the same result.

hadley commented 3 years ago

That's because the input is invalid HTML (you can't put a <p> inside an <h1>), so it automatically gets transformed into something valid:

cat(as.character(xml2::read_html("
  <h1 class=\"article-title\">
  <p>
  Beyond imports: The supply chain effects of trade protection on export growth</p>
  </h1>")
))
#> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#> <html><body>
#> <h1 class="article-title">
#>   </h1>
#> <p>
#>   Beyond imports: The supply chain effects of trade protection on export growth</p>
#>   </body></html>

Created on 2020-12-14 by the reprex package (v0.3.0.9001)
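
Given the repaired structure shown above, one possible workaround is to select the <p> that now sits directly after the emptied <h1> instead of looking inside it. This is a minimal sketch against the fragment from this thread, not a confirmed fix for the live page: it assumes the parser always moves the <p> to be the next sibling of the <h1>, and the selector ".article-title + p" is my own suggestion rather than anything proposed in the issue.

```r
library(rvest)

# Same invalid fragment as in the reprex above.
doc <- read_html("
  <h1 class=\"article-title\">
  <p>
  Beyond imports: The supply chain effects of trade protection on export growth</p>
  </h1>")

# Assumption: after the parser repairs the markup, the <p> is the element
# immediately following the (now empty) <h1>, so the adjacent-sibling
# combinator "+" can reach it.
title <- doc %>%
  html_node(".article-title + p") %>%
  html_text(trim = TRUE)  # trim = TRUE strips the surrounding whitespace

title
```

On the real page the repaired layout may differ, so it is worth printing as.character(sub_page) first to check where the title text actually ends up.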