Closed The-Janitor closed 3 years ago
I can confirm.
read_html(" <h1 class=\"article-title\">
<p>
Beyond imports: The supply chain effects of trade protection on export growth </p>
</h1>") %>% html_node(".article-title") %>% html_text()
gives the same result.
That's because the input is invalid HTML (you can't put a <p> inside of an <h1>), and it automatically gets transformed into something valid:
cat(as.character(xml2::read_html("
<h1 class=\"article-title\">
<p>
Beyond imports: The supply chain effects of trade protection on export growth</p>
</h1>")
))
#> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#> <html><body>
#> <h1 class="article-title">
#> </h1>
#> <p>
#> Beyond imports: The supply chain effects of trade protection on export growth</p>
#> </body></html>
Created on 2020-12-14 by the reprex package (v0.3.0.9001)
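A possible workaround, as a minimal sketch: since the parser moves the <p> out of the <h1> and leaves it as the next sibling element, an adjacent-sibling selector may recover the title text. The selector ".article-title + p" is an assumption based on the repaired output shown above. It has not been checked against the live page, where other elements could sit between the two.

```r
library(rvest)  # re-exports read_html() from xml2 and the %>% pipe

# Same invalid snippet as in the reprex above: a <p> nested inside an <h1>.
page <- read_html("<h1 class=\"article-title\">
<p>
Beyond imports: The supply chain effects of trade protection on export growth</p>
</h1>")

# After the parser repairs the markup, the <p> is the next element sibling
# of the (now empty) <h1>, so select it with the adjacent-sibling combinator.
title <- page %>%
  html_node(".article-title + p") %>%
  html_text(trim = TRUE)  # trim the newline left over from the source

title
```

If the page layout differs from this snippet, the selector would need adjusting; inspecting as.character() of the parsed document, as done above, shows where the text actually ended up.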
I've encountered an issue I have never seen before when trying to scrape a specific element from a website.
Website: https://voxeu.org/article/design-choices-central-bank-digital-currency
The structure of the element I'm trying to extract:
Reprex
which returns
[1] "\n "
If I want to extract, e.g., .article-content, there's no problem at all. It doesn't look like there is any JS involved.
Best regards, Michael