tidyverse / rvest

Simple web scraping for R
https://rvest.tidyverse.org

html_element() cannot select itself #382

Open JosiahParry opened 8 months ago

JosiahParry commented 8 months ago

After using html_children(), the contents cannot be accessed using html_element() or html_elements().

I would not be surprised if this is user error; I'm just not sure where.

library(rvest)
html <- minimal_html(r"{
<div class="div-class">
  <h1 class="my-class">Hello</h1>
  <h2 class="subclass">World</h2>
</div>
}")

html_elements(html, ".div-class .my-class")
#> {xml_nodeset (1)}
#> [1] <h1 class="my-class">Hello</h1>

div_children <- html_elements(html, ".div-class") |> 
  html_children() 

div_children
#> {xml_nodeset (2)}
#> [1] <h1 class="my-class">Hello</h1>
#> [2] <h2 class="subclass">World</h2>

html_elements(div_children, ".my-class")
#> {xml_nodeset (0)}

Created on 2023-12-22 with reprex v2.0.2

rossellhayes commented 8 months ago

When a CSS selector is passed to html_elements(), it is converted to XPath with rvest:::make_selector(). make_selector() always prefixes the XPath with .//, which matches descendants of each context node but never the context node itself. Because div_children is the result of html_children(), the nodes you are looking for are the context nodes themselves, so they are skipped. You can sidestep rvest's prefixing by converting the selector to XPath yourself (selectr::css_to_xpath() defaults to a descendant-or-self:: prefix):

html_elements(div_children, xpath = selectr::css_to_xpath(".my-class"))
#> {xml_nodeset (1)}
#> [1] <h1 class="my-class">Hello</h1>

Created on 2023-12-22 with reprex v2.0.2

I believe this issue could be fixed in rvest by changing https://github.com/tidyverse/rvest/blob/main/R/selectors.R#L99C41-L99C41 to prefix descendant-or-self:: rather than .//. However, this test suggests that not being able to select top-level nodes is a purposeful design decision.
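
To make the difference between the two prefixes concrete, here is a minimal sketch that calls xml2::xml_find_all() directly on the same nodeset. The XPath step used here is a simplified, exact-match stand-in chosen for illustration; it is not the expression rvest or selectr actually generates.

```r
library(rvest)
library(xml2)

html <- minimal_html('
<div class="div-class">
  <h1 class="my-class">Hello</h1>
  <h2 class="subclass">World</h2>
</div>
')

div_children <- html_children(html_elements(html, ".div-class"))

# Simplified XPath step for illustration only: exact match on the class attribute.
step <- "*[@class='my-class']"

# ".//" searches strictly below each context node, so the <h1> itself is skipped.
below_only <- xml_find_all(div_children, paste0(".//", step))

# "descendant-or-self::" also tests each context node, so the <h1> is matched.
self_or_below <- xml_find_all(div_children, paste0("descendant-or-self::", step))

length(below_only)
length(self_or_below)
```

The first query comes back empty because the h1 has no element descendants, while the second returns the h1 itself.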

JosiahParry commented 8 months ago

@rossellhayes well that beats my suggestion:

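# attach each child to a temporary root so that it becomes a descendant
# (xml2:: is spelled out because only rvest is loaded above)
x <- xml2::xml_new_root("tmp")

for (child in div_children) {
  xml2::xml_add_child(x, child)
}

html_elements(x, ".my-class")

JosiahParry commented 8 months ago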

Thanks @rossellhayes. I'm wondering whether there's something more going on here that I'm not grasping, or whether this one is actually a bug, whereas the previous behavior was not, per the test you found.

Identifying nodes via the selectr package does not seem to permit removing them from the document with xml2::xml_remove(). I wonder if this is another case (or the same one) in which top-level items are treated differently.

library(xml2)
library(rvest)
html <- minimal_html(r"{
<div class="div-class">
  <h1 class="my-class">Hello</h1>
  <h2 class="subclass">World</h2>
</div>
}")

html_elements(html, ".div-class .my-class")
#> {xml_nodeset (1)}
#> [1] <h1 class="my-class">Hello</h1>

div_children <- html_elements(html, ".div-class") |> 
  html_children() 

# select using selectr
html_elements(div_children, xpath = selectr::css_to_xpath(".my-class"))
#> {xml_nodeset (1)}
#> [1] <h1 class="my-class">Hello</h1>

# remove the node using xml_remove
xml2::xml_remove(
  html_elements(div_children, xpath = selectr::css_to_xpath(".my-class"))
)

# see if it's still there
html_elements(div_children, xpath = selectr::css_to_xpath(".my-class"))
#> {xml_nodeset (1)}
#> [1] <h1 class="my-class">Hello</h1>

# repeat at the top level html
xml2::xml_remove(html_elements(html, ".div-class .my-class"))

# see if it is still there
html_elements(html, ".div-class .my-class")
#> {xml_nodeset (0)}

Created on 2023-12-22 with reprex v2.0.2

EDIT: ignore me. It seems free = TRUE must be set when it's a subset of nodes.

# remove the node using xml_remove
xml2::xml_remove(
  html_elements(div_children, xpath = selectr::css_to_xpath(".my-class")),
  free = TRUE
)
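
One possible reading of the result above, offered as an assumption rather than a confirmed diagnosis: with the default free = FALSE, xml_remove() detaches the node from the document but does not destroy it, so an R nodeset created before the removal can still reach it. A hedged sketch that checks the document itself instead of the old nodeset:

```r
library(rvest)
library(xml2)

html <- minimal_html('
<div class="div-class">
  <h1 class="my-class">Hello</h1>
  <h2 class="subclass">World</h2>
</div>
')

div_children <- html_children(html_elements(html, ".div-class"))

# Select the top-level node with the selectr workaround from earlier in the thread.
target <- html_elements(div_children, xpath = selectr::css_to_xpath(".my-class"))

# Default free = FALSE: detach the node from the document without freeing it.
xml_remove(target)

# Query the document, not the pre-removal nodeset, to see whether it is gone.
remaining <- html_elements(html, ".my-class")
length(remaining)

# The sibling that was not removed is still part of the document.
length(html_elements(html, ".subclass"))
```

If that reading is right, free = TRUE is not required for the removal to take effect. The xml2 documentation warns that freeing a node while other references to it exist can cause memory errors, so it is worth using with care.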

hadley commented 7 months ago

html_elements() selects child (descendant) elements and is designed not to select the elements themselves (otherwise recursing over a document can become very tricky). This is probably worth a clarifying sentence in the docs.
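
Given that design, here is a sketch of two ways to get the same node without asking html_elements() to match its own input. Both are illustrative suggestions rather than anything recommended in this thread: the first selects from the parent with the CSS child combinator, and the second filters the nodeset on its class attribute.

```r
library(rvest)

html <- minimal_html('
<div class="div-class">
  <h1 class="my-class">Hello</h1>
  <h2 class="subclass">World</h2>
</div>
')

# Option 1: start one level up and use the child combinator, so the target
# is a descendant of the context node rather than the context node itself.
via_parent <- html_elements(html, ".div-class > .my-class")

# Option 2: keep the nodeset from html_children() and subset it by attribute.
# %in% avoids NA when a node has no class attribute. Note this is an exact
# match on the whole attribute value, unlike a CSS class selector.
div_children <- html_children(html_elements(html, ".div-class"))
via_filter <- div_children[html_attr(div_children, "class") %in% "my-class"]

html_text(via_parent)
html_text(via_filter)
```

Option 1 is usually the simpler choice when the parent is still at hand. Option 2 suits cases where only the child nodeset is available, but it would miss elements that carry several classes.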