Open JosiahParry opened 8 months ago
When a CSS selector is passed to html_elements(), it is converted to XPath with rvest:::make_selector(). make_selector() always prefixes the XPath with .//, which means it can find nodes at all levels except for the top level. Because div_children is the result of html_children(), the nodes in question are top-level. To work around that, you can bypass rvest's prefixing by handling the conversion to XPath outside the function:
html_elements(div_children, xpath = selectr::css_to_xpath(".my-class"))
#> {xml_nodeset (1)}
#> [1] <h1 class="my-class">Hello</h1>
Created on 2023-12-22 with reprex v2.0.2
I believe this issue could be fixed in rvest by changing https://github.com/tidyverse/rvest/blob/main/R/selectors.R#L99C41-L99C41 to prefix descendant-or-self:: rather than .//. However, this test suggests that not being able to select top-level nodes is a purposeful design decision.
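To make the difference between the two prefixes concrete, here is a minimal sketch that converts the same selector both ways and runs each XPath against the top-level nodes. It relies on the prefix argument of selectr::css_to_xpath() and rebuilds div_children from the HTML used in this thread; the variable names are only illustrative, and the .// prefix is passed by hand rather than taken from rvest's internals.

```r
library(rvest)
library(xml2)

# Same structure as the example document in this thread
html <- minimal_html('
  <div class="div-class">
    <h1 class="my-class">Hello</h1>
    <h2 class="subclass">World</h2>
  </div>
')
div_children <- html_children(html_elements(html, ".div-class"))

# The prefix rvest is described as using: matches strictly below each node
xpath_below <- selectr::css_to_xpath(".my-class", prefix = ".//")
# selectr's default prefix: also matches the context node itself
xpath_self <- selectr::css_to_xpath(".my-class", prefix = "descendant-or-self::")

found_below <- xml_find_all(div_children, xpath_below)
found_self <- xml_find_all(div_children, xpath_self)

length(found_below) # expected 0: the <h1> is itself a top-level node
length(found_self)  # expected 1: the <h1> matches as "self"
```

If the explanation above holds, only the descendant-or-self:: version finds the h1, because the h1 is the context node rather than something below it.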
@rossellhayes well that beats my suggestion:
x <- xml_new_root("tmp")
for (child in div_children) {
  xml_add_child(x, child)
}
html_elements(x, ".my-class")
Thanks @rossellhayes. I'm wondering whether there's something more going on here that I'm not grasping, or whether the following is an actual bug, whereas the previous behavior was not, per your findings in the test.
Nodes identified using the selectr package do not seem to be removable from the document with xml2::xml_remove(). I wonder if this is another case (or the same one) in which top-level items are treated differently.
library(xml2)
library(rvest)
html <- minimal_html(r"{
<div class="div-class">
<h1 class="my-class">Hello</h1>
<h2 class="subclass">World</h2>
</div>
}")
html_elements(html, ".div-class .my-class")
#> {xml_nodeset (1)}
#> [1] <h1 class="my-class">Hello</h1>
div_children <- html_elements(html, ".div-class") |>
  html_children()
# select using selectr
html_elements(div_children, xpath = selectr::css_to_xpath(".my-class"))
#> {xml_nodeset (1)}
#> [1] <h1 class="my-class">Hello</h1>
# remove the node using xml_remove
xml2::xml_remove(
  html_elements(div_children, xpath = selectr::css_to_xpath(".my-class"))
)
# see if it's still there
html_elements(div_children, xpath = selectr::css_to_xpath(".my-class"))
#> {xml_nodeset (1)}
#> [1] <h1 class="my-class">Hello</h1>
# repeat at the top level html
xml2::xml_remove(html_elements(html, ".div-class .my-class"))
# see if it is still there
html_elements(html, ".div-class .my-class")
#> {xml_nodeset (0)}
EDIT: ignore me. It seems free = TRUE must be set when it's a subset of nodes:
# remove the node using xml_remove
xml2::xml_remove(
  html_elements(div_children, xpath = selectr::css_to_xpath(".my-class")),
  free = TRUE
)
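One possible reading of the earlier result, offered as an assumption rather than a confirmed explanation: xml_remove() with the default free = FALSE detaches the node from the document, but div_children was created beforehand and still holds a reference to the detached node, so a descendant-or-self:: query on div_children keeps matching it. A minimal sketch that checks the document itself instead of the previously captured nodeset:

```r
library(rvest)
library(xml2)

html <- minimal_html('
  <div class="div-class">
    <h1 class="my-class">Hello</h1>
    <h2 class="subclass">World</h2>
  </div>
')
div_children <- html_children(html_elements(html, ".div-class"))

# Select the top-level <h1> via selectr and remove it (default free = FALSE)
target <- html_elements(div_children, xpath = selectr::css_to_xpath(".my-class"))
xml_remove(target)

# Query the document, not the nodeset captured before the removal
remaining_in_doc <- html_elements(html, ".my-class")
length(remaining_in_doc)

# Re-derive the children from the document to get a fresh view
fresh_children <- html_children(html_elements(html, ".div-class"))
html_name(fresh_children)
```

The xml2 documentation for xml_remove() warns that with free = TRUE, existing objects pointing to the removed node should not be used afterwards, so re-querying the document may be the safer way to confirm a removal.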
html_elements() selects child elements and is designed not to select the elements themselves (otherwise recursing over a document can become very tricky). This is probably worth a clarifying sentence in the docs.
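A small sketch of that behavior, using a hypothetical nested-div document (the ids are made up for illustration): selecting div from inside the outer div returns only the nested one, not the outer div itself.

```r
library(rvest)

html <- minimal_html('
  <div id="outer">
    <div id="inner"><p>text</p></div>
  </div>
')
outer <- html_element(html, "#outer")

# The outer <div> matches the selector "div", but it is the context node,
# so only the <div> nested inside it is returned
nested <- html_elements(outer, "div")
nested_ids <- html_attr(nested, "id")
nested_ids
```

If the context node were included, a function that calls html_elements(node, "div") on each of its own results would keep revisiting the same node, which is presumably the recursion problem being referred to.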
After using html_children(), the contents cannot be accessed using html_element() or html_elements(). I would not be surprised if this is user error; I'm just not sure where.