r-lib / xml2

Bindings to libxml2
https://xml2.r-lib.org/

xml2 read_html removes closing tags from JSON-LD when using a single option #373

Open sbha opened 2 years ago

sbha commented 2 years ago

With its default options, xml2::read_html(x) preserves the HTML embedded in a JSON-LD (linked data) object, as expected:

library(xml2)
library(magrittr)
library(rvest)

test_ld <- '<script type="application/ld+json">{"@context":"http://schema.org","@type":"ReproducibleExample", "description":"<p><strong>text within tags</strong>text after closing tag</p>"'

# tags preserved
test_ld %>% 
  read_html() %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

[1] "<script type=\"application/ld+json\">{\"@context\":\"http://schema.org\",\"@type\":\"ReproducibleExample\", \"description\":\"<p><strong>text within tags</strong>text after closing tag</p>\"</script>"

Here, description contains the HTML: <p><strong>text within tags</strong>text after closing tag</p>

But with xml2::read_html(x, options = 'HUGE'), or with any other single option (I've tested 5 or 6), the closing tags are removed from the HTML text inside the JSON-LD object.

# tags removed
test_ld %>% 
  read_html(options = 'HUGE') %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

# removed
test_ld %>% 
  read_html(options = "NOBLANKS") %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

# removed
test_ld %>% 
  read_html(options = '') %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

# all return:
[1] "<script type=\"application/ld+json\">{\"@context\":\"http://schema.org\",\"@type\":\"ReproducibleExample\", \"description\":\"<p><strong>text within tagstext after closing tag\"</script>"

description now becomes <p><strong>text within tagstext after closing tag, with both </strong> and </p> stripped.

Setting options is necessary for some of the HTML I'm parsing. Is it possible to use options and preserve properly formatted HTML from a linked data object?
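
To narrow down which single options trigger the stripping, a small loop over candidate values can help. This is only a sketch: the options listed are an arbitrary sample (including RECOVER on its own, which is not reported above), keeps_closing_tags is a hypothetical helper, and the result will depend on the installed xml2 and libxml2 versions.

```r
library(xml2)

test_ld <- '<script type="application/ld+json">{"@context":"http://schema.org","@type":"ReproducibleExample", "description":"<p><strong>text within tags</strong>text after closing tag</p>"'

# Hypothetical helper: TRUE if </strong> survives parsing with the given
# options, FALSE if it is stripped, NA if parsing raises an error.
keeps_closing_tags <- function(opts) {
  tryCatch({
    doc <- read_html(test_ld, options = opts)
    node <- xml_find_first(doc, '//script[@type="application/ld+json"]')
    grepl("</strong>", as.character(node), fixed = TRUE)
  }, error = function(e) NA)
}

# Arbitrary sample of single options to compare (assumption, not exhaustive).
single_opts <- c("RECOVER", "NOERROR", "NOBLANKS", "HUGE", "")
results <- vapply(single_opts, keeps_closing_tags, logical(1))
results
```

If RECOVER alone preserves the tags while the other single options do not, that would point at the absence of RECOVER rather than at the number of options passed.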

sbha commented 2 years ago

If multiple options are set, the HTML is preserved correctly:

test_ld %>% 
  read_html(options = c("RECOVER", "NOERROR", "NOBLANKS")) %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

# or
test_ld %>% 
  read_html(options = c("HUGE", "RECOVER")) %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

[1] "<script type=\"application/ld+json\">{\"@context\":\"http://schema.org\",\"@type\":\"ReproducibleExample\", \"description\":\"<p><strong>text within tags</strong>text after closing tag</p>\"</script>"

description is as it should be: <p><strong>text within tags</strong>text after closing tag</p>
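
The documented default for read_html() is options = c("RECOVER", "NOERROR", "NOBLANKS"), so passing a single option replaces those defaults rather than adding to them. A possible workaround, sketched below, is to append the extra option to the defaults. with_default_options and read_html_keep_defaults are hypothetical helper names, not part of xml2, and whether this avoids the problem on a given libxml2 version is an assumption based only on the results reported in this thread.

```r
library(xml2)

# Hypothetical helper: add extra options on top of read_html()'s
# documented defaults instead of replacing them.
with_default_options <- function(extra = character()) {
  defaults <- c("RECOVER", "NOERROR", "NOBLANKS")
  union(defaults, extra)
}

# Hypothetical wrapper around read_html() that keeps the defaults.
read_html_keep_defaults <- function(x, extra = character(), ...) {
  read_html(x, options = with_default_options(extra), ...)
}

test_ld <- '<script type="application/ld+json">{"@context":"http://schema.org","@type":"ReproducibleExample", "description":"<p><strong>text within tags</strong>text after closing tag</p>"'

doc <- read_html_keep_defaults(test_ld, extra = "HUGE")
node <- xml_find_first(doc, '//script[@type="application/ld+json"]')
as.character(node)
```

Since c("HUGE", "RECOVER") is reported above to preserve the closing tags, adding HUGE to the full default set should plausibly behave the same way, though that combination was not tested in the thread.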

hadley commented 1 year ago

I'm not sure there's much we can do here, but leaving open because I have some suspicions that something is going wrong with the way we pass the options from R to C.
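
For context on how options reach the C side, libxml2 takes parser options as a single integer bitmask. The sketch below uses flag values from libxml2's xmlParserOption enum and shows which of the option sets tried in this thread would include the RECOVER bit. option_mask and has_recover are hypothetical helpers, and how xml2 itself builds and passes the mask is an assumption here, not something this thread confirms.

```r
# Flag values taken from libxml2's xmlParserOption enum (parser.h).
flags <- c(RECOVER = 1L, NOERROR = 32L, NOBLANKS = 256L, HUGE = 524288L)

# Hypothetical helper: OR the named flags together into one bitmask.
# Empty strings (as in options = '') contribute nothing.
option_mask <- function(opts) {
  opts <- opts[nzchar(opts)]
  Reduce(bitwOr, flags[opts], 0L)
}

# Hypothetical helper: is the RECOVER bit set in the resulting mask?
has_recover <- function(opts) {
  bitwAnd(option_mask(opts), flags[["RECOVER"]]) != 0L
}

# Option sets tried in this thread.
cases <- list(
  default      = c("RECOVER", "NOERROR", "NOBLANKS"),
  huge         = "HUGE",
  noblanks     = "NOBLANKS",
  empty        = "",
  huge_recover = c("HUGE", "RECOVER")
)
recover_set <- vapply(cases, has_recover, logical(1))
recover_set
```

In this sketch the RECOVER bit is set for the default and c("HUGE", "RECOVER") cases and unset for the three single-option cases, which lines up with where tags were preserved and where they were stripped. That is one reading consistent with the reports above; it does not rule out a separate problem in how options are passed from R to C.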