Open sbha opened 2 years ago
If multiple options are set, the HTML is correct:
test_ld %>%
read_html(options = c("RECOVER", "NOERROR", "NOBLANKS")) %>%
html_node('script[type="application/ld+json"]') %>%
as.character()
# or
test_ld %>%
read_html(options = c("HUGE", "RECOVER")) %>%
html_node('script[type="application/ld+json"]') %>%
as.character()
[1] "<script type=\"application/ld+json\">{\"@context\":\"http://schema.org\",\"@type\":\"ReproducibleExample\", \"description\":\"<p><strong>text within tags</strong>text after closing tag</p>\"</script>"
description is as it should be:
<p><strong>text within tags</strong>text after closing tag</p>
I'm not sure there's much we can do here, but I'm leaving this open because I suspect something is going wrong in the way we pass the options from R to C.
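To make the "options from R to C" step concrete, here is a minimal base-R sketch of how a vector of option names can be collapsed into the single integer bitmask that a C parser expects. This is not xml2's actual code: `flag_values`, `combine_flags()` and `has_flag()` are hypothetical names used for illustration, and the numeric values are a small subset of libxml2's `xmlParserOption` enum.

```r
# Subset of libxml2 parser option bits (values from the xmlParserOption enum).
flag_values <- c(RECOVER = 1L, NOERROR = 32L, NOBLANKS = 256L, HUGE = 524288L)

# Hypothetical helper: validate the option names and OR their bits together.
combine_flags <- function(opts, values = flag_values) {
  opts <- match.arg(opts, names(values), several.ok = TRUE)
  Reduce(bitwOr, values[opts], 0L)
}

# Hypothetical helper: test whether a given option bit is set in a mask.
has_flag <- function(mask, name, values = flag_values) {
  bitwAnd(mask, values[[name]]) != 0L
}

combine_flags(c("RECOVER", "NOERROR", "NOBLANKS"))  # 289
combine_flags("HUGE")                               # 524288
combine_flags(c("HUGE", "RECOVER"))                 # 524289

has_flag(combine_flags("HUGE"), "RECOVER")                # FALSE
has_flag(combine_flags(c("HUGE", "RECOVER")), "RECOVER")  # TRUE
```

One thing the mask makes visible is that a single-option call such as options = 'HUGE' supplies the whole option vector, so the RECOVER bit is not set unless it is listed explicitly. Whether that accounts for the behaviour reported in this issue is an assumption, not something confirmed in this thread.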
xml2::read_html(x)
returns the HTML within a linked data (JSON-LD) object as expected, where description contains the HTML:
<p><strong>text within tags</strong>text after closing tag</p>
But if using
xml2::read_html(x, options = 'HUGE')
or any other single option (I've tested 5 or 6), the closing tags are removed from the HTML text in the JSON-LD object. description now becomes:
<p><strong>text within tagstext after closing tag
Setting options is necessary for some of the HTML I'm parsing. Is it possible to use options and preserve properly formatted HTML from a linked data object?
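A possible workaround, sketched under the assumption that the multi-option calls shown in this thread behave as reported: add the extra option to a base set instead of passing it alone. The helpers `with_defaults()` and `read_html_keep_defaults()` and the sample `html` string are hypothetical, and the base set simply mirrors the three options used in the working example.

```r
# Base option set taken from the working multi-option example in this thread.
default_opts <- c("RECOVER", "NOERROR", "NOBLANKS")

# Hypothetical helper: add extra options without dropping the base set.
with_defaults <- function(extra = character()) {
  union(default_opts, extra)
}

# Hypothetical wrapper around xml2::read_html() that always keeps the base set.
read_html_keep_defaults <- function(x, extra = character()) {
  xml2::read_html(x, options = with_defaults(extra))
}

with_defaults("HUGE")

# Illustrative input only; not the test_ld object from this issue.
html <- paste0(
  '<html><head><script type="application/ld+json">',
  '{"description":"<p><strong>text within tags</strong>text after closing tag</p>"}',
  '</script></head></html>'
)

if (requireNamespace("xml2", quietly = TRUE)) {
  doc <- read_html_keep_defaults(html, "HUGE")
  print(xml2::xml_text(xml2::xml_find_first(doc, "//script")))
}
```

The wrapper keeps the call site short (only the extra option is named) while making sure no call ends up with a single-option vector.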