Error with xml_serialize()/xml_unserialize() roundtrip: Opening and ending tag mismatch [PATCH] #407

HenrikBengtsson commented 11 months ago


The xml_serialize()/xml_unserialize() roundtrip fails with: "Opening and ending tag mismatch: link line 12 and head [76]"

I'd expect a roundtrip to always work.

Reproducible Example

doc <- xml2::read_html("https://www.r-project.org")
doc
#> {html_document}
#> <html lang="en">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body>\n    <div class="container page">\n      <div class="row">\n       ...

raw <- xml2::xml_serialize(doc, connection = NULL)
doc2 <- xml2::xml_unserialize(raw)
#> Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
#>   Opening and ending tag mismatch: link line 12 and head [76]


> traceback()
4: read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html, 
       options = options)
3: read_xml.character(unclass(object), ...)
2: read_xml(unclass(object), ...)
1: xml2::xml_unserialize(raw)
HenrikBengtsson commented 11 months ago

Same example using the example HTML file that comes with the package:

file <- system.file("extdata", "r-project.html", package = "xml2")
doc <- xml2::read_html(file)
class(doc)
#> [1] "xml_document" "xml_node"

raw <- xml2::xml_serialize(doc, connection = NULL)
doc2 <- xml2::xml_unserialize(raw)
#> Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
#>  Opening and ending tag mismatch: link line 12 and head [76]

Now, the problem seems to be with the xml_document class; it works with the xml_nodeset class:

children <- xml2::xml_children(doc)
class(children)
#> [1] "xml_nodeset"

raw <- xml2::xml_serialize(children, connection = NULL)
children2 <- xml2::xml_unserialize(raw)
children2
#> {xml_nodeset (2)}
#> [1] <head>\n  <meta http-equiv="Content-Type" content="text/html; charset=UTF ...
#> [2] <body>\n  <div class="container page">\n    <div class="row">\n      <div ...

all.equal(children2, children)
#> [1] TRUE
HenrikBengtsson commented 11 months ago

I think it's because xml_unserialize() attempts to read it as an XML file and not as an HTML file in:


We get the same error message if we try:

file <- system.file("extdata", "r-project.html", package = "xml2")
doc <- xml2::read_xml(file)
#> Error in read_xml.character(file) : 
#>   Opening and ending tag mismatch: link line 14 and head [76]
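
For comparison, here is a minimal sketch (assuming the same bundled example file) suggesting that the file parses fine once the HTML parser is requested via `as_html = TRUE`, the same argument that appears in the traceback above. The variable names `res_xml` and `doc_html` are only illustrative.

```r
library(xml2)

file <- system.file("extdata", "r-project.html", package = "xml2")

# Try the XML parser; per the error reported above, this is expected to
# fail on this file, so the condition is captured instead of thrown.
res_xml <- tryCatch(read_xml(file), error = function(e) e)
inherits(res_xml, "error")

# Ask for the HTML parser instead; this is the mode that
# xml_unserialize() would need for HTML documents.
doc_html <- read_xml(file, as_html = TRUE)
class(doc_html)
```

If the second call succeeds while the first one errors, that would be consistent with the idea that the document type is lost somewhere in the roundtrip.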
HenrikBengtsson commented 11 months ago

One solution for this is to have xml_serialize.xml_document() also record the document type and then have xml_unserialize() set as_html accordingly. Here's a working patch:

$ git diff -u R/xml_serialize.R
diff --git a/R/xml_serialize.R b/R/xml_serialize.R
index 3f7357f..74e1608 100644
--- a/R/xml_serialize.R
+++ b/R/xml_serialize.R
@@ -22,7 +22,7 @@ xml_serialize.xml_document <- function(object, connection, ...) {
     connection <- file(connection, "w", raw = TRUE)
-  serialize(structure(as.character(object, ...), class = "xml_serialized_document"), connection)
+  serialize(structure(as.character(object, ...), doc_type = doc_type(object), class = "xml_serialized_document"), connection)

 #' @export
@@ -64,7 +64,13 @@ xml_unserialize <- function(connection, ...) {
     # Select only the root
     res <- xml_find_first(x, "/node()")
   } else if (inherits(object, "xml_serialized_document")) {
-    res <- read_xml(unclass(object), ...)
+    read_xml_int <- function(object, as_html = FALSE, ...) {
+      if (missing(as_html)) {
+        as_html <- identical(attr(object, "doc_type", exact = TRUE), "html")
+      }
+      read_xml(unclass(object), as_html = as_html, ...)
+    }
+    res <- read_xml_int(unclass(object), ...)
   } else {
     stop("Not a serialized xml2 object", call. = FALSE)

I've submitted this patch in PR #408.
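
Until a fix is available, one possible workaround is to skip xml_serialize()/xml_unserialize() for HTML documents and roundtrip the character representation manually. This is only a sketch: `html_to_raw()` and `html_from_raw()` are hypothetical helper names, not part of xml2, and it assumes the document was originally read with read_html().

```r
library(xml2)

# Hypothetical helpers (not part of xml2): serialize the document as text
# and re-parse it explicitly with the HTML parser.
html_to_raw <- function(doc) {
  serialize(as.character(doc), connection = NULL)
}
html_from_raw <- function(raw) {
  read_html(unserialize(raw))
}

file <- system.file("extdata", "r-project.html", package = "xml2")
doc <- read_html(file)

raw <- html_to_raw(doc)
doc2 <- html_from_raw(raw)

# Compare the top-level structure of the original and the restored document
xml_name(xml_children(doc))
xml_name(xml_children(doc2))
```

The trade-off is that the caller has to know the payload is HTML, which is exactly the information the patch above stores in the `doc_type` attribute.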