r-lib / xml2

Bindings to libxml2
https://xml2.r-lib.org/

Error with xml_serialize()/xml_unserialize() roundtrip: Opening and ending tag mismatch [PATCH] #407

Closed HenrikBengtsson closed 10 months ago

HenrikBengtsson commented 11 months ago

Issue

The xml_serialize()/xml_unserialize() roundtrip fails with: "Opening and ending tag mismatch: link line 12 and head [76]"

I'd expect a roundtrip to always work.

Reproducible Example

doc <- xml2::read_html("https://www.r-project.org")
doc
#> {html_document}
#> <html lang="en">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body>\n    <div class="container page">\n      <div class="row">\n       ...

raw <- xml2::xml_serialize(doc, connection = NULL)
doc2 <- xml2::xml_unserialize(raw)
#> Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
#>   Opening and ending tag mismatch: link line 12 and head [76]

Traceback:

> traceback()
4: read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html, 
       options = options)
3: read_xml.character(unclass(object), ...)
2: read_xml(unclass(object), ...)
1: xml2::xml_unserialize(raw)

Session Info

```r
> devtools::session_info()
─ Session info ─────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       Ubuntu 22.04.3 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Los_Angeles
 date     2023-10-03
 pandoc   3.1.7 @ /home/henrik/shared/software/CBI/pandoc-3.1.7/bin/pandoc

─ Packages ──────────────────────
 package     * version date (UTC) lib source
 cachem        1.0.8   2023-05-01 [1] CRAN (R 4.3.0)
 callr         3.7.3   2022-11-02 [1] CRAN (R 4.3.0)
 cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
 crayon        1.5.2   2022-09-29 [1] RSPM (R 4.3.0)
 devtools      2.4.5   2022-10-11 [1] RSPM (R 4.3.0)
 digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.3.0)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
 fs            1.6.3   2023-07-20 [1] RSPM (R 4.3.0)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
 htmltools     0.5.6   2023-08-10 [1] CRAN (R 4.3.1)
 htmlwidgets   1.6.2   2023-03-17 [1] RSPM (R 4.3.0)
 httpuv        1.6.11  2023-05-11 [1] RSPM (R 4.3.0)
 later         1.3.1   2023-05-02 [1] CRAN (R 4.3.0)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
 magrittr      2.0.3   2022-03-30 [1] RSPM (R 4.3.0)
 memoise       2.0.1   2021-11-26 [1] CRAN (R 4.3.0)
 mime          0.12    2021-09-28 [1] CRAN (R 4.3.0)
 miniUI        0.1.1.1 2018-05-18 [1] RSPM (R 4.3.0)
 pkgbuild      1.4.2   2023-06-26 [1] CRAN (R 4.3.1)
 pkgload       1.3.3   2023-09-22 [1] CRAN (R 4.3.1)
 prettyunits   1.2.0   2023-09-24 [1] RSPM (R 4.3.0)
 processx      3.8.2   2023-06-30 [1] CRAN (R 4.3.1)
 profvis       0.3.8   2023-05-02 [1] RSPM (R 4.3.0)
 promises      1.2.1   2023-08-10 [1] CRAN (R 4.3.1)
 ps            1.7.5   2023-04-18 [1] RSPM (R 4.3.0)
 purrr         1.0.2   2023-08-10 [1] CRAN (R 4.3.1)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 Rcpp          1.0.11  2023-07-06 [1] CRAN (R 4.3.1)
 remotes       2.4.2.1 2023-07-18 [1] CRAN (R 4.3.1)
 rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
 sessioninfo   1.2.2   2021-12-06 [1] RSPM (R 4.3.0)
 shiny         1.7.5   2023-08-12 [1] RSPM (R 4.3.0)
 stringi       1.7.12  2023-01-11 [1] CRAN (R 4.3.0)
 stringr       1.5.0   2022-12-02 [1] CRAN (R 4.3.0)
 urlchecker    1.0.1   2021-11-30 [1] RSPM (R 4.3.0)
 usethis       2.2.2   2023-07-06 [1] CRAN (R 4.3.1)
 vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.3.0)
 xtable        1.8-4   2019-04-21 [1] RSPM (R 4.3.0)

 [1] /home/henrik/R/ubuntu22_04-x86_64-pc-linux-gnu-library/4.3-CBI-gcc11
 [2] /home/henrik/shared/software/CBI/_ubuntu22_04/R-4.3.1-gcc11/lib/R/library
```

HenrikBengtsson commented 11 months ago

Same example using the example HTML file that comes with the package:

library(xml2)
file <- system.file("extdata", "r-project.html", package = "xml2")
doc <- read_html(file)
class(doc)
#> [1] "xml_document" "xml_node"

raw <- xml2::xml_serialize(doc, connection = NULL)
doc2 <- xml2::xml_unserialize(raw)
#> Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
#>  Opening and ending tag mismatch: link line 12 and head [76]

Now, the problem seems to be with the xml_document class; it works with the xml_nodeset class:

children <- xml_children(doc)
class(children)
#> [1] "xml_nodeset"

raw <- xml2::xml_serialize(children, connection = NULL)
children2 <- xml2::xml_unserialize(raw)
children2
#> {xml_nodeset (2)}
#> [1] <head>\n  <meta http-equiv="Content-Type" content="text/html; charset=UTF ...
#> [2] <body>\n  <div class="container page">\n    <div class="row">\n      <div ...

all.equal(children2, children)
#> [1] TRUE
HenrikBengtsson commented 11 months ago

I think it's because xml_unserialize() attempts to read it as an XML file and not as an HTML file in:

https://github.com/r-lib/xml2/blob/ef2310ba246aab7842100b271a610ebfa353ee16/R/xml_serialize.R#L66-L67

We get the same error message if we try:

file <- system.file("extdata", "r-project.html", package = "xml2")
doc <- xml2::read_xml(file)
#> Error in read_xml.character(file) : 
#>   Opening and ending tag mismatch: link line 14 and head [76]
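
If that diagnosis is right, a stopgap may be to request the HTML parser explicitly when unserializing. The following is only a minimal sketch: it assumes that arguments given in `...` are forwarded unchanged to read_xml() (as the traceback above suggests) and that read_xml() accepts `as_html` for character input. It also peeks at the serialized payload with base unserialize() to show that it is a plain character vector tagged with the "xml_serialized_document" class.

```r
library(xml2)

file <- system.file("extdata", "r-project.html", package = "xml2")
doc <- read_html(file)
raw <- xml_serialize(doc, connection = NULL)

# The payload is the document rendered as text, tagged with an S3 class.
payload <- unserialize(raw)
class(payload)

# Assumption: `as_html` is passed through `...` to read_xml(), so the
# text is parsed with the HTML parser instead of the XML parser.
doc2 <- xml_unserialize(raw, as_html = TRUE)
class(doc2)
```

This relies on the caller knowing that the original document was HTML, which is exactly the information the serialized object does not carry on its own.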
HenrikBengtsson commented 11 months ago

One solution is to have xml_serialize.xml_document() also record the document type and then have xml_unserialize() set as_html accordingly. Here's a working patch:

$ git diff -u R/xml_serialize.R
diff --git a/R/xml_serialize.R b/R/xml_serialize.R
index 3f7357f..74e1608 100644
--- a/R/xml_serialize.R
+++ b/R/xml_serialize.R
@@ -22,7 +22,7 @@ xml_serialize.xml_document <- function(object, connection, ...) {
     connection <- file(connection, "w", raw = TRUE)
     on.exit(close(connection))
   }
-  serialize(structure(as.character(object, ...), class = "xml_serialized_document"), connection)
+  serialize(structure(as.character(object, ...), doc_type = doc_type(object), class = "xml_serialized_document"), connection)
 }

 #' @export
@@ -64,7 +64,13 @@ xml_unserialize <- function(connection, ...) {
     # Select only the root
     res <- xml_find_first(x, "/node()")
   } else if (inherits(object, "xml_serialized_document")) {
-    res <- read_xml(unclass(object), ...)
+    read_xml_int <- function(object, as_html = FALSE, ...) {
+      if (missing(as_html)) {
+        as_html <- identical(attr(object, "doc_type", exact = TRUE), "html")
+      }
+      read_xml(unclass(object), as_html = as_html, ...)
+    }
+    res <- read_xml_int(unclass(object), ...)
   } else {
     stop("Not a serialized xml2 object", call. = FALSE)
   }
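
The patch depends on two base R behaviours: attributes set via structure() survive serialize()/unserialize(), and unclass() drops only the class attribute. The sketch below illustrates that in isolation, without xml2; the HTML string is a hypothetical stand-in for as.character(doc), and "html" is assumed to be the value recorded for HTML documents.

```r
# Hypothetical payload standing in for as.character(doc).
payload <- structure(
  "<html><head></head><body></body></html>",
  doc_type = "html",
  class = "xml_serialized_document"
)

raw <- serialize(payload, connection = NULL)
restored <- unserialize(raw)

# unclass() removes only the class attribute; doc_type stays attached,
# so it can still be inspected after unclassing.
plain <- unclass(restored)
as_html <- identical(attr(plain, "doc_type", exact = TRUE), "html")
as_html
```

Objects serialized before such a change would carry no doc_type attribute, in which case the identical() check yields FALSE and parsing falls back to XML as before.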

I've submitted this patch in PR #408.
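
A quick way to check whether an installed xml2 has this behaviour might look like the sketch below. `roundtrip_ok()` is a hypothetical helper, not part of xml2; it only reports whether unserializing succeeds and returns a document, and does not compare content.

```r
library(xml2)

# Hypothetical helper: TRUE if the serialize/unserialize roundtrip
# returns an xml_document, FALSE if unserializing raises an error.
roundtrip_ok <- function(doc) {
  raw <- xml_serialize(doc, connection = NULL)
  res <- tryCatch(xml_unserialize(raw), error = function(e) e)
  inherits(res, "xml_document")
}

xml_doc <- read_xml("<a><b/></a>")
html_doc <- read_html(system.file("extdata", "r-project.html", package = "xml2"))

roundtrip_ok(xml_doc)
roundtrip_ok(html_doc)
```

On an unpatched installation the second call would be expected to return FALSE because of the error reported above.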