Closed HenrikBengtsson closed 1 year ago
Same example using the example HTML file that comes with the package:
library(xml2)
file <- system.file("extdata", "r-project.html", package = "xml2")
doc <- read_html(file)
class(doc)
#> [1] "xml_document" "xml_node"
raw <- xml2::xml_serialize(doc, connection = NULL)
doc2 <- xml2::xml_unserialize(raw)
#> Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html, :
#> Opening and ending tag mismatch: link line 12 and head [76]
Now, the problem seems to be with the `xml_document` class; it works with the `xml_nodeset` class:
children <- xml_children(doc)
class(children)
#> [1] "xml_nodeset"
raw <- xml2::xml_serialize(children, connection = NULL)
children2 <- xml2::xml_unserialize(raw)
children2
#> {xml_nodeset (2)}
#> [1] <head>\n <meta http-equiv="Content-Type" content="text/html; charset=UTF ...
#> [2] <body>\n <div class="container page">\n <div class="row">\n <div ...
all.equal(children2, children)
#> [1] TRUE
I think it's because `xml_unserialize()` attempts to read it as an XML file and not as an HTML file.
We get the same error message if we try:
file <- system.file("extdata", "r-project.html", package = "xml2")
doc <- xml2::read_xml(file)
#> Error in read_xml.character(file) :
#> Opening and ending tag mismatch: link line 14 and head [76]
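For comparison, here is a minimal sketch of the same file parsed with the HTML parser instead, which is what `read_html()` does under the hood. It assumes only the `as_html` argument of `read_xml()`; the `tryCatch()` wrapper is there just so the script keeps running past the XML failure.

```r
library(xml2)

file <- system.file("extdata", "r-project.html", package = "xml2")

# Strict XML parsing: in the report above this fails with the
# "Opening and ending tag mismatch" error, so capture the condition.
res_xml <- tryCatch(read_xml(file), error = function(e) e)
inherits(res_xml, "error")

# Lenient HTML parsing of the very same file.
doc_html <- read_xml(file, as_html = TRUE)
class(doc_html)
```

If the second call succeeds while the first one fails, that supports the idea that the roundtrip only needs to remember which parser to use.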
One solution for this is to have `xml_serialize.xml_document()` also record the document type and then have `xml_unserialize()` set `as_html` accordingly. Here's a working patch:
$ git diff -u R/xml_serialize.R
diff --git a/R/xml_serialize.R b/R/xml_serialize.R
index 3f7357f..74e1608 100644
--- a/R/xml_serialize.R
+++ b/R/xml_serialize.R
@@ -22,7 +22,7 @@ xml_serialize.xml_document <- function(object, connection, ...) {
connection <- file(connection, "w", raw = TRUE)
on.exit(close(connection))
}
- serialize(structure(as.character(object, ...), class = "xml_serialized_document"), connection)
+ serialize(structure(as.character(object, ...), doc_type = doc_type(object), class = "xml_serialized_document"), connection)
}
#' @export
@@ -64,7 +64,13 @@ xml_unserialize <- function(connection, ...) {
# Select only the root
res <- xml_find_first(x, "/node()")
} else if (inherits(object, "xml_serialized_document")) {
- res <- read_xml(unclass(object), ...)
+ read_xml_int <- function(object, as_html = FALSE, ...) {
+ if (missing(as_html)) {
+ as_html <- identical(attr(object, "doc_type", exact = TRUE), "html")
+ }
+ read_xml(unclass(object), as_html = as_html, ...)
+ }
+ res <- read_xml_int(unclass(object), ...)
} else {
stop("Not a serialized xml2 object", call. = FALSE)
}
I've submitted this patch in PR #408.
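For anyone who needs a roundtrip with an unpatched xml2, a possible user-level workaround is sketched below. The helpers `serialize_html()` and `unserialize_html()` are hypothetical names, not part of xml2, and the sketch assumes the caller already knows the document is HTML, so it does not detect the document type the way the patch does.

```r
library(xml2)

# Hypothetical helper (not part of xml2): store the document as a
# plain character string inside a base-R serialization.
serialize_html <- function(doc) {
  serialize(as.character(doc), connection = NULL)
}

# Hypothetical helper (not part of xml2): restore the string and parse
# it with the HTML parser rather than the XML parser.
unserialize_html <- function(raw) {
  read_html(unserialize(raw))
}

file <- system.file("extdata", "r-project.html", package = "xml2")
doc <- read_html(file)

doc2 <- unserialize_html(serialize_html(doc))
class(doc2)
xml_name(xml_children(doc2))
```

This bypasses `xml_serialize()` and `xml_unserialize()` entirely, so it is only a stopgap for HTML documents.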
Issue
The `xml_serialize()`-`xml_unserialize()` roundtrip fails with: "Opening and ending tag mismatch: link line 12 and head [76]". I'd expect a roundtrip to always work.
Reproducible Example
Traceback:
Session Info
```r
> devtools::session_info()
# Paste output below
─ Session info ─────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       Ubuntu 22.04.3 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Los_Angeles
 date     2023-10-03
 pandoc   3.1.7 @ /home/henrik/shared/software/CBI/pandoc-3.1.7/bin/pandoc

─ Packages ──────────────────────
 package     * version date (UTC) lib source
 cachem        1.0.8   2023-05-01 [1] CRAN (R 4.3.0)
 callr         3.7.3   2022-11-02 [1] CRAN (R 4.3.0)
 cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
 crayon        1.5.2   2022-09-29 [1] RSPM (R 4.3.0)
 devtools      2.4.5   2022-10-11 [1] RSPM (R 4.3.0)
 digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.3.0)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
 fs            1.6.3   2023-07-20 [1] RSPM (R 4.3.0)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
 htmltools     0.5.6   2023-08-10 [1] CRAN (R 4.3.1)
 htmlwidgets   1.6.2   2023-03-17 [1] RSPM (R 4.3.0)
 httpuv        1.6.11  2023-05-11 [1] RSPM (R 4.3.0)
 later         1.3.1   2023-05-02 [1] CRAN (R 4.3.0)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
 magrittr      2.0.3   2022-03-30 [1] RSPM (R 4.3.0)
 memoise       2.0.1   2021-11-26 [1] CRAN (R 4.3.0)
 mime          0.12    2021-09-28 [1] CRAN (R 4.3.0)
 miniUI        0.1.1.1 2018-05-18 [1] RSPM (R 4.3.0)
 pkgbuild      1.4.2   2023-06-26 [1] CRAN (R 4.3.1)
 pkgload       1.3.3   2023-09-22 [1] CRAN (R 4.3.1)
 prettyunits   1.2.0   2023-09-24 [1] RSPM (R 4.3.0)
 processx      3.8.2   2023-06-30 [1] CRAN (R 4.3.1)
 profvis       0.3.8   2023-05-02 [1] RSPM (R 4.3.0)
 promises      1.2.1   2023-08-10 [1] CRAN (R 4.3.1)
 ps            1.7.5   2023-04-18 [1] RSPM (R 4.3.0)
 purrr         1.0.2   2023-08-10 [1] CRAN (R 4.3.1)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 Rcpp          1.0.11  2023-07-06 [1] CRAN (R 4.3.1)
 remotes       2.4.2.1 2023-07-18 [1] CRAN (R 4.3.1)
 rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
 sessioninfo   1.2.2   2021-12-06 [1] RSPM (R 4.3.0)
 shiny         1.7.5   2023-08-12 [1] RSPM (R 4.3.0)
 stringi       1.7.12  2023-01-11 [1] CRAN (R 4.3.0)
 stringr       1.5.0   2022-12-02 [1] CRAN (R 4.3.0)
 urlchecker    1.0.1   2021-11-30 [1] RSPM (R 4.3.0)
 usethis       2.2.2   2023-07-06 [1] CRAN (R 4.3.1)
 vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.3.0)
 xtable        1.8-4   2019-04-21 [1] RSPM (R 4.3.0)

 [1] /home/henrik/R/ubuntu22_04-x86_64-pc-linux-gnu-library/4.3-CBI-gcc11
 [2] /home/henrik/shared/software/CBI/_ubuntu22_04/R-4.3.1-gcc11/lib/R/library
```