ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogenous data
https://docs.ropensci.org/EML

`read_eml` inserts `\n` after all text formatting tags #282

Closed: jeanetteclark closed this issue 5 years ago

jeanetteclark commented 5 years ago

Within a para element, if there is a text formatting tag (such as superscript), when the EML is read into the environment, a new line is inserted after every formatting tag. This causes display issues in MetacatUI. Note that the newline is not inserted until read_eml is called.

Here is an MRE:

library(EML)
library(stringr)
library(magrittr)

me <- list(individualName = list(givenName = "Jeanette", surName = "Clark"))
doc <- list(dataset = list(
  title = "A Minimal Valid EML Dataset",
  creator = me,
  contact = me)
)

doc$dataset$abstract <- list(para = list("HCO3, 18O, 14C, 88, some other stuff", "13O, 86Sr, 88Sr"))

n_para <- length(doc$dataset$abstract$para)

# wrap isotope and formula digits in EML text formatting tags
for (i in seq_len(n_para)) {
  doc$dataset$abstract$para[[i]] <- str_replace_all(doc$dataset$abstract$para[[i]], 'CO2', 'CO<subscript>2</subscript>') %>%
    str_replace_all('HCO3', 'HCO<subscript>3</subscript>') %>%
    str_replace_all('18O', '<superscript>18</superscript>O') %>%
    str_replace_all('14C', '<superscript>14</superscript>C') %>%
    str_replace_all('88', '<superscript>88</superscript>') %>%
    str_replace_all('13O', '<superscript>13</superscript>O') %>%
    str_replace_all('86Sr', '<superscript>86</superscript>Sr') %>%
    str_replace_all('88Sr', '<superscript>88</superscript>Sr')
}

# no new line characters
doc$dataset$abstract$para[[1]]
#> [1] "HCO<subscript>3</subscript>, <superscript>18</superscript>O, <superscript>14</superscript>C, <superscript>88</superscript>, some other stuff"

write_eml(doc, "example.xml")
eml_validate("example.xml")
#> [1] TRUE
#> attr(,"errors")
#> character(0)

doc_read <- read_eml("example.xml")

# new line characters after every superscript
doc_read$dataset$abstract$para[[1]]
#> [1] "HCO\n<subscript>3</subscript>\n, \n<superscript>18</superscript>\nO, \n<superscript>14</superscript>\nC, \n<superscript>88</superscript>\n, some other stuff"

Any idea what is causing this, @amoeba or @cboettig? Thanks @dmullen17 for finding this one.

amoeba commented 5 years ago

Very interesting. Couple of thoughts here:

cboettig commented 5 years ago

yeah, I suspect docbook or some combination of docbook+xml2.

Just to be pedantic, technically I think the contents of the para[[1]] element are interpreted as being literal XML, and as such the newlines are meaningless. Maybe I'm wrong, but I would think if Metacat parses that string as XML it shouldn't care how many newlines it finds? (but I do feel this has always been a weakness of the original EML spec: the way TextType is defined makes things difficult for parsers)

mbjones commented 5 years ago

yeah, if this is in docbook sections, the XML whitespace normalization rules apply, which means this can cause leading/trailing whitespace to be removed and runs of whitespace converted into a single space, depending on the processing rules in effect. DTD processing rules and XSD processing rules for whitespace can differ.
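To make the normalization point concrete, here is a minimal base R sketch of the "collapse" style of whitespace handling (tabs and line breaks become spaces, runs of spaces shrink to one, leading and trailing spaces are dropped). The helper name `collapse_ws` is hypothetical and only for illustration; it is not part of EML, emld, or Metacat, and which rule a given processor actually applies depends on its configuration.

```r
# Hypothetical helper: mimics "collapse"-style whitespace normalization.
collapse_ws <- function(x) {
  # Replace each tab, line feed, and carriage return with a space
  x <- gsub("[\t\n\r]", " ", x)
  # Squeeze runs of spaces into a single space
  x <- gsub(" +", " ", x)
  # Drop the (at most one) remaining leading or trailing space
  gsub("^ | $", "", x)
}

# The kind of string read_eml returns for a para with a subscript
para <- "H\n<subscript>2</subscript>\nO"
collapse_ws(para)
#> [1] "H <subscript>2</subscript> O"
```

Under this rule the inserted newlines do not disappear; each one survives as a single space on either side of the formatting tag, which would be consistent with stray spaces showing up in rendered text.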

jeanetteclark commented 5 years ago

@amoeba the MetacatUI display issue is visible here: https://test.arcticdata.io/view/urn:uuid:27253c01-219e-4331-953e-133faed5304d

Note the space between the sub/superscripts where there shouldn't be one. I believe this is because of the \n characters (which, based on Matt's comment above, sound like they are maybe being converted to a single space?)

amoeba commented 5 years ago

Yeah, the newline ends up being treated as a single space in the HTML (MetacatUI is doing an XML->HTML conversion and ultimately lets the web browser handle the layout). What read_eml is doing is turning this XML

    <abstract>
      <para>H<subscript>2</subscript>O</para>
    </abstract>

into

list(para = "H\n<subscript>2</subscript>\nO")

which, at first glance, doesn't make sense, as there are no newlines in the document (pretty printing doesn't seem to be to blame).

amoeba commented 5 years ago

I think I've narrowed it down to behavior in emld. The newlines come in due to

https://github.com/ropensci/emld/blob/c67786b5eec985b9e5cf97411327372f75f662f4/R/as_jsonlist.R#L27

Specifically the collapse = "\n" argument. Setting it to "" makes the newlines go away:

# Current behavior
> paste(as.character(xml_contents(x)), collapse = "\n")
[1] "H\n<subscript>2</subscript>\nO"

# Proposed new behavior
> paste(as.character(xml_contents(x)), collapse = "")
[1] "H<subscript>2</subscript>O"

The reason collapse even gets invoked is that xml2::xml_contents is returning the three child nodes of the para element, which are the text node H, the element <subscript>2</subscript>, and the text node O.

I'll see if changing this breaks any tests in EML or emld and PR. Unless anyone can think of a reason not to do this?
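The effect described above can be reproduced with xml2 alone, without EML or emld. The following is a rough sketch under that assumption: the variable names (`parts`, `with_newlines`, `without_newlines`) are illustrative, and the real emld code at the linked line may do more than this single `paste` call.

```r
library(xml2)

# Parse the one-line para element used in the example above
x <- read_xml("<para>H<subscript>2</subscript>O</para>")

# xml_contents() returns every child node, text nodes included,
# so a para with one inline tag yields three pieces
parts <- as.character(xml_contents(x))
length(parts)
#> [1] 3

# Joining with "\n" introduces newlines that were never in the document
with_newlines <- paste(parts, collapse = "\n")
with_newlines
#> [1] "H\n<subscript>2</subscript>\nO"

# Joining with "" reproduces the original inline markup
without_newlines <- paste(parts, collapse = "")
without_newlines
#> [1] "H<subscript>2</subscript>O"
```

A para with no formatting tags has a single text child, so `collapse` has nothing to join there, which may be why the problem only shows up once inline markup is present.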

cboettig commented 5 years ago

Thanks all!