Closed jeanetteclark closed 5 years ago
Very interesting. Couple of thoughts here:
xml2 could be partly involved here. write_eml calls emld::as_xml, which calls xml2::write_xml, and I think xml2 pretty-prints by default. Adding newlines is common when pretty-printing.
read_eml which seems wrong. EML does extra work to work with DocBook and there might be something odd going on here.
yeah, I suspect docbook or some combination of docbook+xml2.
Just to be pedantic, technically I think the contents of the para[[1]] element are interpreted as literal XML, and as such the newlines are meaningless. Maybe I'm wrong, but I would think if Metacat parses that string as XML it shouldn't care how many newlines it finds? (But I do feel this has always been a weakness of the original EML spec: the way TextType is defined makes things difficult for parsers.)
yeah, if this is in docbook sections, the XML whitespace normalization rules apply, which means this can cause leading/trailing whitespace to be removed and runs of whitespace converted into a single space, depending on the processing rules in effect. DTD processing rules and XSD processing rules for whitespace can differ.
@amoeba the MetacatUI display issue is visible here: https://test.arcticdata.io/view/urn:uuid:27253c01-219e-4331-953e-133faed5304d
Note the space between the sub/superscripts where there shouldn't be one. I believe this is because of the \n characters (which, based on Matt's comment above, sound like they are being converted to a single space?)
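To illustrate the collapsing described above, here is a rough base R sketch. The helper collapse_ws is hypothetical (it is not part of EML, emld, or xml2), and it only approximates whitespace normalization: leading and trailing whitespace is dropped and each run of whitespace becomes one space. The input string is an assumed example of a paragraph containing stray newlines.

```r
# Rough approximation of XML/HTML whitespace collapsing.
# collapse_ws is a hypothetical helper, not an EML/emld/xml2 function.
collapse_ws <- function(s) {
  # trim the ends, then turn every run of whitespace into one space
  gsub("[ \t\r\n]+", " ", trimws(s))
}

# Assumed example: a paragraph string with newlines around a tag
para <- "H\n<subscript>2</subscript>\nO"

collapsed <- collapse_ws(para)
print(collapsed)
```

Each newline survives as a single space, which is consistent with the gap seen around the subscript in the rendered page.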
Yeah, the newline ends up being treated as a single space in the HTML (MetacatUI is doing an XML->HTML conversion and ultimately lets the web browser handle the layout). What read_eml is doing is turning this XML
<abstract>
<para>H<subscript>2</subscript>O</para>
</abstract>
into
list(para = "H\n<subscript>2</subscript>\nO")
which, at first glance, doesn't make sense, as there are no newlines in the document (so pretty-printing doesn't seem to be to blame).
I think I've narrowed it down to behavior in emld. The newlines come in due to
https://github.com/ropensci/emld/blob/c67786b5eec985b9e5cf97411327372f75f662f4/R/as_jsonlist.R#L27
Specifically the collapse = "\n" argument. Setting it to "" makes the newlines go away:
# Current behavior
> paste(as.character(xml_contents(x)), collapse = "\n")
[1] "H\n<subscript>2</subscript>\nO"
# Proposed new behavior
> paste(as.character(xml_contents(x)), collapse = "")
[1] "H<subscript>2</subscript>O"
The reason collapse even gets invoked is because xml2::xml_contents is returning the three child nodes of the para element which are:
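A minimal sketch of how those child nodes can be inspected, assuming the xml2 package is installed. The one-line document is an illustrative stand-in for the abstract above, and the variable names are made up for this example; this is not the emld code itself, only the same paste call applied to the same kind of node set.

```r
# Assumes xml2 is installed; the document string is illustrative.
library(xml2)

x <- read_xml("<para>H<subscript>2</subscript>O</para>")

# xml_contents() returns every child node, text nodes included
# (xml_children() would return only the <subscript> element).
nodes <- as.character(xml_contents(x))
print(nodes)

# Joining with "\n" injects newlines that were never in the document;
# joining with "" reproduces the original mixed content.
with_newlines <- paste(nodes, collapse = "\n")
without_newlines <- paste(nodes, collapse = "")
```

Because the text before and after the subscript tag are separate text nodes, paste receives a vector of length three, and whatever collapse is set to ends up between the text and the tag.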
I'll see if changing this breaks any tests in EML or emld and open a PR, unless anyone can think of a reason not to do this?
Thanks all!
Within a para element, if there is a text formatting tag (such as superscript), when the EML is read into the environment, a new line is inserted after every formatting tag. This causes display issues in MetacatUI. Note that the newline is not inserted until read_eml is called.
Here is an MRE:
Any idea what is causing this @amoeba or @cboettig? Thanks @dmullen17 for finding this one