Garbling of reserved characters when reading in/updating a dataset

eeerika commented 3 years ago

library(EML)
url <- "https://arcticdata.io/metacat/d1/mn/v2/object/urn:uuid:cbac3efb-25bd-44d9-803a-73c1d0387844"
doc <- read_eml(url)
doc$dataset$abstract$para

[1] "This dataset contains PAPP OTC 4 Community data from the Latnjajaure site, Sweden in 1995, 1996, 1997 & 1998. The Press and Pulse Program (PAPP) experiment is comprised of four replicate blocks, each of which has four plots. These plots were subject to the OTC 4 treatment. For more information, please see the readme file."

The output of doc$dataset$abstract$para shows the "&" as "&" when it should just show "&" instead. This results in the "&" showing as "&amp;" on the dataset's web page once updated.

mbjones commented 3 years ago

@cboettig I took a quick peek at this with @eeerika and noticed that emld is treating para and section elements as literals, and thus failing to unescape XML entities such a &. See https://github.com/ropensci/emld/blob/master/R/as_jsonlist.R#L26 Can you elaborate on why these two element types are being treated specially? Seems to me like the simple fix would be to delete lines 26-28, but you must have a reason for that logic.

cboettig commented 3 years ago

@mbjones been a while so I could be mistaken, but...

most of EML translates fine into key-value pairs for json, list, etc, but para and section can contain docbook elements and there's no natural way to represent something like <para> a phrase with <strong>markup<strong/><para/> into key-value pairs. So textTypes are treated literally.

I think we could wrap it something like:

unescape_xml <- function(str){
  xml2::xml_text(xml2::read_xml(paste0("<x>", str, "</x>")))
}

to repair the text? (untested, not sure if that would round-trip safely either...)

mbjones commented 3 years ago

I see. Seems you are trying to preserve the docbook elements in the text value, even though they are just xml nodes just like all of the other elements. In emld, it might be cleaner to not differentiate para and section, and treat them like nodesets like everything else (in this case, mixed content node sets). Is the challenge that you might expect the constructor for para to take a character string with docbook xml in it, rather than a nodeset per se? If so, then I would argue that a specialized constructor for para and section could take a character string with docbook tags, convert it to a nodeset, and then add it to the emld model as a nodeset. That is, separate the special handling of para and section when constructing those nodes. Once the emld tree is built, the docbook nodes would show up as child nodes of para, etc. This would make the text nodes into hierarchies. For example, we would go from:

para:
    text: "a phrase with <strong>markup</strong>"

to this:

para:
    text: "a phrase with "
    strong:
        text: "markup"

cboettig commented 3 years ago

@mbjones one of the goals in emld was to generate viable JSON representation (and indeed valid JSON-LD representation, with clear semantics). HTML markup strings aren't uncommon in JSON, but I think

para:
    text: "a phrase with "
    strong:
        text: "markup"

is not particularly readable to a human, nor does it make semantic sense if serialized from JSON-LD to RDF. (and an R user would be very confused by seeing this parsed into the list structure of para = list(text = "a phase with", strong = list(text = "markup")), and would have to work much harder to make this workable text then if they can parse the raw docbook string)

We also have an ordering problem. key-value pairs are an unordered representation. (And semantic data practices tell us that information should be explicit, not encoded in order, right?) emld knows valid orderings required by EML, so it can still ensure the output XML format is valid without the user having to worry about that. But none of this translates to the markup formats above, which would get particularly messy with a lot of rich markup tags (headings, emphasis, subscripts, etc).

I think it is much better to treat text strings separately from other nodes. I think we can xml_unescape them, and we probably need to then also re xml_escape them when going the other way. In this way we provide the 'raw markup' which happens to be docbook-esque in this case, but also matches the raw markup you'd get using another markup language for text strings, like the new markdown format. There's plenty of examples of JSON-LD out there that have HTML markup in the value strings, as you know, and that seems to be the pattern we should follow.

mbjones commented 3 years ago

@cboettig I'm entirely convinced that your way is best, even if it requires some convoluted escaping. The ordering problem is particularly compelling on the JSON side, even though its not an issue in XML, and complicates the translation considerably. Thanks for clarifying.

I suspect the escaping is going to be more difficult than it seems on the surface. In particular, in this scenario, some disallowed characters would be escaped in the XML, and some would not.

TL;DR: A whole bunch of messed up examples follow...

Here's an example that illustrates the problem we'll have to deal with, starting with a valid EML snippet we might find in the wild:

<purpose>
    <para>So long as weak &lt;strong, and we <strong>properly</strong> treat ampersand (&amp;) characters</para>
</purpose>

When emld reads this according to your approach, I think $purpose$para would then return a value like this:

So long as weak <strong, and we <strong>properly</strong> treat ampersand (&) characters

So note in that case that we did unescape the XML Entities, making them look like the < opening the strong element tags. So, now imagine going back the other way, from the R string to XML. We now have 3 < signs in the R text, only the first of which should be escaped, along with the & which should be escaped. So our escaping routine will have to have enough knowledge of XML to properly detect the XML structure properly in the plain text string.

In addition, all of the following are equivalent from XML's perspective:

<para>So long as weak &lt;strong, and we <strong>properly</strong> treat ampersand (&amp;) characters</para>

<para>So long as weak &#60;strong, and we <strong>properly</strong> treat ampersand (&#38;) characters</para>

<para>So long as weak &lt;strong, and we <strong       > properly</strong    > treat ampersand (&amp;) characters</para>

<para>So long as weak &lt;strong, and we <strong
>
properly</strong
> treat ampersand (&amp;) characters</para>

Things like that last one with newlines inside the element tag and the previous one with spaces give me pause for using plain string parsing to determine whether or not to escape it. After we unescape it into R, we might see something like this in R:

So long as weak <strong, and we <strong\n
>\n
properly</strong\n
> treat ampersand (&) characters

When escaping that string, we'd have to escape the first < in that string, but not escape the second < in that string even though the character sequences are almost identical. The difference lies in the downstream > that follows arbitrary whitespace (and possibly other content). I don't think that is tractable without a formal parser. Hopefully I am mistaken and there's a good approach. One possible saving grace is that we could leverage the fact that EML only really allows a limited set of Docbook elements in TextType fields, and so there are very specific element names we'd need to detect in the plain R strings.

We should certainly include some encoding/decoding tests that exercise these funky formatted XML snippets. Here are some R text strings I think we should test with (\n represents a newline in these strings, \t represents a tab:

So long as weak <strong, and we <strong\n>\nproperly</strong\n> treat ampersand (&) characters
So long as weak <strong\t  , and weak<strong\t     foo bar baz  > properly </strong > treat ampersand (&) characters
XML escapes & with the string &amp; or with &#38;

ropensci / EML

Garbling of reserved characters when reading in/updating a dataset #315