ropensci / emld

:package: JSON-LD representation of EML
https://docs.ropensci.org/emld

Issue with textType serialization #36

Closed amoeba closed 5 years ago

amoeba commented 5 years ago

@srearl popped in to NCEAS EML Slack today with some weird emld behavior:

writeLines(
  as.character(
    emld::as_xml(
      list(additionalInfo = list(
        section = list(para = "some para"))))))

produces

<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1/ eml.xsd">
  <additionalInfo>
    <section>
      <section>some para</section>
    </section>
  </additionalInfo>
</eml:eml>

Instead of the intended

<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1/ eml.xsd">
  <additionalInfo>
    <section>
      <para>some para</para>
    </section>
  </additionalInfo>
</eml:eml>

I poked around at the source a bit and didn't quite see what's up but wanted to file an issue. I can look again later on this week I bet.

cboettig commented 5 years ago

Thanks for the note.

The short answer is that the content of textType nodes like section or para is treated as literal, so you would have to write literal XML:

writeLines(
  as.character(
    emld::as_xml(
      list(additionalInfo = list(section = "<para>some para</para>")))))
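When the paragraphs start out as an R character vector, the literal XML string can be assembled programmatically rather than typed by hand. The following is only a sketch in base R: `escape_xml()` and `paras_to_xml()` are hypothetical helpers made up for illustration, not part of emld or EML.

```r
# Hypothetical helpers (not part of emld): build the literal XML string
# that a textType node such as `section` expects.

# Escape the characters that are special in XML text content.
escape_xml <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)  # must run before the others
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  gsub(">", "&gt;", x, fixed = TRUE)
}

# Wrap each element of a character vector in <para> tags and
# concatenate everything into a single string.
paras_to_xml <- function(paragraphs) {
  paste0("<para>", escape_xml(paragraphs), "</para>", collapse = "")
}

section_xml <- paras_to_xml(c("some para", "pH < 7 & rising"))
cat(section_xml, "\n")
```

The resulting string would take the place of the hand-written `"<para>some para</para>"` in the list passed to `emld::as_xml()`. The thread only demonstrates the single-paragraph case, so treat multi-paragraph input as untested with emld.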

On the practical side, I don't think there's a compelling use case for anyone to either structure or parse text manually as in the problematic example. Once you say section, what follows is some text that should really be left as XML, or be imported directly from some other format (MS Word, Markdown rendering, etc., like we do in the EML package). I believe that turning each para into its own JSON key / list object would make this text harder to work with, not easier. Unlike the rest of EML, a bunch of <para> items, <title> items, etc. really can only be understood as XML and cannot and should not be interpreted as key-value pairs.

The difficulty is that textType embraces a whole bunch of DocBook that, unlike the rest of EML, cannot be expressed in key-value pairs. To me this is an excellent example of the fundamental philosophical difference between JSON and XML: XML is markup and can do stuff like <para> some <b>bold</b> text</para>, which has no analogous translation into JSON, or into RDF concepts for that matter (or even into object-oriented S4; this problem also affected the S4 version of the package). I think the main reason JSON is easier to work with than XML is precisely that JSON can't do markup; it can strictly only represent key-value pairs. In the emld model (indeed, in any RDF worldview), all textType content is just a 'value'; it's not meant to be decomposed.
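The mixed-content point can be made concrete by parsing that `<para>` fragment with the xml2 package (assumed to be installed here). This is just a sketch: the parsed result is an ordered sequence in which only the `<b>` element has a name, so it cannot be rewritten as key-value pairs without losing the surrounding text and its position.

```r
library(xml2)

# Parse the mixed-content fragment from the comment above.
doc <- read_xml("<para> some <b>bold</b> text</para>")

# as_list() keeps the children in document order: two bare text
# nodes with the <b> element in between.
children <- as_list(doc)$para

# Only the element has a name; the text nodes have none, so there is
# no key under which " some " or " text" could be stored.
print(names(children))

# Keeping only the named (key-value) part drops both pieces of
# loose text and any record of where they sat.
key_value_only <- children[names(children) != ""]
str(key_value_only)
```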

... hehe, wow, apparently I have more opinions on this thing than I realized. anyway, hope this helps some and happy to be convinced that we should change something.

srearl commented 5 years ago

Thank you very much for giving this some thought, @cboettig, these are excellent points. I think you are absolutely correct from the perspective of machine readability and interpretability. The challenge comes when we try to convey more human-readable information within the constraints of XML/EML, which of course also depends on which EML components are interpreted and displayed, and how (for example, by the EDI data portal). @amoeba had suggested (wisely) in our Slack thread that Markdown would be an alternative and, in fact, better approach. I agree, and am very much looking forward to Markdown support in EML 2.2.

cboettig commented 5 years ago

@srearl Note that you can already use Markdown as an input in EML 2.1.1 by letting the EML::set_TextType function translate the Markdown into the XML tags.

I think this approach of separating out non-trivial markup text from the eml construction bits is a bit cleaner to read than the above.

writeLines('
## General Protocols

Field methods. All experiments will be carried out in the greenhouse at Harvard Forest. We have developed an instrumentation system ....

Proteomic analysis. Proteomic profiles of microbial communities are determined after separating the microbial ...

## Specific Experiments

Experiment #1. Effects of nutrient enrichment on state changes and [O2] profiles. This experiment alters ...',

           "section.md")

eml <- list(additionalInfo = EML::set_TextType("section.md"))

And observe the XML we get back:

writeLines(as.character(emld::as_xml(eml)))

<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1/ eml.xsd">
  <additionalInfo>
    <section>
      <title>General Protocols</title>
      <para>
    Field methods. All experiments will be carried out in the greenhouse
    at Harvard Forest. We have developed an instrumentation system ….
  </para>
      <para>
    Proteomic analysis. Proteomic profiles of microbial communities are
    determined after separating the microbial …
  </para>
    </section>
    <section>
      <title>Specific Experiments</title>
      <para>
    Experiment #1. Effects of nutrient enrichment on state changes and
    [O2] profiles. This experiment alters …
  </para>
    </section>
  </additionalInfo>
</eml:eml>

amoeba commented 5 years ago

Thanks for the thoughts and example code, @cboettig. I can live with this and your explanation makes sense re: no good analog.