set_TextType can in some cases produce invalid EML

ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogeneous data

https://docs.ropensci.org/EML

set_TextType can in some cases produce invalid EML #217

Closed amoeba closed 5 years ago

amoeba commented 7 years ago

Given a markdown doc, 'complex_abstract.md':

# Header One

An unordered list follows this text:

- Item one
- Item two
- Item three

That was a list.

## Subheader One

1. This
2. is
3. an
4. ordered
5. list

the following R script indicates the resulting EML document is not valid:

library(EML)

f <- system.file("examples/hf205.xml", package = "EML")
eml <- read_eml(f)

eml@dataset@abstract <- as(set_TextType(system.file("tests", "testthat", "complex_abstract.md", package = "EML")), "abstract")
eml_validate(eml)

[1] FALSE
attr(,"errors")
[1] "Element 'itemizedlist': This element is not expected."

You can see what's going on if you view the serialized XML:


> eml@dataset@abstract

<abstract>
  <section>
    <title>Header One</title>
    <para>
    An unordered list follows this text:
  </para>
    <itemizedlist spacing="compact">
      <listitem>
        <para>
        Item one
      </para>
      </listitem>
      <listitem>
        <para>
        Item two
      </para>
      </listitem>
      <listitem>
        <para>
        Item three
      </para>
      </listitem>
    </itemizedlist>
    <para>
    That was a list.
  </para>
    <sect2 id="subheader-one">
      <title>Subheader One</title>
      <orderedlist numeration="arabic" spacing="compact">
        <listitem>
          <para>
          This
        </para>
        </listitem>
        <listitem>
          <para>
          is
        </para>
        </listitem>
        <listitem>
          <para>
          an
        </para>
        </listitem>
        <listitem>
          <para>
          ordered
        </para>
        </listitem>
        <listitem>
          <para>
          list
        </para>
        </listitem>
      </orderedlist>
    </sect2>
  </section>
</abstract>

My assessment is that itemizedlist and orderedlist elements always need to go inside para elements.

Is this unintended behavior?
Does the EML TextType not really support DocBook? As far as I can tell, the functionality inside the EML package is very lightweight and depends only on the Pandoc conversion from markdown to docbook.

amoeba commented 7 years ago

I played around with this and I find that Pandoc produces invalid DocBook XML by default when converting from Markdown. I'm sure there's something I don't know that would be really helpful right now.

pandoc -f markdown -t docbook complex_abstract.md -o abstract.xml
xerces -v -s -hs abstract.xml
[Error] abstract.xml:4:26: Document root element "section", must match DOCTYPE root "article".
[Error] abstract.xml:59:11: The content of element type "section" must match "(sectioninfo?,(title,subtitle?,titleabbrev?),(toc|lot|index|glossary|bibliography)*,(((calloutlist|glosslist|bibliolist|itemizedlist|orderedlist|segmentedlist|simplelist|variablelist|caution|important|note|tip|warning|literallayout|programlisting|programlistingco|screen|screenco|screenshot|synopsis|cmdsynopsis|funcsynopsis|classsynopsis|fieldsynopsis|constructorsynopsis|destructorsynopsis|methodsynopsis|formalpara|para|simpara|address|blockquote|graphic|graphicco|mediaobject|mediaobjectco|informalequation|informalexample|informalfigure|informaltable|equation|example|figure|table|msgset|procedure|sidebar|qandaset|task|anchor|bridgehead|remark|highlights|abstract|authorblurb|epigraph|indexterm|beginpage)+,(refentry*|section*|simplesect*))|refentry+|section+|simplesect+),(toc|lot|index|glossary|bibliography)*)".
abstract.xml: 4025 ms (24 elems, 7 attrs, 203 spaces, 284 chars)

After some quick fixes I was able to make it valid. This was a bit of an aside, because the valid DocBook still appears to be invalid EML.

cboettig commented 7 years ago

This is probably gonna be tricky. Right, EML TextType elements use only a subset of DocBook, so in general my strategy of generating arbitrary DocBook from pandoc isn't going to be particularly robust. Perhaps it would be possible to define some filter (maybe in XSLT, maybe just in R) that could strip a DocBook file down into just the EML-recognized elements?
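
A rough sketch of what such a filter might look like, in Python with only the standard library (not R or XSLT) so that it is self-contained. The `ALLOWED` set is an assumption about which element names EML accepts and would need to be checked against the schema. `strip_unknown` is a hypothetical helper, and replacing an unrecognized element with its text content is only one possible policy.

```python
import xml.etree.ElementTree as ET

# Assumption: a guess at the DocBook subset EML's TextType accepts.
ALLOWED = {
    "section", "title", "para", "itemizedlist", "orderedlist", "listitem",
    "emphasis", "subscript", "superscript", "literalLayout", "ulink", "citetitle",
}


def strip_unknown(parent):
    """Replace every descendant whose tag is not in ALLOWED with its plain text."""
    for child in list(parent):
        strip_unknown(child)  # handle the deepest elements first
        if child.tag in ALLOWED:
            continue
        # Text inside the element plus the text that followed it.
        text = "".join(child.itertext()) + (child.tail or "")
        index = list(parent).index(child)
        if index == 0:
            parent.text = (parent.text or "") + text
        else:
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + text
        parent.remove(child)
    return parent


doc = ET.fromstring(
    "<para>See <quote>the <emphasis>big</emphasis> book</quote> now.</para>"
)
strip_unknown(doc)
print(ET.tostring(doc, encoding="unicode"))
```

Note that the emphasis nested inside the unrecognized quote is flattened along with it, so this policy keeps the words but loses structure.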

I'm not particularly happy about EML's partial docbook support, it seems like it would have been better to go for full DocBook (or perhaps html) so one could use those tools, or go for pure plain text. (I suppose the current strategy is ok if you are going EML -> DocBook, but just not okay the other way).

An alternate strategy for easily embedding richer text inputs (e.g. from Word) would be to use pandoc to convert to markdown, and then embed the markdown directly into the EML (e.g. as any other plain-text string). The nice bit about this approach is that the text remains readable to tools consuming / displaying EML that don't parse DocBook. (I suppose the metacat UI parses DocBook text nodes properly though?)

Hm, pandoc making invalid Docbook sounds like a separate issue. (Could be a version thing? Though I'd have thought the XML namespaces were clear on that.) Anyway, if pandoc really is producing invalid DocBook, it would be good to file that as an issue in the pandoc repo (or the Google group; lots of smart devs over there who might be able to set us straight faster).

amoeba commented 7 years ago

> I'm not particularly happy about EML's partial docbook support, it seems like it would have been better to go for full DocBook (or perhaps html) so one could use those tools, or go for pure plain text. (I suppose the current strategy is ok if you are going EML -> DocBook, but just not okay the other way).

Same here. I don't know what the rationale was, though I seem to recall that it was explained to me at some point.

> (I suppose the metacat UI parses DocBook text nodes properly though?)

It does not seem to. It uses an XSLT that basically destroys the structure. I ran into all of this when trying to use itemized lists. Simple sections and paras work fine.

(EML 2 is in the late stages of development (see https://github.com/NCEAS/eml/projects/1), and I wonder what @mbjones would think about using Markdown instead.)

My feeling is that it'll be hard to find a solution here without writing something special and probably hacky. Stripping the docbook XML to make valid EML would remove information so I'd be more inclined to find something that modified the user's input at least. I'll think on this some more.

> Hm, pandoc making invalid Docbook sounds like a separate issue.

Probably. I'll look over there to see if I'm just doing something wrong.

cboettig commented 7 years ago

Thanks! If even MetaCat displays TextTypes by stripping DocBook with an XSLT transform rather than leveraging the DocBook, I'd be strongly inclined to have the EML R package take a similar route, rather than introduce a lot of (even EML-compliant) DocBook into the metadata, where it sounds like it will primarily be a nuisance to consumers of the data.

I wonder if we could use the same XSLT transform or if it would have to be modified significantly to consume full DocBook.

I do appreciate that there's great value in having things like machine-readable methods / protocols sections in EML, and that it's hard to expect people to write those in plain text (or even in markdown; most people will use Word). There's a temptation to treat such long-form documentation as external 'data' files, and maybe that's not so bad now that Word is XML-based anyway. So I think there's a lot of value in being able to automate the process of creating rich text sections like methods from a Word doc, and pandoc is obviously one promising, if far from perfect, way to do that.

amoeba commented 7 years ago

The XSLT that powers the view MetacatUI produces is here: https://code.ecoinformatics.org/code/metacat/trunk/lib/style/skins/metacatui/eml-2/eml-text.xsl It basically works only with a section or a para and just pumps any other child content out raw (via apply-templates) which amounts to ignoring the structure but including the content.

I don't know offhand how I might go after what you're thinking, but I'm sure someone can come up with something workable. Given how the EML TextType is defined, it looks like a recursive descent parser could wrap any node that isn't already a section or para in a <para> element, and we'd get realllllly close to valid EML TextType, if not all the way there.
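
A minimal sketch of that wrapping idea, in Python with only the standard library (not R) so it runs without extra packages. `wrap_stray_blocks` is a hypothetical name. The sketch assumes title elements should be left alone and that pandoc's sect1 to sect5 elements should be treated like section and recursed into. It does not rename them, so the result may still need further fixes before it validates as EML.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: these children are fine where they are.
KEEP_AS_IS = {"title", "para"}
# Assumption: pandoc's sect1..sect5 are treated like <section> and recursed into.
SECTION_RE = re.compile(r"^(section|sect[1-5])$")


def wrap_stray_blocks(parent):
    """Recursively wrap children that are neither section-like nor para in <para>."""
    for index, child in enumerate(list(parent)):
        if child.tag in KEEP_AS_IS:
            continue
        if SECTION_RE.match(child.tag):
            wrap_stray_blocks(child)
            continue
        wrapper = ET.Element("para")
        # Move trailing whitespace/text so it stays after the wrapper.
        wrapper.tail, child.tail = child.tail, None
        parent[index] = wrapper  # one-for-one swap keeps later indexes valid
        wrapper.append(child)
    return parent


doc = ET.fromstring(
    "<abstract><section><title>Header One</title>"
    "<para>An unordered list follows this text:</para>"
    "<itemizedlist><listitem><para>Item one</para></listitem></itemizedlist>"
    "<sect2><title>Subheader One</title>"
    "<orderedlist><listitem><para>This</para></listitem></orderedlist>"
    "</sect2></section></abstract>"
)
wrap_stray_blocks(doc)
print(ET.tostring(doc, encoding="unicode"))
```

Unlike stripping, this keeps all of the user's content and list structure; it only adds para wrappers around it.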

As a thought experiment, how would we build this into EML if we were starting from scratch? Given the primacy of Markdown over other (arguably richer) markup languages, I would expect to use Markdown instead of DocBook.

amoeba commented 7 years ago

Aside: I just checked, and pandoc does produce valid DocBook; I just needed to specify the standalone flag (-s):

pandoc -s -f markdown -t docbook complex_abstract.md -o abstract.xml

cboettig commented 7 years ago

> As a thought experiment, how would we build this into EML if we were starting from scratch? Given the primacy of Markdown over other (arguably richer) markup languages, I would expect to use Markdown instead of DocBook.

Good question. The real challenge here is longevity, which arguably matters much more on this side than on the code side. Markdown is still a somewhat loose/divided standard, and obviously it didn't exist when the EML standard was created, at which point DocBook had a lot of momentum (Duncan Temple Lang made something effectively equivalent to RMarkdown based on R + DocBook around then).

Of course the nice thing about markdown is that it's essentially plain text, and there's obviously a strong argument for plain text as the ultimate universal archival type. I think my ideal would be to denote these as plain text fields (possibly section-delimited), with the ability to link / reference an external version with rich formatting (think Word doc, web page or PDF that might display tables, equations, and other things you might find in methods but that would be hard to show in plain text). Obviously the risk is that laziness would mean people link in place of providing text (which then couldn't be searched as easily), but I think that problem is best solved by tools like the R package that make it easier to generate good metadata (after all, laziness will always mean less metadata anywhere).