amoeba closed this 5 years ago
I played around with this and I find that Pandoc produces invalid DocBook XML by default when converting from Markdown. I'm sure there's something I don't know that would be really helpful right now.
pandoc -f markdown -t docbook complex_abstract.md -o abstract.xml
xerces -v -s -hs abstract.xml
[Error] abstract.xml:4:26: Document root element "section", must match DOCTYPE root "article".
[Error] abstract.xml:59:11: The content of element type "section" must match "(sectioninfo?,(title,subtitle?,titleabbrev?),(toc|lot|index|glossary|bibliography)*,(((calloutlist|glosslist|bibliolist|itemizedlist|orderedlist|segmentedlist|simplelist|variablelist|caution|important|note|tip|warning|literallayout|programlisting|programlistingco|screen|screenco|screenshot|synopsis|cmdsynopsis|funcsynopsis|classsynopsis|fieldsynopsis|constructorsynopsis|destructorsynopsis|methodsynopsis|formalpara|para|simpara|address|blockquote|graphic|graphicco|mediaobject|mediaobjectco|informalequation|informalexample|informalfigure|informaltable|equation|example|figure|table|msgset|procedure|sidebar|qandaset|task|anchor|bridgehead|remark|highlights|abstract|authorblurb|epigraph|indexterm|beginpage)+,(refentry*|section*|simplesect*))|refentry+|section+|simplesect+),(toc|lot|index|glossary|bibliography)*)".
abstract.xml: 4025 ms (24 elems, 7 attrs, 203 spaces, 284 chars)
After some quick fixes I was able to make it valid. This was a bit of an aside, because the valid DocBook still appears to be invalid EML.
This is probably gonna be tricky. Right, EML TextType elements use only a subset of DocBook, so in general my strategy of generating arbitrary DocBook from pandoc isn't going to be particularly robust. Perhaps it would be possible to define some filter (maybe in XSLT, maybe just in R) that could strip a DocBook file down to just the EML-recognized elements?
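For illustration, here is a rough sketch of what such a filter could look like. It is written in Python with the standard library only to keep it self-contained; the same idea would carry over to XSLT or R. The tag set `EML_TEXT_TAGS` and the function names are my own assumptions, not taken from the EML schema or any package, so the list would need checking against the actual TextType definition. Tags outside the set are unwrapped (their text is kept, the element is dropped), which loses structure.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: an illustrative guess at the tags EML TextType accepts.
# Verify against the EML schema before relying on it.
EML_TEXT_TAGS = {
    "section", "title", "para", "itemizedlist", "orderedlist", "listitem",
    "emphasis", "subscript", "superscript", "literalLayout", "ulink",
}


def _append_text(parent, text):
    """Attach text after the last child of parent, or to parent.text if empty."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def _copy_into(src, dest, allowed):
    """Copy the content of src into dest, dropping tags outside `allowed`."""
    _append_text(dest, src.text)
    for child in src:
        if child.tag in allowed:
            kept = ET.SubElement(dest, child.tag, child.attrib)
            _copy_into(child, kept, allowed)
        else:
            # Unwrap: keep the child's content, drop the element itself.
            _copy_into(child, dest, allowed)
        _append_text(dest, child.tail)


def strip_to_allowed(root, allowed=EML_TEXT_TAGS):
    """Return a filtered copy of root; the root tag itself is always kept."""
    out = ET.Element(root.tag, root.attrib)
    _copy_into(root, out, allowed)
    return out


if __name__ == "__main__":
    doc = ET.fromstring(
        '<para>See <link linkend="x">this</link> and '
        "<emphasis>that</emphasis>.</para>"
    )
    print(ET.tostring(strip_to_allowed(doc), encoding="unicode"))
```

Note that a filter like this only removes unknown elements. It does not guarantee that what remains is arranged the way EML expects, so the result may still fail validation.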
I'm not particularly happy about EML's partial docbook support, it seems like it would have been better to go for full DocBook (or perhaps html) so one could use those tools, or go for pure plain text. (I suppose the current strategy is ok if you are going EML -> DocBook, but just not okay the other way).
An alternate strategy for easily embedding richer text inputs (e.g. from Word) would be to use pandoc to convert to markdown, and then embed the markdown directly into the EML (e.g. as any other plain-text string). The nice bit about this approach is that the text remains readable to tools consuming / displaying EML that don't parse DocBook. (I suppose the metacat UI parses DocBook text nodes properly though?)
Hm, pandoc making invalid Docbook sounds like a separate issue. (Could be a version thing? Though I'd have thought the XML namespaces were clear on that.) Anyway, if pandoc really is producing invalid DocBook, it would be good to file that as an issue in the pandoc repo (or Google group; lots of smart devs over there who might be able to set us straight faster).
I'm not particularly happy about EML's partial docbook support, it seems like it would have been better to go for full DocBook (or perhaps html) so one could use those tools, or go for pure plain text. (I suppose the current strategy is ok if you are going EML -> DocBook, but just not okay the other way).
Same here. I don't know what the rationale was, though I seem to recall that it was explained to me at some point.
(I suppose the metacat UI parses DocBook text nodes properly though?)
It does not seem to. It uses an XSLT that basically destroys the structure. I ran into all of this when trying to use itemized lists. Simple sections and paras work fine.
(EML 2 is in the late stages of development (see https://github.com/NCEAS/eml/projects/1) and I wonder what @mbjones would think about using Markdown instead.)
My feeling is that it'll be hard to find a solution here without writing something special and probably hacky. Stripping the docbook XML to make valid EML would remove information so I'd be more inclined to find something that modified the user's input at least. I'll think on this some more.
Hm, pandoc making invalid Docbook sounds like a separate issue.
Probably. I'll look over there to see if I'm just doing something wrong.
Thanks! If even MetaCat displays TextTypes by stripping DocBook using an XSLT transform rather than leveraging the DocBook, I'd be strongly inclined to have the EML R package take a similar route, rather than introduce a lot of (even EML-compliant) DocBook into the metadata, where it sounds like it will primarily be a nuisance to consumers of the data.
I wonder if we could use the same XSLT transform or if it would have to be modified significantly to consume full DocBook.
I do appreciate that there's great value to having things like machine-readable methods / protocols sections in EML, and that it's hard to expect people to write those in plain text (or even in markdown, most people will use Word). There's a temptation to treat such long form documentation as external 'data' files, and maybe that's not so bad now that Word is XML based anyway, so I think there's a lot of value in being able to automate the process of creating rich text sections like methods from a Word doc, and obviously pandoc is one promising if far from perfect way to do that.
The XSLT that powers the view MetacatUI produces is here: https://code.ecoinformatics.org/code/metacat/trunk/lib/style/skins/metacatui/eml-2/eml-text.xsl It basically works only with a section or a para and just pumps any other child content out raw (via apply-templates), which amounts to ignoring the structure but including the content.
I don't off-hand know how I might go after what you're thinking, but I'm sure someone can come up with something workable. Given how the EML TextType is defined, it looks like a recursive descent parser could be used to wrap any node that isn't already a section or para in <para> elements, and we'd get realllllly close to valid EML TextType, if not all the way there.
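A minimal sketch of that recursive wrapping idea, in Python with the standard library for the sake of a runnable example. The function name `wrap_in_para` and the set `BLOCK_OK` are hypothetical, and leaving title unwrapped is my assumption about what a section may hold directly. It has not been checked against the EML schema.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: children that may sit directly inside a section without a wrapper.
BLOCK_OK = {"section", "para", "title"}


def wrap_in_para(node):
    """Recursively wrap stray children of `node` in <para> elements (in place)."""
    for index, child in enumerate(list(node)):
        if child.tag == "section":
            wrap_in_para(child)  # descend into nested sections
        elif child.tag not in BLOCK_OK:
            para = ET.Element("para")
            para.tail = child.tail  # keep trailing whitespace on the wrapper
            child.tail = None
            node.remove(child)
            node.insert(index, para)  # one-for-one swap keeps indices stable
            para.append(child)
    return node


if __name__ == "__main__":
    doc = ET.fromstring(
        "<section><title>Methods</title>"
        "<itemizedlist><listitem><para>step</para></listitem></itemizedlist>"
        "</section>"
    )
    print(ET.tostring(wrap_in_para(doc), encoding="unicode"))
```

This only descends through section elements, so anything nested inside a para or a list is left untouched. Whether that is enough to reach valid EML would need testing against real pandoc output.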
As a thought experiment, how would we build this into EML if we were starting from scratch? Given the primacy of Markdown over other (arguably richer) markup languages, I would expect to use Markdown instead of DocBook.
Aside: I just checked pandoc and it does produce valid DocBook; I just needed to specify the standalone flag:
pandoc -s -f markdown -t docbook complex_abstract.md -o abstract.xml
As a thought experiment, how would we build this into EML if we were starting from scratch? Given the primacy of Markdown over other (arguably richer) markup languages, I would expect to use Markdown instead of DocBook.
Good question. The real challenge is longevity here, which arguably matters much more on this side than on the code side. Markdown is still a somewhat loose/divided standard, and obviously it didn't exist when the EML standard was created, and DocBook had a lot of momentum (Duncan Temple Lang made something effectively the equivalent of RMarkdown based on R+DocBook around then).
Of course the nice thing about markdown is that it's essentially plain text, and there's obviously a strong argument for plain text as the ultimate universal archival type. I think my ideal would be to denote these as plain text fields (possibly section-delimited), with the ability to link / reference an external version with rich formatting (think Word doc, web page or PDF that might display tables, equations, and other things you might find in methods but would be hard to show in plain text). Obviously the risk is that laziness would mean people link in place of providing text (which couldn't be searched as easily then), but I think that problem is best solved by tools like the R package that make it easier to generate good metadata (after all, laziness will always mean less metadata anywhere).
Given a markdown doc, 'complex_abstract.md':
the following R script indicates the resulting EML document is not valid:
You can see what's going on if you view the serialized XML:
My assessment is that itemizedlist and orderedlist elements always need to go inside para elements.