coryschires closed this 1 year ago
I think the root cause here is that Pandoc seems to assume JATS paragraphs (`<p>`) can only contain inline elements (as opposed to block elements). In other words, it interprets both `<list>` and `<disp-quote>` as inline elements.
I say this from observing the `isBlockElement` function, where inline elements are pretty much the same list of elements allowed inside a paragraph (as defined in the JATS spec here):
I think... a paragraph cannot contain block elements in the AST? (As per https://hackage.haskell.org/package/pandoc-types-1.23/docs/src/Text.Pandoc.Definition.html#Para, `Para` is a list of `Inline`s.) In that sense I'm not sure how much we can do here (but somebody please confirm), unless you are open to structuring the above as several paragraphs.
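To make the "several paragraphs" idea concrete, here is a rough Python sketch of the splitting step. This is not Pandoc's reader code (which is Haskell), just an illustration under stated assumptions: `split_paragraph` and `BLOCK_TAGS` are hypothetical names, and the set of block tags is limited to the two elements discussed in this issue.

```python
import xml.etree.ElementTree as ET

# Assumption: only these two JATS elements count as block-level in this sketch.
BLOCK_TAGS = {"list", "disp-quote"}


def split_paragraph(p):
    """Split a <p> element into ("para", text) and ("block", element) items."""
    blocks = []
    run = []  # inline text collected since the last block element

    def flush():
        # Emit the collected inline text as its own paragraph, if non-empty.
        text = " ".join("".join(run).split())
        if text:
            blocks.append(("para", text))
        run.clear()

    run.append(p.text or "")
    for child in p:
        if child.tag in BLOCK_TAGS:
            # A block element ends the current paragraph and stands alone.
            flush()
            blocks.append(("block", child))
        else:
            # Inline child: keep its text content in the current run.
            run.append("".join(child.itertext()))
        run.append(child.tail or "")
    flush()
    return blocks


sample = ET.fromstring(
    "<p>Before <italic>the</italic> quote."
    "<disp-quote><p>Quoted.</p></disp-quote>"
    "After the quote.</p>"
)
for kind, value in split_paragraph(sample):
    print(kind, value if kind == "para" else value.tag)
```

With this approach the text before and after the nested element becomes two separate paragraphs, so each resulting block would fit an AST where `Para` holds only inlines. The cost is that the original "one paragraph" grouping is lost.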
Actually, @coryschires I just flagged this as a more general issue in https://github.com/jgm/pandoc/issues/8889.
Solved with solution to https://github.com/jgm/pandoc/issues/8889.
Background
This is a doozy and probably something y'all have encountered in the past – tho I could not find an issue so here we go...
In JATS, both `<list>` and `<disp-quote>` elements may be contained inside a `<p>`. In the context of a research article (and perhaps more broadly) this makes perfect sense: It's not uncommon / wrong for an article to include a block quote or list inside a paragraph (as distinct from in between two paragraphs).

But HTML, for example, doesn't allow this (see https://stackoverflow.com/questions/70604379). Furthermore, Markdown doesn't have, afaik, any official (or even conventional) syntax for expressing this use case. In other words, HTML and MD have the same bias / limitation.
Problem
This esoteric nonsense starts to matter when attempting to convert JATS to Markdown. Consider the following examples.
Case 1: Block quotes
Given the following JATS:
When I run `pandoc -f jats -t markdown`
Then Pandoc produces the following Markdown:
I can see Pandoc added some `"` marks, which suggests it's doing its best given the circumstances.
But, imho, it would be better if Pandoc block-element-ified the nested block quote by instead producing the following Markdown:
Case 2: Lists
Likewise, I see similar behavior when dealing with a `<list>` nested inside `<p>`.
Given the following JATS:
When I run `pandoc -f jats -t markdown`
Then Pandoc produces the following Markdown:
And, again, I think it would be better if Pandoc instead produced:
Steps to recreate
Pandoc Version
Sample JATS XML files
nested-block-quote.xml
nested-list.xml
Commands to reproduce
As always, thanks for your time and insight! Pandoc is such a great tool. :pray: