jgm / pandoc

Universal markup converter
https://pandoc.org

JATS Reader: Odd behavior with list and block quotes when nested inside a paragraph #8804

Closed coryschires closed 1 year ago

coryschires commented 1 year ago

Background

This is a doozy and probably something y'all have encountered in the past – tho I could not find an issue so here we go...

In JATS, both <list> and <disp-quote> elements may be contained inside a <p>. In the context of a research article (and perhaps more broadly) this makes perfect sense: It's not uncommon / wrong for an article to include a block quote or list inside a paragraph (as distinct from in between two paragraphs).

But HTML, for example, doesn't allow this (see https://stackoverflow.com/questions/70604379). Furthermore, Markdown doesn't have, afaik, any official (or even conventional) syntax for expressing this use case. In other words, HTML and MD have the same bias / limitation.

Problem

This esoteric nonsense starts to matter when attempting to convert JATS to Markdown. Consider the following examples.

Case 1: Block quotes

Given the following JATS:

<p>
  Case 1: Start of a paragraph
  <disp-quote>
    <p>My block quote</p>
  </disp-quote>
  End of a paragraph
</p>

When I run pandoc -f jats -t markdown, Pandoc produces the following Markdown:

Case 1: Start of a paragraph " My block quote " End of a paragraph

I can see Pandoc added some " marks, which suggests it's doing its best given the circumstances.

But, imho, it would be better if Pandoc block-element-ified the nested block quote by instead producing the following Markdown:

Case 1: Start of a paragraph

> My block quote

End of a paragraph

Case 2: Lists

Likewise, I see similar behavior when dealing with a <list> nested inside <p>.

Given the following JATS:

<p>
  Case 2: Start of a paragraph
  <list list-type="bullet">
    <list-item>
      <label>&#8226;</label>
      <p>Red</p>
    </list-item>
    <list-item>
      <label>&#8226;</label>
      <p>Blue</p>
    </list-item>
  </list>
</p>

When I run pandoc -f jats -t markdown, Pandoc produces the following Markdown:

Case 2: Start of a paragraph • Red • Blue

And, again, I think it would be better if Pandoc instead produced:

Case 2: Start of a paragraph

-   Red
-   Blue

Steps to recreate

Pandoc Version

pandoc 3.1
Features: +server +lua
Scripting engine: Lua 5.4

Sample JATS XML files

nested-block-quote.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "journalpublishing3.dtd">
<article article-type="research-article" dtd-version="3.0" xml:lang="en" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <sec>
      <title>Nested Block Quote</title>
      <p>
        Case 1: Start of a paragraph
        <disp-quote>
          <p>My block quote</p>
        </disp-quote>
        End of a paragraph
      </p>
    </sec>
  </body>
</article>

nested-list.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "journalpublishing3.dtd">
<article article-type="research-article" dtd-version="3.0" xml:lang="en" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <sec>
      <title>Nested list</title>
      <p>
        Case 2: Start of a paragraph
        <list list-type="bullet">
          <list-item>
            <label>&#8226;</label>
            <p>Red</p>
          </list-item>
          <list-item>
            <label>&#8226;</label>
            <p>Blue</p>
          </list-item>
        </list>
      </p>
    </sec>
  </body>
</article>

Commands to reproduce

pandoc nested-block-quote.xml -f jats -t markdown -o nested-block-quote.md
pandoc nested-list.xml -f jats -t markdown -o nested-list.md
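
To check whether a given JATS file is affected before converting it, something like the following Python sketch could list the paragraphs that directly contain a <list> or <disp-quote>. The function name paragraphs_with_blocks and the BLOCK_TAGS set are made up for illustration, and limiting the check to these two elements is an assumption based on the cases above.

```python
import xml.etree.ElementTree as ET

# Assumption: only the two elements discussed in this issue are checked.
BLOCK_TAGS = {"list", "disp-quote"}


def paragraphs_with_blocks(root):
    """Return (paragraph, [block tag names]) for every <p> under `root`
    that has a <list> or <disp-quote> as a direct child."""
    hits = []
    for p in root.iter("p"):
        blocks = [child.tag for child in p if child.tag in BLOCK_TAGS]
        if blocks:
            hits.append((p, blocks))
    return hits


if __name__ == "__main__":
    # Hypothetical usage against one of the sample files above.
    root = ET.parse("nested-list.xml").getroot()
    for p, blocks in paragraphs_with_blocks(root):
        print((p.text or "").strip(), "->", blocks)
```

Paragraphs nested inside the block itself (for example the <p> inside <disp-quote>) are only reported if they in turn contain a block element.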

As always, thanks for your time and insight! Pandoc is such a great tool. :pray:

kamoe commented 1 year ago

I think the root cause here is that Pandoc seems to assume JATS paragraphs <p> can only contain inline elements (as opposed to block elements). In other words, it interprets both <list> and <disp-quote> as inline elements.

I say this from observing the isBlockElement function, where inline elements are pretty much the same list of elements allowed inside a paragraph (as defined in the JATS spec here):

https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L117-L131

I think... a paragraph cannot contain block elements in the AST? (As per https://hackage.haskell.org/package/pandoc-types-1.23/docs/src/Text.Pandoc.Definition.html#Para, Para is a list of Inlines.) In that sense I'm not sure how much we can do here (but somebody please confirm). Unless you are open to structuring the above as several paragraphs.
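
For what it's worth, the "several paragraphs" idea can be sketched as a preprocessing step on the XML, outside Pandoc. This is only a Python illustration, not how Pandoc's JATS reader works. The helper names split_paragraph and hoist_blocks are hypothetical, and treating only <list> and <disp-quote> as block elements is an assumption.

```python
import copy
import xml.etree.ElementTree as ET

# Assumption: only these two JATS elements are treated as block-level.
BLOCK_TAGS = {"list", "disp-quote"}


def split_paragraph(p):
    """Split one <p> into sibling elements: runs of inline content become
    their own <p>, and block children are hoisted out between them."""
    pieces = []
    current = ET.Element("p", dict(p.attrib))
    current.text = p.text

    def flush():
        nonlocal current
        # Keep the paragraph only if it has real text or inline children.
        if (current.text or "").strip() or len(current):
            pieces.append(current)
        current = ET.Element("p")

    for child in p:
        if child.tag in BLOCK_TAGS:
            flush()
            block = copy.deepcopy(child)
            # Text after the block starts the next paragraph.
            tail, block.tail = block.tail, None
            pieces.append(block)
            current.text = tail
        else:
            current.append(copy.deepcopy(child))
    flush()
    return pieces


def hoist_blocks(parent):
    """Recursively replace every <p> that directly contains a block
    element with the pieces returned by split_paragraph."""
    for child in list(parent):
        hoist_blocks(child)
        if child.tag == "p" and any(c.tag in BLOCK_TAGS for c in child):
            index = list(parent).index(child)
            parent.remove(child)
            for offset, piece in enumerate(split_paragraph(child)):
                parent.insert(index + offset, piece)
```

Applied to the block quote sample, this would turn the single <p> into a <p>, a <disp-quote>, and a second <p> as siblings inside <sec>. Only the first piece keeps the original paragraph's attributes, so an id is not duplicated.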

kamoe commented 1 year ago

Actually, @coryschires I just flagged this as a more general issue in https://github.com/jgm/pandoc/issues/8889.

kamoe commented 1 year ago

Solved by the fix for https://github.com/jgm/pandoc/issues/8889.