Functions isBlockElement and parseMixed cancel each other out; JATS body blocks are treated as inlines

kamoe commented 1 year ago

Background

In JATS, there exists a number of paragraph-level or body blocks, and other structurally similar elements, that sit at the same level of a .

By definition from the JATS specification, they are "elements, such as tables and figures, that are content units separated from other content visually and logically, typically with whitespace before and after them. These elements are typically used in the same places a paragraph may be used, for example, inside a section after the section title."

Now, in JATS, the element  can contain some of such block-like elements, e.g. <code>:

<p>
   Start of a paragraph
    <code>Some code</code>
   End of a paragraph
</p>

Because <code> is supposed to sit at paragraph-level, the above should look like three independent paragraphs separated by a whitespace, although in JATS it is all just one paragraph.

Pandoc seems to have taken this into account, with the parseMixed function of the JATS reader, where every element is parsed either as a block, or as an inline, as appropriate:

https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L202-L211

According tho the above, when a block-like element is found, the reader parses it as a block, creating a separate paragraph for it. Otherwise, the reader parses the element as an inline.

The problem The output of the isBlockElement function determines if the element gets to be parsed as a block or as an inline (see L203 in code extract above). However, the isBlockElement function actually only returns TRUE if the input element is a  (See discussion here). Since isBlockElement is only ever called over children of *, isBlockElement is never TRUE**, and as a result, all children of  are always and systematically parsed as inlines (L208-L2011 above are never reached).

So the whole purpose of the parsedMixed function is defeated. This is causing issue https://github.com/jgm/pandoc/issues/8804, for instance.

*This is because isBlockElement is only ever called from parseMixed, and parseMixed is only ever called from the case of parsing . ** This is because in the JATS specification  cannot contain a , in other words, no input element of isBlockElement is ever a  that could yield a value TRUE.

The root cause It all boils down to a likely mixup when the isBlockElement function was originally adapted from the DockBook reader (do scroll down to see it all):

https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L106-L132

The function lists the elements that should be considered as paragraph-level elements, then filters out any inline element. The problem is, it defines inline elements as simply any elements contained inside a paragraph (the list called inlinetags is an exact copy of all elements that can be contained inside the JATS 1.1 spec of ). The problem of this approach to define inline elements is that it does not acknowledge body blocks and other paragraph-level elements that can exist inside  elements, as defined at the beginning of this post.

The only survivor of this filter is the element  itself (which is useless in this particular context).

The solution A trivial solution would be to not filter out the inline elements from the allowed block elements in isBlockElement (This is achieved by removing L117-L131, and \\ S.fromList inlinetags from L108).

But this might not be a trivial problem, and I might as well be missing something from the history of the isBlockElement function.

Thoughts, anyone?

kamoe commented 1 year ago

@tarleb, @hamishmack, @jgm do you have any insights on why this happens?

TLDR; the functions isBlockElement and parseMixed cancel each other out in the JATS reader, because:

isBlockElement always outputs FALSE*.
parseMixed always outputs inlines**.

Therefore, as they are written now, neither of these two functions has any functional purpose. Instead of calling parseMixed the parsing case for  could just as well parse the contents of  directly as inlines (instead of going through a function that does nothing but just output inlines). Unless, that is not the purpose, and we don't really want the output of isBlockElement to always be FALSE, and parseMixed to always output inlines?

Your thoughts really appreciated.

* This is because it can only output TRUE if the input is  (Confirmation here). Alas, it is only ever called from parseMixed, and parseMixed is only ever called from the parse case of . This means the input of isBlockElement is always a child of  which, by JATS specification, can never be . ** This is because the result of isBockElement is always FALSE.

jgm commented 1 year ago

Something very weird about the definition of isBlockElement! I see what you mean: all the block-level tags are reproduced in inlinetags... I don't know why this was done, but hopefully someone involved in this part of the code can comment on the history.

kamoe commented 1 year ago

@jgm After A LOT of analysis I came to the following conclusions, as to what the rationale was when the above code was written; and best next-steps:

Points 1. and 2. below refer to the JATS 1.1. specification, the latest specification when the code was pushed in 2017(JATS 1.2. appeared in 2019, and JATS 1.3. in 2021).

Line numbers refer to the JATS reader, JATS.hs in 16f28ef.

Block elements (the lists paragraphLevel, lists, mathML, and other, L109-L116) are defined as the elements contained in the "Any combination of" group of elements in <body>(I conclude this, as an identical list of elements, in the same order, is on the JATS 1.1. specification for <body>). This list of elements is common and recycled across many section-like elements throughout the JATS specification, so it makes sense as a reference.
Inline elements (the list inlinetags, L117-L131) are defined as the elements contained in  (I conclude this, as an identical list of elements, in the same order, is on the JATS 1.1. specification for ).
It is reasonable to interpret that the intention of the list subtraction in L108 was to filter out inline elements from the list of block elements.
As explained in my post above, the criteria to define inline elements described in No. 2. is incorrect. A new criteria is proposed, where all elements in the list of block elements defined in No. 1. (and also in No. 5. below) are examined in detail against the JATS specification, to determine if they should be processed as an inline element. Only those elements deemed inline-exclusive after this analysis should be contained in the inlinetags list. The proposal is to only keep: alternatives, mml:math, tex-math, and x; rationale is here.
In addition, the general solution should be updated to JATS 1.3. to reflect current version and to align with https://github.com/jgm/pandoc/pull/8840, which already incorporated JATS 1.3. elements. This amounts to add answer, answer-set, explanation, question, question-wrap, and question-wrap-group, to the block list.

Unless @tarleb , @hamishmack , yourself, or anyone else has an objection, I will prepare a fix for the above. Otherwise do share your thoughts.

jgm commented 1 year ago

Thanks for this detailed analysis! I have a couple questions about the rationales on the linked spreadsheet.

code can be inline - This is the standard way to represent code in an inline context (at least, it's what the jats writer uses). So in general I think this should be parsed as a pandoc Code inline element. These can contain newlines, which will in general be preserved in the output format. However, some code elements should be parsed as CodeBlock. It depends on the context. If you have a <code> inside a paragraph or other inline context, it should be inline Code, otherwise CodeBlock.
disp-formula should be inline - This is what we use for display math in the jats writer, and Math elements (both inline and display) are Inline constructors for pandoc. Also disp-formula-group I think.
related-object etc. - you say that this is by definition block because it can contain line breaks. I'm not quite sure what the reasoning is here, since we have a LineBreak inline element. But in any case, the spec is clear that this can be contained in a  - and when it is, it should be parsed as an inline element.

I may be misunderstanding the broader context -- I don't have a good picture of how this parser works. But in general, I would think that we should be guided by the following criterion: If the JATS spec says that an element can appear inside , then it should be parsed as inline when it occurs inside  (or any other container that can only contain inline content). This may be consistent with what you're thinking, but I wasn't sure.

kamoe commented 1 year ago

Thanks for your quick input!

Regarding specific ways to treat specific elements, I do not feel strongly one way or the other, I suppose what is important is that everything is explained and makes sense. On that, I am happy to elaborate on your questions:

I agree code can be inline, it was probably not the best example here. My two concerns with it were: a) the requirement for it to behave like preformat and preserve line endings (but If this can be done inline, then fine); and b) what if the code content exceeds a reasonable one liner in length? Is this something the parser should detect and address? (could be programable to display one way or the other, but happy to go with the easiest way and declare it inline by default).
According to the JATS spec, disp-formula is a "Mathematical equation, expression, or formula that is to be displayed as a block (callout) within the narrative flow". For inline forumale, one should use the inline-formula element. Happy to stick to pandoc-specific constraints, but just want to understand why pandoc treats this all as inline?
OK for related-object to be treated as in line, concerned addressed.

Now, it is interesting you put in writing the criterion, since that actually explains why we have this issue. In JATS, I do not think we can assume that anything that can appear inside a  should be parsed as an inline when it occurs inside . That is exactly why the isBlockElement function ended up removing all qualifying block elements (they can all appear inside , and they are all effectively inside a  since isBlockElement is only called from children of ). What I tried to explain at the beginning of this post is, precisely, that some children of  should be treated as blocks, not inlines. See issue https://github.com/jgm/pandoc/issues/8804 concerning disp-quote for an example of how that assumption creates problems.

I suggest, In JATS, we do not make this assumption, and rather reduce that list of elements that will be treated as inlines. After your input, I believe this list would now contain alternatives, code, mml:math, tex-math, related-object, and x, happy to further add to it (e.g. potentially also disp-formula) as this discussion progresses.

jgm commented 1 year ago

what if the code content exceeds a reasonable one liner in length? Is this something the parser should detect and address?

I was thinking that the most reasonable approach would be to take it as inline if it occurs inside the context of a  element. Otherwise we're changing the semantics, since the JATS document says there is one paragraph and we'd be splitting it into two paragraphs and a code block. But if it's normal to include multiline code samples that are intended to be displayed as a block inside a JATS  -- and maybe that's what you're saying here -- I'd have to revise that view.

As for disp-formula: pandoc doesn't have a block-level construct for Math, just a Math constructor for Inline that has two variants, InlineMath and DisplayMath. In this case you'll use DisplayMath. Don't worry: it will still be displayed as a separate block in e.g. LaTeX, HTML, or docx output. So I'm confident in this case that treating it as inline is the right approach

kamoe commented 1 year ago

But if it's normal to include multiline code samples that are intended to be displayed as a block inside a JATS  -- and maybe that's what you're saying here -- I'd have to revise that view.

This is what I am saying, yes.

If pandoc assumes everything that can appear inside a  should be treated as inline if occurring inside a , then what is the point of parseMixed, in particular, lines 208-211, which are never reached?

https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L166

https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L202-L211

jgm commented 1 year ago

OK, I see, yes, that must be the intent of parseMixed (I didn't write this code and I'm not too familiar with JATS).

Question: currently the JATS writer uses <code> for inline code that is marked with a language and hence syntax-highlighted, and we use <monospace> for inline code that isn't marked with a language. Is the use of <code> for inline code just wrong? (That is, would JATS processors typically format this as a set-off block?) If so, we should change the JATS writer, and then in the reader we can treat <code> as block-level always. But if <code> is sometimes endered as inline by typical JATS processors, we might need to do something conditional.

As for disp-formula: Looking at the spec for that, I see that it can contain many kinds of elements, including an abstract, regular text, a graphic, etc. So I think we need to look at the content of the element to see how to handle it:

If the content is a <tex-math> or <mml:math> element, or an <alternatives> with one of these as a child, then create a Math DisplayMath inline element. If content is tex-math, the content of this element can be the tex math; if it's mml:math, then we need to use texmath library functions to convert the MathML to TeX.
If the content is more complex, e.g. an image or text or one of the above plus, say, an abstract, then we can treat it as a block element, putting it in a pandoc Div.

How does that sound?

kamoe commented 1 year ago

Is the use of <code> for inline code just wrong? (That is, would JATS processors typically format this as a set-off block?)

I think so.

Taylor & Francis' guide to JATS does explicitly say not to use <code> for inline content:

The <code> element is block level and is not intended to be used inline within standard text.

The JATS spec recommends <monospace> for inlines and <code> for blocks:

The <monospace> is used for monospaced words that are inline with other text, for example, computer code fragments, parameters and operators, etc. For block monospace elements, particularly where spaces and line breaks also need to be preserved, use either the generic block structural element <preformat> (which can hold ASCII art, man-machine dialog, or shape poetry) or the semantically explicit element <code> (which holds script or computer coding examples).

jgm commented 1 year ago

I'll change the writer so it doesn't use Code for inline code.

And on your side you can just treat <code> as always block-level.

kamoe commented 1 year ago

Sounds good.

Regarding disp-formula, would it not be easier to treat it as a block, and delegate special treatment to its children? We will keep mml:math and tex-math in the list of inlinetags anyway, so when the parser comes for them they will behave as inlines. It does look like disp-formula it is supposed to be a block either way: a block that contains an inline, or a block that contains many things.

To recap, all I will do is simplify the long list of inlinetags which previously contained everything that could appear in a  to only contain: mml:math, tex-math, related-object, and x

(I removed alternatives, since that can have code, and other block elements, and the same principle of letting children handle behaviour would apply).

Would this be a sensible approach?

jgm commented 1 year ago

In pandoc, the normal way of writing block math is

Einstein showed that $$e=mc^2$$. This formula is important because...

This is parsed as

[ Para
    [ Str "Einstein"
    , Space
    , Str "showed"
    , Space
    , Str "that"
    , Space
    , Math DisplayMath "e=mc^2" 
    , Str "."
    , Space
    , Str "This"
    , Space
    , Str "formula"
    , Space
    , Str "is"
    , Space
    , Str "important"
    , Space
    , Str "because\8230"
    ]
]

So there is no separate block for the display formula. If we treated disp-formula as a block (say, a special div), then we'd end up with a new paragraph after the formula, which isn't what is wanted. (For example, in processing with LaTeX, if you had a new paragraph after the formula you'd get unwanted indentation.)

When you pass the above through the JATS writer you get:

<p>Einstein showed that <disp-formula><alternatives>
<tex-math><![CDATA[e=mc^2]]></tex-math>
<mml:math display="block" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mi>e</mml:mi><mml:mo>=</mml:mo><mml:mi>m</mml:mi><mml:msup><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:math></alternatives></disp-formula>.
This formula is important because…</p>

and it is important that passing this JATS back through the pandoc reader should parse the same way as the original.

That is why disp-formula needs special conditional treatment. Or, just always treat it as inline and ignore contents that can't be represented that way -- that would be better than taking it as a block.

It seems to me that alternatives also needs special treatment, since the contents may be inline and in that case we wouldn't want to start a new block. Also, note that a disp-formula may contain both tex math and mml math inside an alternativse, and we want to choose only one.

kamoe commented 1 year ago

Why does the native representation of the formula above use DisplayMath? Why is it not:

[ Para
    [ Str "Einstein"
    , Space
    , Str "showed"
    , Space
    , Str "that"
    , Space
    , Math InlineMath "e=mc^2" 
    , Str "."
    , Space
    , Str "This"
    , Space
    , Str "formula"
    , Space
    , Str "is"
    , Space
    , Str "important"
    , Space
    , Str "because\8230"
    ]
]

jgm commented 1 year ago

Because in pandoc markdown $$x$$ means display math and $x$ means inline math (as in TeX).

jgm commented 1 year ago

Semantically the formula is part of the paragraph. Even if it gets rendered as a separated block, we don't start a new paragraph after the formula but continue the preceding paragraph. (Hence, no indent if paragraph indentation is called for in the style.)

kamoe commented 1 year ago

Got it.

To recap, in the JATS reader:

Simplify the long list of inlinetags which previously contained everything that could appear in a  to only contain: mml:math, tex-math, related-object, and x
Add logic to handle disp-formula and alternatives as either block or inline, depending on their contents.

Have we covered it?

jgm commented 1 year ago

I think I'm still confused -- I'm sorry if I'm missing the context.

Certainly you don't mean to say that the only tags that should be parsed as inline elements (e.g. in a Para or Header text) are mml:math, tex-math, related-object, and x? What about, e.g., bold, italic, strike, inline-graphic, etc.? What am I missing?

kamoe commented 1 year ago

No worries. That is not what I mean.

In the context of isBlockElement, inlinetags isn't supposed to be a comprehensive list of all inline elements, but a list of exceptions to the list of all candidate block elements (defined as elements that occur directly inside <body>). Inline-only elements like <bold>, <italics>, etc. do not occur directly inside <body>, and thus are not part of any of the list of candidate block elements to start with (they are not in any of the lists paragraphLevel, lists, mathML, or other), so it is pointless to add them to the exception list. https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L107-L108

Whatever we do to the list inlinetags, applying isBlockElement to bold or italic, it will output FALSE anyway. I could rename inlinetags to exceptions so that this is less confusing.

jgm commented 1 year ago

OK, thanks for the clarification. The code might be a bit unnecessarily confusing with the name inlineTags. Perhaps canBeInline or something would be clearer?

kamoe commented 1 year ago

Absolutely.

To recap, in the isBlockElement function in the JATS reader:

Rename the list inlinetags to canBeInline
Simplify said list, which previously contained everything that could appear in a , to only contain: mml:math, tex-math, related-object, and x
Add logic to handle disp-formula and alternatives as either block or inline, depending on their contents.

If no further comments, or objections, I'll prepare a fix.

jgm commented 1 year ago

Looks good!

Of course it would be good to have some tests so we can see clearly the effects of this change.

kamoe commented 1 year ago

Alright, here is the fix: https://github.com/jgm/pandoc/pull/8971

It is failing 1 check, though. I am not sure if that is critical or normal. @jgm, do let me know if I should be doing something else.

The test output was as expected, and minor discrepancies exist with the previous test output file because the test file (test/jats-reader.xml) was not JATS compliant and I cleaned it up.

jgm / pandoc

Functions isBlockElement and parseMixed cancel each other out; JATS body blocks are treated as inlines #8889