Closed kamoe closed 1 year ago
@tarleb, @hamishmack, @jgm do you have any insights on why this happens?
TL;DR: the functions `isBlockElement` and `parseMixed` cancel each other out in the JATS reader, because:
`isBlockElement` always outputs FALSE*.
`parseMixed` always outputs inlines**.
Therefore, as they are written now, neither of these two functions has any functional purpose. Instead of calling `parseMixed`, the parsing case for `<p>` could just as well parse the contents of `<p>` directly as inlines (instead of going through a function that does nothing but output inlines). Unless that is not the purpose, and we don't really want the output of `isBlockElement` to always be FALSE and `parseMixed` to always output inlines?
Your thoughts are really appreciated.
* This is because it can only output TRUE if the input is `<p>` (Confirmation here). Alas, it is only ever called from `parseMixed`, and `parseMixed` is only ever called from the parse case of `<p>`. This means the input of `isBlockElement` is always a child of `<p>` which, by JATS specification, can never be `<p>`.
** This is because the result of `isBlockElement` is always FALSE.
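To make the cancellation concrete, here is a minimal, self-contained model of the filter. It is only a sketch: the tag lists are abbreviated and illustrative (not the actual lists in JATS.hs), the predicate works on plain tag names rather than XML content, and `candidateBlocks` and `childrenOfP` are hypothetical names. The one property carried over from the analysis above is that every candidate block tag other than `p` also appears in `inlinetags`.

```haskell
import Data.List ((\\))

-- Abbreviated, illustrative tag lists (NOT the full lists from JATS.hs).
candidateBlocks :: [String]
candidateBlocks = ["p", "code", "disp-quote", "list", "table-wrap"]

-- Everything that may appear inside <p>; note that <p> itself is absent.
inlinetags :: [String]
inlinetags = ["bold", "italic", "code", "disp-quote", "list", "table-wrap"]

-- Candidate blocks minus the "inline" tags: the shape of filter discussed above.
blocktags :: [String]
blocktags = candidateBlocks \\ inlinetags

-- Simplified predicate over tag names (assumption: the real one inspects XML content).
isBlockElement :: String -> Bool
isBlockElement tag = tag `elem` blocktags

-- Tags that can actually reach isBlockElement, i.e. children of <p>.
childrenOfP :: [String]
childrenOfP = inlinetags
```

With these lists, `blocktags` evaluates to `["p"]`, and `any isBlockElement childrenOfP` is `False`: the only tag that would count as a block is one that can never be passed in.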
Something very weird about the definition of `isBlockElement`! I see what you mean: all the block-level tags are reproduced in `inlinetags`... I don't know why this was done, but hopefully someone involved in this part of the code can comment on the history.
@jgm After A LOT of analysis I came to the following conclusions as to what the rationale was when the above code was written, and the best next steps:
Points 1 and 2 below refer to the JATS 1.1 specification, the latest specification when the code was pushed in 2017 (JATS 1.2 appeared in 2019, and JATS 1.3 in 2021).
Line numbers refer to the JATS reader, JATS.hs in 16f28ef.
1. The candidate block elements (`paragraphLevel`, `lists`, `mathML`, and `other`, L109-L116) are defined as the elements contained in the "Any combination of" group of elements in `<body>` (I conclude this, as an identical list of elements, in the same order, appears in the JATS 1.1 specification for `<body>`). This list of elements is common and recycled across many section-like elements throughout the JATS specification, so it makes sense as a reference.
2. The inline elements (`inlinetags`, L117-L131) are defined as the elements contained in `<p>` (I conclude this, as an identical list of elements, in the same order, appears in the JATS 1.1 specification for `<p>`).
3. Simplify the `inlinetags` list. The proposal is to only keep: `alternatives`, `mml:math`, `tex-math`, and `x`; rationale is here.
4. Add `answer`, `answer-set`, `explanation`, `question`, `question-wrap`, and `question-wrap-group` to the block list.
Unless @tarleb, @hamishmack, yourself, or anyone else has an objection, I will prepare a fix for the above. Otherwise do share your thoughts.
Thanks for this detailed analysis! I have a couple of questions about the rationales on the linked spreadsheet.
`code` can be inline - This is the standard way to represent code in an inline context (at least, it's what the jats writer uses). So in general I think this should be parsed as a pandoc `Code` inline element. These can contain newlines, which will in general be preserved in the output format. However, some `code` elements should be parsed as `CodeBlock`. It depends on the context. If you have a `<code>` inside a paragraph or other inline context, it should be inline `Code`, otherwise `CodeBlock`.
`disp-formula` should be inline - This is what we use for display math in the jats writer, and Math elements (both inline and display) are Inline constructors for pandoc. Also `disp-formula-group`, I think.
`related-object` etc. - you say that this is by definition block because it can contain line breaks. I'm not quite sure what the reasoning is here, since we have a LineBreak inline element. But in any case, the spec is clear that this can be contained in a `<p>` - and when it is, it should be parsed as an inline element.
I may be misunderstanding the broader context -- I don't have a good picture of how this parser works. But in general, I would think that we should be guided by the following criterion: If the JATS spec says that an element can appear inside `<p>`, then it should be parsed as inline when it occurs inside `<p>` (or any other container that can only contain inline content). This may be consistent with what you're thinking, but I wasn't sure.
Thanks for your quick input!
Regarding specific ways to treat specific elements, I do not feel strongly one way or the other; I suppose what is important is that everything is explained and makes sense. On that, I am happy to elaborate on your questions:
I agree `code` can be inline; it was probably not the best example here. My two concerns with it were: a) the requirement for it to behave like `preformat` and preserve line endings (but if this can be done inline, then fine); and b) what if the code content exceeds a reasonable one liner in length? Is this something the parser should detect and address? (It could be programmable to display one way or the other, but I am happy to go with the easiest way and declare it inline by default.)
According to the JATS spec, `disp-formula` is a "Mathematical equation, expression, or formula that is to be displayed as a block (callout) within the narrative flow". For inline formulae, one should use the `inline-formula` element. Happy to stick to pandoc-specific constraints, but I just want to understand why pandoc treats this all as inline.
OK for `related-object` to be treated as inline, concern addressed.
Now, it is interesting that you put the criterion in writing, since that actually explains why we have this issue. In JATS, I do not think we can assume that anything that can appear inside a `<p>` should be parsed as an inline when it occurs inside `<p>`. That is exactly why the `isBlockElement` function ended up removing all qualifying block elements (they can all appear inside `<p>`, and they are all effectively inside a `<p>` since `isBlockElement` is only called on children of `<p>`). What I tried to explain at the beginning of this post is, precisely, that some children of `<p>` should be treated as blocks, not inlines. See issue https://github.com/jgm/pandoc/issues/8804 concerning `disp-quote` for an example of how that assumption creates problems.
I suggest that, in JATS, we do not make this assumption, and rather reduce the list of elements that will be treated as inlines. After your input, I believe this list would now contain `alternatives`, `code`, `mml:math`, `tex-math`, `related-object`, and `x`; happy to add to it further (e.g. potentially also `disp-formula`) as this discussion progresses.
> what if the code content exceeds a reasonable one liner in length? Is this something the parser should detect and address?

I was thinking that the most reasonable approach would be to take it as inline if it occurs inside the context of a `<p>` element. Otherwise we're changing the semantics, since the JATS document says there is one paragraph and we'd be splitting it into two paragraphs and a code block. But if it's normal to include multiline code samples that are intended to be displayed as a block inside a JATS `<p>` -- and maybe that's what you're saying here -- I'd have to revise that view.
As for `disp-formula`: pandoc doesn't have a block-level construct for Math, just a Math constructor for Inline that has two variants, InlineMath and DisplayMath. In this case you'll use DisplayMath. Don't worry: it will still be displayed as a separate block in e.g. LaTeX, HTML, or docx output. So I'm confident in this case that treating it as inline is the right approach.
> But if it's normal to include multiline code samples that are intended to be displayed as a block inside a JATS `<p>` -- and maybe that's what you're saying here -- I'd have to revise that view.

This is what I am saying, yes.
If pandoc assumes everything that can appear inside a `<p>` should be treated as inline if occurring inside a `<p>`, then what is the point of `parseMixed`, in particular lines 208-211, which are never reached?
OK, I see, yes, that must be the intent of `parseMixed` (I didn't write this code and I'm not too familiar with JATS).
Question: currently the JATS writer uses `<code>` for inline code that is marked with a language and hence syntax-highlighted, and we use `<monospace>` for inline code that isn't marked with a language. Is the use of `<code>` for inline code just wrong? (That is, would JATS processors typically format this as a set-off block?) If so, we should change the JATS writer, and then in the reader we can treat `<code>` as block-level always. But if `<code>` is sometimes rendered as inline by typical JATS processors, we might need to do something conditional.
As for `disp-formula`: Looking at the spec for that, I see that it can contain many kinds of elements, including an abstract, regular text, a graphic, etc. So I think we need to look at the content of the element to see how to handle it:
- If it contains a `<tex-math>` or `<mml:math>` element, or an `<alternatives>` with one of these as a child, then create a `Math DisplayMath` inline element. If content is `tex-math`, the content of this element can be the tex math; if it's `mml:math`, then we need to use texmath library functions to convert the MathML to TeX.
How does that sound?
> Is the use of `<code>` for inline code just wrong? (That is, would JATS processors typically format this as a set-off block?)

I think so.
Taylor & Francis' guide to JATS does explicitly say not to use `<code>` for inline content:
> The `<code>` element is block level and is not intended to be used inline within standard text.

The JATS spec recommends `<monospace>` for inlines and `<code>` for blocks:
> The `<monospace>` is used for monospaced words that are inline with other text, for example, computer code fragments, parameters and operators, etc. For block monospace elements, particularly where spaces and line breaks also need to be preserved, use either the generic block structural element `<preformat>` (which can hold ASCII art, man-machine dialog, or shape poetry) or the semantically explicit element `<code>` (which holds script or computer coding examples).

I'll change the writer so it doesn't use Code for inline code.
And on your side you can just treat `<code>` as always block-level.
Sounds good.
Regarding `disp-formula`, would it not be easier to treat it as a block, and delegate special treatment to its children? We will keep `mml:math` and `tex-math` in the list of `inlinetags` anyway, so when the parser comes to them they will behave as inlines. It does look like `disp-formula` is supposed to be a block either way: a block that contains an inline, or a block that contains many things.
To recap, all I will do is simplify the long list of `inlinetags`, which previously contained everything that could appear in a `<p>`, to only contain: `mml:math`, `tex-math`, `related-object`, and `x` (I removed `alternatives`, since that can have `code` and other block elements, and the same principle of letting children handle behaviour would apply).
Would this be a sensible approach?
In pandoc, the normal way of writing block math is
Einstein showed that $$e=mc^2$$. This formula is important because...
This is parsed as
[ Para
[ Str "Einstein"
, Space
, Str "showed"
, Space
, Str "that"
, Space
, Math DisplayMath "e=mc^2"
, Str "."
, Space
, Str "This"
, Space
, Str "formula"
, Space
, Str "is"
, Space
, Str "important"
, Space
, Str "because\8230"
]
]
So there is no separate block for the display formula. If we treated disp-formula as a block (say, a special div), then we'd end up with a new paragraph after the formula, which isn't what is wanted. (For example, in processing with LaTeX, if you had a new paragraph after the formula you'd get unwanted indentation.)
When you pass the above through the JATS writer you get:
<p>Einstein showed that <disp-formula><alternatives>
<tex-math><![CDATA[e=mc^2]]></tex-math>
<mml:math display="block" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mi>e</mml:mi><mml:mo>=</mml:mo><mml:mi>m</mml:mi><mml:msup><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:math></alternatives></disp-formula>.
This formula is important because…</p>
and it is important that passing this JATS back through the pandoc reader should parse the same way as the original.
That is why disp-formula needs special conditional treatment. Or, just always treat it as inline and ignore contents that can't be represented that way -- that would be better than taking it as a block.
It seems to me that `alternatives` also needs special treatment, since the contents may be inline and in that case we wouldn't want to start a new block. Also, note that a disp-formula may contain both tex math and mml math inside an `alternatives`, and we want to choose only one.
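The "choose only one" step could be sketched roughly as follows. This is a toy model under stated assumptions: `Element` is a hypothetical stand-in for the XML element type the reader actually uses, `mathCandidates` and `chooseMath` are made-up names, and preferring `tex-math` over `mml:math` is an assumption, not something settled in this thread. Converting MathML to TeX is left out entirely.

```haskell
import Data.Maybe (listToMaybe)

-- Toy XML element: tag name, text content, child elements (hypothetical type).
data Element = Element
  { elTag      :: String
  , elText     :: String
  , elChildren :: [Element]
  } deriving (Eq, Show)

-- Candidate formula representations, looking through one level of <alternatives>.
mathCandidates :: Element -> [Element]
mathCandidates e = concatMap expand (elChildren e)
  where
    expand c
      | elTag c == "alternatives" = elChildren c
      | otherwise                 = [c]

-- Pick exactly one representation, preferring tex-math over mml:math
-- (assumed preference). Returns Nothing when the formula holds neither,
-- so a caller can fall back to some other treatment.
chooseMath :: Element -> Maybe (String, String)
chooseMath e = listToMaybe
  [ (elTag c, elText c)
  | wanted <- ["tex-math", "mml:math"]
  , c <- mathCandidates e
  , elTag c == wanted
  ]
```

A caller would wrap a `Just` result in a display-math inline and treat `Nothing` as the case that needs other handling.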
Why does the native representation of the formula above use `DisplayMath`? Why is it not:
[ Para [ Str "Einstein" , Space , Str "showed" , Space , Str "that" , Space , Math InlineMath "e=mc^2" , Str "." , Space , Str "This" , Space , Str "formula" , Space , Str "is" , Space , Str "important" , Space , Str "because\8230" ] ]
Because in pandoc markdown `$$x$$` means display math and `$x$` means inline math (as in TeX).
Semantically the formula is part of the paragraph. Even if it gets rendered as a separated block, we don't start a new paragraph after the formula but continue the preceding paragraph. (Hence, no indent if paragraph indentation is called for in the style.)
Got it.
To recap, in the JATS reader:
- Simplify the long list of `inlinetags`, which previously contained everything that could appear in a `<p>`, to only contain: `mml:math`, `tex-math`, `related-object`, and `x`.
- Add logic to handle `disp-formula` and `alternatives` as either block or inline, depending on their contents.
Have we covered it?
I think I'm still confused -- I'm sorry if I'm missing the context.
Certainly you don't mean to say that the only tags that should be parsed as inline elements (e.g. in a Para or Header text) are mml:math, tex-math, related-object, and x? What about, e.g., bold, italic, strike, inline-graphic, etc.? What am I missing?
No worries. That is not what I mean.
In the context of `isBlockElement`, `inlinetags` isn't supposed to be a comprehensive list of all inline elements, but a list of exceptions to the list of all candidate block elements (defined as elements that occur directly inside `<body>`). Inline-only elements like `<bold>`, `<italic>`, etc. do not occur directly inside `<body>`, and thus are not part of any of the lists of candidate block elements to start with (they are not in any of the lists `paragraphLevel`, `lists`, `mathML`, or `other`), so it is pointless to add them to the exception list.
https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L107-L108
Whatever we do to the list `inlinetags`, applying `isBlockElement` to `bold` or `italic` will output FALSE anyway. I could rename `inlinetags` to `exceptions` so that this is less confusing.
OK, thanks for the clarification. The code might be a bit unnecessarily confusing with the name `inlinetags`. Perhaps `canBeInline` or something would be clearer?
Absolutely.
To recap, in the `isBlockElement` function in the JATS reader:
- Rename the list `inlinetags` to `canBeInline`.
- Simplify said list, which previously contained everything that could appear in a `<p>`, to only contain: `mml:math`, `tex-math`, `related-object`, and `x`.
- Add logic to handle `disp-formula` and `alternatives` as either block or inline, depending on their contents.
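As a rough illustration of the first two recap items, the revised predicate might look like the sketch below. It is not the actual patch: `candidateBlocks` is a hypothetical, abbreviated list, the predicate works on plain tag names, and `disp-formula` and `alternatives` are left out because they need the content-dependent handling from the third item.

```haskell
import Data.List ((\\))

-- Abbreviated candidate block tags (illustrative, not the full JATS.hs lists).
candidateBlocks :: [String]
candidateBlocks =
  [ "p", "code", "disp-quote", "list", "table-wrap"
  , "tex-math", "mml:math", "related-object", "x" ]

-- Exceptions: candidate tags that should still be read as inlines.
canBeInline :: [String]
canBeInline = ["mml:math", "tex-math", "related-object", "x"]

-- Simplified predicate over tag names (assumption: the real one inspects XML content).
isBlockElement :: String -> Bool
isBlockElement tag = tag `elem` (candidateBlocks \\ canBeInline)
```

Under these assumptions `isBlockElement "disp-quote"` is `True`, `isBlockElement "tex-math"` is `False`, and `isBlockElement "bold"` is `False` simply because `bold` was never a candidate.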
If no further comments, or objections, I'll prepare a fix.
Looks good!
Of course it would be good to have some tests so we can see clearly the effects of this change.
Alright, here is the fix: https://github.com/jgm/pandoc/pull/8971
It is failing 1 check, though. I am not sure if that is critical or normal. @jgm, do let me know if I should be doing something else.
The test output was as expected, and minor discrepancies exist with the previous test output file because the test file (test/jats-reader.xml) was not JATS compliant and I cleaned it up.
Background
In JATS, there exist a number of paragraph-level or body blocks, and other structurally similar elements, that sit at the same level as a `<p>`. By definition from the JATS specification, they are "elements, such as tables and figures, that are content units separated from other content visually and logically, typically with whitespace before and after them. These elements are typically used in the same places a paragraph may be used, for example, inside a section after the section title."
Now, in JATS, the element `<p>` can contain some of these block-like elements, e.g. `<code>`:
Because `<code>` is supposed to sit at paragraph level, the above should look like three independent paragraphs separated by whitespace, although in JATS it is all just one paragraph.
Pandoc seems to have taken this into account with the `parseMixed` function of the JATS reader, where every element is parsed either as a block or as an inline, as appropriate:
https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L202-L211
According to the above, when a block-like element is found, the reader parses it as a block, creating a separate paragraph for it. Otherwise, the reader parses the element as an inline.
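The intended splitting can be modelled with a small, self-contained sketch. It is not the code from JATS.hs: `Content` and `Block` are toy types invented for illustration, the predicate is assumed to recognise only `<code>` as block-level, and the real function works in a parser monad and builds pandoc blocks.

```haskell
-- Toy stand-ins for XML content and pandoc blocks (hypothetical types).
data Content = Text String | Elem String String  -- Elem tag body
  deriving (Eq, Show)

data Block = Para [String] | CodeBlock String
  deriving (Eq, Show)

-- Assumed predicate: in this sketch only <code> counts as block-level.
isBlockElement :: Content -> Bool
isBlockElement (Elem "code" _) = True
isBlockElement _               = False

-- Render one inline item as plain text.
inlineText :: Content -> String
inlineText (Text s)   = s
inlineText (Elem _ s) = s

-- Split mixed content: runs of inlines become paragraphs,
-- block-level children become blocks of their own.
parseMixed :: [Content] -> [Block]
parseMixed [] = []
parseMixed conts =
  let (ils, rest) = break isBlockElement conts
      para        = [Para (map inlineText ils) | not (null ils)]
  in case rest of
       []              -> para
       (Elem _ b : rs) -> para ++ [CodeBlock b] ++ parseMixed rs
       (Text _ : rs)   -> para ++ parseMixed rs  -- unreachable: Text is never block-level
```

For example, `parseMixed [Text "Before", Elem "code" "x = 1", Text "After"]` gives `[Para ["Before"], CodeBlock "x = 1", Para ["After"]]`. If the predicate returned `False` for every input, the same call would produce a single `Para` holding all three items.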
The problem
The output of the `isBlockElement` function determines whether the element gets parsed as a block or as an inline (see L203 in the code extract above). However, the `isBlockElement` function actually only returns TRUE if the input element is a `<p>` (See discussion here). Since `isBlockElement` is only ever called on children of `<p>`*, `isBlockElement` is never TRUE**, and as a result, all children of `<p>` are always and systematically parsed as inlines (L208-L211 above are never reached).
So the whole purpose of the `parseMixed` function is defeated. This is causing issue https://github.com/jgm/pandoc/issues/8804, for instance.
* This is because `isBlockElement` is only ever called from `parseMixed`, and `parseMixed` is only ever called from the case of parsing `<p>`.
** This is because in the JATS specification `<p>` cannot contain a `<p>`; in other words, no input element of `isBlockElement` is ever a `<p>` that could yield a value TRUE.
The root cause
It all boils down to a likely mix-up when the `isBlockElement` function was originally adapted from the DocBook reader (do scroll down to see it all):
https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L106-L132
The function lists the elements that should be considered paragraph-level elements, then filters out any inline element. The problem is, it defines inline elements as simply any elements that can be contained inside a paragraph (the list called `inlinetags` is an exact copy of all elements that can be contained inside `<p>` per the JATS 1.1 spec). The problem with this approach to defining inline elements is that it does not acknowledge body blocks and other paragraph-level elements that can exist inside `<p>` elements, as defined at the beginning of this post.
The only survivor of this filter is the element `<p>` itself (which is useless in this particular context).
The solution
A trivial solution would be to not filter out the inline elements from the allowed block elements in `isBlockElement` (this is achieved by removing L117-L131, and `\\ S.fromList inlinetags` from L108).
But this might not be a trivial problem, and I might well be missing something from the history of the `isBlockElement` function.
Thoughts, anyone?