Closed coryschires closed 1 year ago
My approach would be to add a `statement` case to the `parseBlock` function in file `src/Text/Pandoc/Readers/JATS.hs` at line 164 ff. The local `parseFigure` and similar functions look like they could serve as a starting point.
The way the `parseBlock` function is written in the JATS reader, it seems to me that this will be the case for all elements containing `<label>` and `<title>`, not only `<statement>`. Lines 185-186 systematically filter them out:
"title" -> return mempty -- processed by header
"label" -> return mempty -- processed by header
This applies to all elements except the `<sec>` element, which handles `<label>` and `<title>` specifically, in lines 317-334:

    sect n = do isbook <- gets jatsBook
                let n' = if isbook || n == 0 then n + 1 else n
                labelText <- case filterChild (named "label") e of
                               Just t -> (<> ("." <> space)) <$> getInlines t
                               Nothing -> return mempty
                headerText <- case filterChild (named "title") e `mplus`
                                   (filterChild (named "info") e >>= filterChild (named "title")) of
                                Just t -> (labelText <>) <$> getInlines t
                                Nothing -> return mempty
                oldN <- gets jatsSectionLevel
                modify $ \st -> st{ jatsSectionLevel = n }
                b <- getBlocks e
                let ident = attrValue "id" e
                modify $ \st -> st{ jatsSectionLevel = oldN }
                return $ headerWith (ident,[],[]) n' headerText <> b

(invoked from line 180):
"sec" -> gets jatsSectionLevel >>= sect . (+1)
So before defining an approach, I would like to ask: why was there an assumption that `<title>` and `<label>` were always section headings?
A solution would be to somehow expand on @tarleb's suggestion, but for all elements affected, e.g. `<bio>`, `<notes>`, `<glossary>`, `<kwd-group>`, `<app>`, `<app-group>`, `<back>`, `<abstract>`, `<ack>`, which all contain `<title>` and `<label>` elements that get filtered out if not inside `<sec>`s. But maybe adding a case for each one in `parseBlock` is not the most efficient course of action? Is there a way to address this from the root (lines 185-186)?
@hamishmack I know that this is a long shot for code that was written 6 years ago, but do you happen to remember the reasons for why things mentioned above are the way they are?
The code for the JATS reader was based on the DocBook one. Perhaps this is just something we should have changed for JATS, but did not.
Here is the code from the DocBook.hs file at the time: https://github.com/jgm/pandoc/blob/5d3c9e56460165be452b672f12fc476e7a5ed3a9/src/Text/Pandoc/Readers/DocBook.hs#L893-L904
That code has changed only slightly and now looks like this: https://github.com/jgm/pandoc/blob/509cb9b8feae6798cb77bc35637297e9301d682e/src/Text/Pandoc/Readers/DocBook.hs#L1081-L1093
The two changes were https://github.com/jgm/pandoc/commit/12a35dd0d0f7363ad5b85ab859925113c65aa61f and https://github.com/jgm/pandoc/commit/40aa74badc2686b8b9a4ae7f836529cec2f4779b.
+1 to https://github.com/jgm/pandoc/issues/8718#issuecomment-1489322480
I agree this is a more general problem than I initially understood. I encountered the same problem when working with JATS's `<ack>` tag. Like `<statement>`, `<ack>` can include nested `<label>` and/or `<title>` tags. These tags are similarly dropped when converting from JATS to MD.
Also, agreeing with https://github.com/jgm/pandoc/issues/8718#issuecomment-1489322480, I was able to work around this problem by converting the `<ack>` to `<sec>`. After making that change, the `<label>` / `<title>` are retained.
@hamishmack I think the problem is not in the `sect` function but in `parseBlock`.
If the JATS reader was based on the DocBook one, I can see how it could be assumed that section headers would handle all possible occurrences of `<title>` and `<label>`. But the truth is, they don't. In the JATS reader, the two `return mempty` lines for `"title"` and `"label"` (lines 185-186) completely write off the content of `<title>` and `<label>` elements outside of `<sec>` elements (which handle headers).
I can see that if we just remove these two lines, then we will duplicate `<title>` and `<label>` content in `<sec>` elements. Therefore, one improvement that is in order is to consider the specifics of the JATS spec that differ from the DocBook model, in particular that there are 30+ JATS elements other than `<sec>` that contain `<title>` and `<label>` elements, and therefore a large number of potentially missing cases in the `parseBlock` function, which could all be modeled after the cases of either `<sec>` or `<caption>`.
I am not sure if there is another more compact solution.
The question is how the `<title>` should be represented in these other elements. In a section, it becomes the contents of a Header element.
:spiral_notepad: Based on the JATS Publishing Spec which differs slightly from the JATS Archiving Spec :spiral_notepad:
How should `<title>` be represented in these other elements?

TLDR: The JATS `<title>` tag should translate to a `Header` in all cases.
I've laid out all possible use cases in exhaustive detail. That said, I'll admit it's a little hard to give definitive answers in all cases because (a) the JATS spec is unfortunately loose and (b) many of these elements are not commonly seen in the wild (e.g. `<question-preamble>`), making it difficult to know how they are most commonly used.
Where possible, I based my answers on examples from the spec or cases I found in the wild (e.g. on PLOS ONE). When I couldn't find examples, I simply used my best judgement. (I'm certainly not the world's premier JATS expert but, fwiw, I am a member of the JATS Standing Committee so I'm not clueless).
The `<title>` tag may be contained in:

- `<abstract>`: `Header` element
- `<ack>`: `Header` element
- `<answer>`: `Header` element
- `<answer-set>`: `Header` element
- `<app>`: `Header` element
- `<app-group>`: `Header` element
- `<author-comment>`: `Header` element
- `<author-notes>`: `Header` element
- `<back>`: `Header` element
- `<bio>`: `Header` element
- `<caption>`: `Header` element
- `<def-list>`: `Header` element
- `<disp-quote>`: `Header` element
- `<explanation>`: `Header` element
- `<fn-group>`: `Header` element
- `<glossary>`: `Header` element
- `<kwd-group>`: `Header` element
- `<list>`: `Header` element
- `<list-item>`: `Header` element
- `<notes>`: `Header` element
- `<option>`: `Header` element
- `<question>`: `Header` element
- `<question-preamble>`: `Header` element
- `<question-wrap-group>`: `Header` element
- `<ref-list>`: `Header` element
- `<sec>`: `Header` element
- `<statement>`: `Header` element
- `<supplement>`: `Header` element
- `<table-wrap-foot>`: `Header` element
- `<trans-abstract>`: `Header` element
- `<verse-group>`: `Header` element

`<label>` and `<title>`?

TLDR: Both `<label>` and `<title>` should be converted into `Header`.
Same disclaimer as above: I'm doing my best to grapple with inherent (and unfortunate) ambiguity.
According to the JATS spec:

- `<label>`: Number and/or prefix word placed at the beginning of display elements (for example, equation, statement, figure).
- `<title>`: Heading or title for a structural element (for example, `<sec>`, `<app>`, `<boxed-text>`).

Furthermore, in most (perhaps all?) cases, an element (e.g. `<sec>`) can contain both a `<label>` and `<title>`, and they must appear in that order:

    <sec>
      <label>3.</label>
      <title>Conclusions</title>
    </sec>

However, it's also common to have only one of `<label>` or `<title>`:

    <sec>
      <title>Conclusions</title>
    </sec>

    <statement>
      <label>Hypothesis 1</label>
      <p>Buyer preferences for companies are influenced...</p>
    </statement>

`<label>` / `<title>` combinations?

Given a `<sec>` includes both a `<label>` and `<title>`

    <sec>
      <label>3.</label>
      <title>Conclusions</title>
    </sec>

Then collapse them into a single `Header`

    # 3. Conclusions

Given a `<statement>` includes only a `<label>`

    <statement>
      <label>Hypothesis 1</label>
      <p>Buyer preferences for companies are influenced...</p>
    </statement>

Then convert the `<label>` into a `Header`

    # Hypothesis 1
    Buyer preferences for companies are influenced...

Given a `<sec>` includes only a `<title>`

    <sec>
      <title>Conclusions</title>
    </sec>

Then convert the `<title>` into a `Header`

    # Conclusions

Hope this helps clarify possible next steps!
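The three label/title cases listed above can be sketched as one small pure function. This is only a toy illustration, not code from pandoc's JATS reader: the name `headingText` is hypothetical, and it works on plain strings rather than pandoc's `Inlines`.

```haskell
import Data.Maybe (catMaybes)

-- | Hypothetical helper: combine an optional <label> and an optional
-- <title> into the text of a single heading.
-- Returns Nothing when neither is present, so no Header would be emitted.
headingText :: Maybe String -> Maybe String -> Maybe String
headingText label title =
  case catMaybes [label, title] of
    []    -> Nothing               -- neither label nor title
    parts -> Just (unwords parts)  -- label first, then title
```

With this sketch, `headingText (Just "3.") (Just "Conclusions")` gives `Just "3. Conclusions"`, and a lone label or a lone title is passed through unchanged.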
As far as I understand, the JATS reader has been written to comply with JATS Archiving and Interchange (the element `<x>`, present only in Archiving, is acknowledged in the `isBlock` function of the JATS reader), I suppose on purpose, since it is the most complete of the three (it has the most elements and the most options). Making Pandoc JATS Archiving compliant makes it compliant with all three(?). But anyone please correct this assumption if wrong.
The JATS reader currently allows two elements to display `<title>` and `<label>` children. These are `<sec>` and `<caption>`. This is implemented with two cases inside the `parseBlock` function. The case for "sec" creates a Header one level higher than the current level:
whilst the case for "caption" creates a Header of level 6, whatever the current level in the document is:
I believe this just means, if subsection, create a Header one section higher; if caption, just create a small caption that is not too overwhelming for the context of the document(?).
So the real question is which of the two strategies, the +1 level or the fixed level 6, is appropriate (or if any other level is appropriate) for each of the 31 elements that contain `<title>`.
My suggestion is that any element that contains a `<sec>`, or is recursive (contains itself, or an element that contains it), should create a +1 level Header (and I think this is the case for most elements); and any element that contains more immediately contained content can get away with a level 6 strategy. But it might not be that trivial.
Then, not all elements that contain `<label>` contain `<title>`. The approach for the label-only elements might be different.
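To make the two header-level strategies discussed above concrete, here is a minimal sketch. The type `HeaderStrategy` and the function `headerLevel` are hypothetical names for illustration only; nothing like them exists in the JATS reader, which inlines this logic in its `"sec"` and `"caption"` cases.

```haskell
-- | Hypothetical: the two strategies currently used by the reader.
data HeaderStrategy
  = Nested      -- like <sec>: one level below the current section level
  | FixedSmall  -- like <caption>: always level 6, wherever it occurs
  deriving (Eq, Show)

-- | Header level to emit, given the strategy and the current section level.
headerLevel :: HeaderStrategy -> Int -> Int
headerLevel Nested current = current + 1
headerLevel FixedSmall _   = 6
```

Which of the 31 elements would map to which strategy is exactly the open question here; the sketch does not try to answer it.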
Last but not least, I think this has clearly become an epic concerning `<title>` and `<label>` across all elements, not only a specific bug concerning the `<statement>` element. For instance, addressing this as an epic would absorb issues like https://github.com/jgm/pandoc/issues/7168, https://github.com/jgm/pandoc/issues/8364, and https://github.com/jgm/pandoc/issues/8365.
> As far as I understand the JATS reader has been written to comply with JATS Archiving and Interchange (the element `<x>`, present only in Archiving, is acknowledged in the isBlock function of the JATS reader), I suppose, on purpose, since it is the most complete of the three (it has the most elements, and the more options). Making Pandoc JATS Archiving compliant makes it compliant with all three(?).

There are 3 JATS tag sets outlined here: https://en.wikipedia.org/wiki/Journal_Article_Tag_Suite#Tag_sets.
Archiving and Interchange is the largest and most permissive. Publishing is a subset of Archiving and Authoring is a subset of Publishing. So you can think of it like: Archiving > Publishing > Authoring.
So, which version should Pandoc target or prefer?
Unfortunately, I don't think there's an obviously "right" answer to this question. As you point out, Archiving is the largest. But, fwiw, Publishing is the most commonly used (and I suspect by a wide margin).
Here's some additional context based on real-world use cases I have observed.
My opinion: If Pandoc only wants to target a single version of JATS, I would vote for Publishing. It's the most widely used and thus presumably the most useful. I suspect this is because creating JATS (especially full-text) is often very expensive, so publishers would never do this work for an article unless it were destined for publication (i.e. no one is making full-text JATS for a desk-rejected manuscript).
> So the real question is which one of the two strategies, the +1 level; or the fixed level 6 is appropriate (or if any other level is appropriate) for each of the 31 elements that contain `<title>`.

I think "the +1 level" is basically correct in essentially all cases.
IMO, `<caption>`, the counterexample you provided, is an odd case because it's not really a header at all. It's a figure / table title. While it's certainly header-like, I'd argue (a bit pedantically, perhaps) that it's semantically different.
But, of course, Markdown doesn't have any notion of figure / table title, so we gotta do our best. For this reason, any solution / header-level is likely to feel a bit odd. H6 makes sense under the circumstances, and I don't have a better solution.
FWIW, in our tooling, we handle figure / title parts differently than the rest of the document. We do so because, afaik, Pandoc's AST doesn't leave enough space to support our use cases (which I think is a totally defensible position given the huge variety of use cases y'all support).
> I think this has clearly become an epic concerning `<title>` and `<label>`

Agreed. We're happy to help with code changes. But at this point, I am unfortunately not sure about the next steps.
> So, which version should Pandoc target or prefer? Unfortunately, I don't think there's an obviously "right" answer to this question.

Why not? Pandoc is not a validation tool. If it supports more elements than it "should" for any given document, it is not going to produce an error. It will parse the document correctly anyway.
As far as I can gather from the JATS reader source code, all Pandoc does is consider the cases of selected elements, build Pandoc structures in the AST document for those elements, then continue parsing that element's children recursively. If an element does not have a case defined, then its children are parsed, and execution continues down the line until all that is left is plain text. So unless you specifically wipe an element out (as we are currently doing with `<title>` and `<label>`), no content will get ignored (it might not be prettily formatted, but it will be there in the final AST). That's it.
Therefore, supporting a specific JATS suite only means that enough significant elements proper to that suite have been included in the parsing cases, or in the helper functions. Now, as of the latest released version of Pandoc, the element `<x>`, which is only present in JATS Archiving and Interchange and not in either of the other two suites, does indeed have a case defined in the `isBlock` function. That is why I gather that, currently, Pandoc supports, or at least tries to support, JATS Archiving and Interchange.
This decision makes sense to me because, following the logic of case coverage, if it supports the most complete one, it supports all three of the JATS suites.
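The parsing behaviour described above (known elements get a case, unknown elements fall through to their children, and `<title>` / `<label>` are explicitly wiped) can be illustrated with a toy model. This is not pandoc's code: `Node` and `render` are made-up names, and output is plain strings instead of a pandoc AST.

```haskell
-- | Toy XML tree: an element with a name and children, or a text node.
data Node = Elem String [Node] | Text String

-- | Flatten a node into output paragraphs, mimicking the dispatch above.
render :: Node -> [String]
render (Text s) = [s]
render (Elem name kids) = case name of
  "p"     -> [concat (concatMap render kids)]  -- known case: build one paragraph
  "title" -> []                                -- explicitly wiped
  "label" -> []                                -- explicitly wiped
  _       -> concatMap render kids             -- no case: fall through to children
```

In this model a `<statement>` has no case of its own, so its `<p>` content survives, while its `<label>` disappears because the `"label"` case returns nothing.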
> As you point out, Archiving is the largest. But, fwiw, Publishing is the most commonly used (and I suspect by a wide margin).

I am working towards expanding the JATS reader to include support for BITS, which is an extension of JATS Archiving and Interchange. So even if the JATS reader did not support that now, when the BITS reader is ready, it will. And this does not mean it will not support the other two suites.
> I think this has clearly become an epic concerning `<title>` and `<label>` across all elements, not only a specific bug concerning the `<statement>` element.
>
> Agreed. We're happy to help with code changes. But at this point, I am unfortunately not sure about the next steps.

Same, I'm not sure if we can just propose changes, or at what point any given proposal gets the go ahead.
If you're both agreed on what should change, maybe you could give an executive summary here? I'm not a user of JATS myself, so I'm a bit lost in the details!
I believe the bottom line is that we need to write cases in the `parseBlock` function of the JATS reader for the following elements:
<abstract>
<ack>
<answer>
<answer-set>
<app>
<app-group>
<author-comment>
<author-notes>
<back>
<bio>
<def-list>
<disp-quote>
<explanation>
<fn-group>
<glossary>
<kwd-group>
<list>
<list-item>
<notes>
<option>
<question>
<question-preamble>
<question-wrap-group>
<ref-list>
<sec>
<statement>
<supplement>
<table-wrap-foot>
<trans-abstract>
<verse-group>
A reasonable solution would be to treat all those elements the same way we treat `<sec>`, that is, create a subsection with a header that is one level higher than the current header:
Because this solution addresses the issue for all elements that contain a `<title>`, we are not only addressing the original issue reported here, but also https://github.com/jgm/pandoc/issues/8364 and https://github.com/jgm/pandoc/issues/8365.
Do you agree, @coryschires?
@kamoe Well put. I agree with your summary, and think we can / should move forward with those changes.
A few additional notes:
Thanks for your helpful insights! I'll talk to my team and see if we'd be able to contribute to this change.
Thanks. Currently we have
"title" -> return mempty -- processed by header
Replacing this with something that parses the inline contents of the title element, emits a header at currentlevel + 1, and (unlike sect) does not modify currentlevel, would probably get you most of the way there. The only issue is that we'd have to continue processing "title" specially when it occurs in the initial info element.
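The suggestion above (a `"title"` case that emits a header at the current level + 1 without changing the current level) might look roughly like this in a toy model. All names here (`Node`, `Block`, `blocks`) are hypothetical, the level is passed as a plain argument instead of reader state, and the special handling of `<sec>`, `<label>`, and the initial info element is deliberately left out.

```haskell
-- | Toy XML tree and toy output blocks (not pandoc's types).
data Node = Elem String [Node] | Text String

data Block = Header Int String | Para String
  deriving (Eq, Show)

-- | Concatenated text content of a node.
textOf :: Node -> String
textOf (Text s)      = s
textOf (Elem _ kids) = concatMap textOf kids

-- | Convert a node, given the current section level.
blocks :: Int -> Node -> [Block]
blocks _ (Text s) = [Para s]
blocks lvl node@(Elem name kids) = case name of
  "p"     -> [Para (textOf node)]
  "title" -> [Header (lvl + 1) (textOf node)]  -- header emitted, level not modified
  _       -> concatMap (blocks lvl) kids       -- siblings still see the same level
```

For an `<ack>` containing a `<title>` and a `<p>` at level 1, this yields a level-2 header followed by the paragraph.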
`<info>` element in JATS. I suppose L330-332: https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L330-L332 are legacy from the DocBook reader. Could be cleaned up as: `headerText <- case filterChild (named "title") e of`
`sec`, dealing with children inside the parent case is unavoidable, as we want to create a single Header with both label and text together, i.e. : then
Which means the `label` text has been prepended to the `headerText` before the building of the Header, which cannot be achieved if `label` and text are dealt with as independent cases. So we might need to leave the case for `sec` as it is.
by anything that produces actual content, we will get a duplicate title when it comes to parsing the internal contents of `sec`, in L338?:
https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L338
Since `getBlocks` calls `parseBlock`, which would build a title.
For me, the ideal solution would be to take the processing of `label` and `title`, i.e. the lines at https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L326-L335, out of the `sect` function and into the respective cases of `title` and `label`, replacing the `mempty` lines, and proceed as you suggest. But then we will have to give up on having sections with one single Header with title and label on it.
I just reviewed the mechanics of `label` and `title` in JATS when they co-occur, and it really seems like having them both on the same heading is imperative (e.g. "Section 8 Technical background", where "Section 8" is the `label` and "Technical background" is the `title`).
As such, I don't think the ideal solution of dealing with `title` and `label` individually outside of `sect` is actually a good one. Hence I suggested simply writing the cases for the missing elements and leaving `sect` as it is. It's not nice or compact, but it's less problematic, and it's guaranteed to fix the issue.
But let me know if I am missing something important here that could make the ideal alternative work.
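The approach argued for above (the parent element builds one combined heading from its `<label>` and `<title>` children, while the standalone `"label"` / `"title"` cases keep returning nothing so the text is not duplicated) can be sketched as a toy model. Everything here is hypothetical: the types, the helper `childText`, and the three-element `titledElements` list, which is only an illustrative subset of the elements listed earlier in the thread.

```haskell
import Data.Maybe (catMaybes, listToMaybe)

-- | Toy XML tree and toy output blocks (not pandoc's types).
data Node = Elem String [Node] | Text String

data Block = Header Int String | Para String
  deriving (Eq, Show)

-- | Concatenated text content of a node.
textOf :: Node -> String
textOf (Text s)      = s
textOf (Elem _ kids) = concatMap textOf kids

-- | Text of the first direct child element with the given name, if any.
childText :: String -> [Node] -> Maybe String
childText name kids = listToMaybe [textOf k | k@(Elem n _) <- kids, n == name]

-- | Illustrative subset of elements that get a heading (assumption).
titledElements :: [String]
titledElements = ["sec", "statement", "ack"]

-- | Convert a node, given the current section level.
blocks :: Int -> Node -> [Block]
blocks _ (Text s) = [Para s]
blocks lvl node@(Elem name kids)
  | name == "p"                    = [Para (textOf node)]
  | name `elem` ["label", "title"] = []  -- consumed by the parent, never duplicated
  | name `elem` titledElements     = header ++ concatMap (blocks (lvl + 1)) kids
  | otherwise                      = concatMap (blocks lvl) kids
  where
    -- Label first, then title, joined into a single heading.
    heading = unwords (catMaybes [childText "label" kids, childText "title" kids])
    header  = [Header (lvl + 1) heading | not (null heading)]
```

The trade-off is the one noted above: every titled element needs to be listed, but label and title always end up in one heading and a lone label is still retained.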
@kamoe @jgm Just a quick FYI. On the JATS Standing Committee call today, I was able to ask the group:
If Pandoc only targets a single version of JATS, which version should it be?
They unanimously agreed that Pandoc should target JATS Archiving and Interchange. This is what @kamoe suggested above, so really just a +1 from the experts that our current plan is a good one.
Thanks for all of the direction on this @kamoe ! I'm going to take a crack at it
Got a fix up here: #8840 if anyone wants to take a look!
@noahmalmed Great work, thank you! I have finished adding all my comments, and have also asked a question on the PR wrt what the best way to proceed is to have this merged. Happy to help in any way I can.
@noahmalmed I'm afraid there is a problem with the current solution though. If you notice, it only solves the problem for `<title>`, but not for `<label>`. We are completely removing the `<label>` even if no `<title>` is present. I thought the rationale was to remove `<label>` if and only if a `<title>` was also present...
@coryschires What do you think?
@kamoe Oh shoot, I guess I misunderstood. @jgm , was the idea to only suppress label when title was present or did we want to just fully suppress it?
@noahmalmed Just to clarify, I think it might or might not make sense to leave as is, I am just pointing out that we might need to retain a lone `<label>` in certain circumstances.
I'm thinking more in terms of current JATS practice, like the examples given above by @coryschires originally when they opened the issue:

    <statement>
      <label>Hypothesis 1</label>
      <p>We hypothesize the following</p>
    </statement>

Would "Hypothesis 1" be something that we absolutely need to keep as a label (knowing there is no title)? Or is this something that will eventually be duplicated?
Problem
When converting from JATS to Markdown (or HTML and probably other outputs), both `<label>` and `<title>` tags are ignored when nested within `<statement>`.
Given the following (abridged) JATS XML:
The following command:
Will produce the following Markdown:
Solution / Expected Behavior
Instead, I would expect the label (or title) to be retained:
Steps to recreate
Pandoc Version
Sample JATS XML files
statement-with-label.xml
statement-with-title.xml
Commands to reproduce
Thanks for the help! If the fix is easy, we might be able to contribute a PR. Just point us in the right direction.