Output formats fail to retain JATS `<title>` and `<label>` when nested inside `<statement>`

coryschires commented 1 year ago

Problem

When converting from JATS to Markdown (or HTML and probably other outputs), both <label> and <title> tags are ignored when nested within <statement>.

Given the following (abridged) JATS XML:

<statement>
  <label>Hypothesis 1</label>
  <p>We hypothesize the following</p>
</statement>

The following command:

pandoc statement-with-label.xml -f jats -t markdown -o statement-with-label.md

Will produce the following Markdown:

#

We hypothesize the following

Solution / Expected Behavior

Instead, I would expect the label (or title) to be retained:

# Hypothesis 1

We hypothesize the following

Steps to recreate

Pandoc Version

pandoc 3.1
Features: +server +lua
Scripting engine: Lua 5.4

Sample JATS XML files

statement-with-label.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "journalpublishing3.dtd">
<article article-type="research-article" dtd-version="3.0" xml:lang="en" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <sec>
      <statement>
        <label>Hypothesis 1</label>
        <p>We hypothesize the following</p>
      </statement>
    </sec>
  </body>
</article>

statement-with-title.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "journalpublishing3.dtd">
<article article-type="research-article" dtd-version="3.0" xml:lang="en" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <sec>
      <statement>
        <title>Hypothesis 1</title>
        <p>We hypothesize the following</p>
      </statement>
    </sec>
  </body>
</article>

Commands to reproduce

# Markdown as output
pandoc statement-with-label.xml -f jats -t markdown -o statement-with-label.md
pandoc statement-with-title.xml -f jats -t markdown -o statement-with-title.md

# HTML as output
pandoc statement-with-label.xml -f jats -t html -o statement-with-label.html
pandoc statement-with-title.xml -f jats -t html -o statement-with-title.html

Thanks for the help! If the fix is easy, we might be able to contribute a PR. Just point us in the right direction.

tarleb commented 1 year ago

My approach would be to add a statement case to the parseBlock function in file src/Text/Pandoc/Readers/JATS.hs at line 164 ff. The local parseFigure and similar functions look like they could serve as a starting point.

kamoe commented 1 year ago

The way the parseBlock function is written in the JATS reader, it would seem to me that this will be the case for all elements containing <label> and <title>, not only <statement>. Lines 185-186 systematically filter them out:

"title" -> return mempty -- processed by header
"label" -> return mempty -- processed by header

This happens for all elements except <sec>, which handles <label> and <title> specifically, in lines 317-334:

sect n = do isbook <- gets jatsBook
            let n' = if isbook || n == 0 then n + 1 else n
            labelText <- case filterChild (named "label") e of
                           Just t -> (<> ("." <> space)) <$> getInlines t
                           Nothing -> return mempty
            headerText <- case filterChild (named "title") e `mplus`
                               (filterChild (named "info") e >>=
                                   filterChild (named "title")) of
                             Just t -> (labelText <>) <$> getInlines t
                             Nothing -> return mempty
            oldN <- gets jatsSectionLevel
            modify $ \st -> st{ jatsSectionLevel = n }
            b <- getBlocks e
            let ident = attrValue "id" e
            modify $ \st -> st{ jatsSectionLevel = oldN }
            return $ headerWith (ident,[],[]) n' headerText <> b

(invoked from line 180): "sec" -> gets jatsSectionLevel >>= sect . (+1)

So before defining an approach, I would like to ask, why was there an assumption that <title> and <label> were always section headings?

A solution would be to somehow expand on @tarleb's suggestion, but for all elements affected, e.g. <bio>, <notes>, <glossary>, <kwd-group>, <app>, <app-group>, <back>, <abstract>, <ack>. All of these contain <title> and <label> elements, which get filtered out when not inside a <sec>. But maybe adding a case for each one in parseBlock is not the most efficient course of action? Is there a way to address this from the root (lines 185-186)?
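The filtering described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `Node`, `Block`, `textOf`, `firstChild` and this simplified `parseBlock` are hypothetical stand-ins written for illustration, not the real code in src/Text/Pandoc/Readers/JATS.hs. It mirrors just the two `return mempty` cases and the special handling inside `sec`.

```haskell
-- Toy model only: hypothetical types and functions, not pandoc's JATS reader.

data Node
  = Elem String [Node]  -- element name and child nodes
  | Text String         -- character data

data Block
  = Header Int String
  | Para String
  deriving (Eq, Show)

-- Concatenated text content of a node.
textOf :: Node -> String
textOf (Text s)    = s
textOf (Elem _ cs) = concatMap textOf cs

-- First child element with the given name, if any.
firstChild :: String -> [Node] -> Maybe Node
firstChild name cs =
  case [c | c@(Elem n _) <- cs, n == name] of
    (c : _) -> Just c
    []      -> Nothing

-- Simplified stand-in for parseBlock; the Int is the current section level.
parseBlock :: Int -> Node -> [Block]
parseBlock _ (Text _) = []
parseBlock lvl e@(Elem name cs) =
  case name of
    "title" -> []  -- mirrors `return mempty`
    "label" -> []  -- mirrors `return mempty`
    "p"     -> [Para (textOf e)]
    "sec"   -> Header (lvl + 1) heading : concatMap (parseBlock (lvl + 1)) cs
    _       -> concatMap (parseBlock lvl) cs  -- no dedicated case: recurse
  where
    heading = maybe "" textOf (firstChild "title" cs)

statementWithLabel :: Node
statementWithLabel =
  Elem "statement"
    [ Elem "label" [Text "Hypothesis 1"]
    , Elem "p" [Text "We hypothesize the following"]
    ]

secWithTitle :: Node
secWithTitle =
  Elem "sec"
    [ Elem "title" [Text "Conclusions"]
    , Elem "p" [Text "We hypothesize the following"]
    ]

main :: IO ()
main = do
  print (parseBlock 0 statementWithLabel)
  -- [Para "We hypothesize the following"]
  print (parseBlock 0 secWithTitle)
  -- [Header 1 "Conclusions",Para "We hypothesize the following"]
```

In this model the label disappears only because nothing above it consumes it before the catch-all `"label"` case throws it away, which is the behaviour reported for <statement>.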

tarleb commented 1 year ago

@hamishmack I know that this is a long shot for code that was written 6 years ago, but do you happen to remember the reasons for why things mentioned above are the way they are?

hamishmack commented 1 year ago

The code for the JATS reader was based on the DocBook one. Perhaps this is just something we should have changed for JATS, but did not.

Here is the code from the DocBook.hs file at the time: https://github.com/jgm/pandoc/blob/5d3c9e56460165be452b672f12fc476e7a5ed3a9/src/Text/Pandoc/Readers/DocBook.hs#L893-L904

That code has changed only slightly and now looks like this: https://github.com/jgm/pandoc/blob/509cb9b8feae6798cb77bc35637297e9301d682e/src/Text/Pandoc/Readers/DocBook.hs#L1081-L1093

The two changes were https://github.com/jgm/pandoc/commit/12a35dd0d0f7363ad5b85ab859925113c65aa61f and https://github.com/jgm/pandoc/commit/40aa74badc2686b8b9a4ae7f836529cec2f4779b.

coryschires commented 1 year ago

+1 to https://github.com/jgm/pandoc/issues/8718#issuecomment-1489322480

I agree this is a more general problem than I initially understood. I encountered the same problem when working with JATS's <ack> tag. Like <statement>, <ack> can include nested <label> and / or <title> tags. These tags are similarly dropped when converting from JATS to MD.

Also, agreeing with https://github.com/jgm/pandoc/issues/8718#issuecomment-1489322480, I was able to work around this problem by converting the <ack> to <sec>. After making that change, the <label> / <title> are retained.

kamoe commented 1 year ago

@hamishmack I think the problem is not in the sect function but in parseBlock.

If the JATS reader was based on the DocBook one, I can see how it could be assumed that section headers would handle all possible occurrences of <title> and <label>. But the truth is, they don't. In the JATS reader, the two lines below completely write off the content of <title> and <label> elements outside of <sec> elements (which handle headers).

https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L185-L186

I can see that if we just remove these two lines, we will duplicate <title> and <label> content in <sec> elements. Therefore, one improvement that is in order is to consider where the JATS spec differs from the DocBook model: in particular, there are 30+ JATS elements other than <sec> that contain <title> and <label> elements, and therefore a large number of potentially missing cases in the parseBlock function, which could all be modeled after the cases of either <sec> or <caption>.

I am not sure if there is another more compact solution.
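To make the duplication concern concrete, here is a small self-contained sketch. Everything in it is an assumption made for illustration (`Node`, `Block`, `walk`, `naive`, `guarded` are hypothetical names, not pandoc functions). It compares a walker in which <sec> re-parses all of its children with one in which <sec> skips the <title> it has already turned into a header.

```haskell
-- Toy model only: hypothetical types and functions, not pandoc's JATS reader.

data Node
  = Elem String [Node]  -- element name and child nodes
  | Text String         -- character data

data Block
  = Header Int String
  | Para String
  deriving (Eq, Show)

-- Concatenated text content of a node.
textOf :: Node -> String
textOf (Text s)    = s
textOf (Elem _ cs) = concatMap textOf cs

isTitle :: Node -> Bool
isTitle (Elem "title" _) = True
isTitle _                = False

-- Header that a section builds from its own <title> child, if present.
secHeader :: Int -> [Node] -> [Block]
secHeader lvl cs = [Header (lvl + 1) (textOf t) | t <- cs, isTitle t]

-- Shared walker; `keep` decides which children of <sec> are parsed again.
walk :: ([Node] -> [Node]) -> Int -> Node -> [Block]
walk _ _ (Text _) = []
walk keep lvl e@(Elem name cs) =
  case name of
    "title" -> [Header (lvl + 1) (textOf e)]  -- title now emits content
    "p"     -> [Para (textOf e)]
    "sec"   -> secHeader lvl cs ++ concatMap (walk keep (lvl + 1)) (keep cs)
    _       -> concatMap (walk keep lvl) cs

-- Naive: <sec> re-parses every child, including the <title> it already used.
naive :: Int -> Node -> [Block]
naive = walk id

-- Guarded: <sec> skips the <title> it has already consumed.
guarded :: Int -> Node -> [Block]
guarded = walk (filter (not . isTitle))

sec :: Node
sec = Elem "sec" [Elem "title" [Text "Conclusions"], Elem "p" [Text "Done."]]

statement :: Node
statement =
  Elem "statement"
    [ Elem "title" [Text "Hypothesis 1"]
    , Elem "p" [Text "We hypothesize the following"]
    ]

main :: IO ()
main = do
  print (naive 0 sec)
  -- [Header 1 "Conclusions",Header 2 "Conclusions",Para "Done."]
  print (guarded 0 sec)
  -- [Header 1 "Conclusions",Para "Done."]
  print (guarded 0 statement)
  -- [Header 1 "Hypothesis 1",Para "We hypothesize the following"]
```

Whether filtering the children inside the section case is preferable to writing one case per element is a design question this sketch does not settle; it only shows where the duplicate would come from.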

jgm commented 1 year ago

The question is how the <title> should be represented in these other elements. In section, it becomes the contents of a Header element.

coryschires commented 1 year ago

:spiral_notepad: Based on the JATS Publishing Spec which differs slightly from the JATS Archiving Spec :spiral_notepad:

How should `<title>` be represented in these other elements?

TLDR: The JATS <title> tag should translate to a Header in all cases.

I've laid out all possible use cases in exhaustive detail. That said, I'll admit it's a little hard to give definitive answers in all cases because (a) the JATS spec is unfortunately loose and (b) many of these elements are not commonly seen in the wild (e.g. <question-preamble>), making it difficult to know how they are most commonly used.

Where possible, I based my answers on examples from the spec or cases I found in the wild (e.g. on PLOS ONE). When I couldn't find examples, I simply used my best judgement. (I'm certainly not the world's premier JATS expert but, fwiw, I am a member of the JATS Standing Committee so I'm not clueless).

The <title> tag may be contained in:

<abstract>
- Should render as Header element
<ack>
- Should render as Header element
<answer>
- Should render as Header element
<answer-set>
- Should render as Header element
<app>
- Should render as Header element
<app-group>
- Should render as Header element
<author-comment>
- Should render as Header element
<author-notes>
- Should render as Header element
<back>
- Should render as Header element
<bio>
- Should render as Header element
<caption>
- Should render as Header element
<def-list>
- Should render as Header element
<disp-quote>
- Should render as Header element
<explanation>
- Should render as Header element
<fn-group>
- Should render as Header element
<glossary>
- Should render as Header element
<kwd-group>
- Should render as Header element
<list>
- Should render as Header element
<list-item>
- Should render as Header element
<notes>
- Should render as Header element
<option>
- Should render as Header element
<question>
- Should render as Header element
<question-preamble>
- Should render as Header element
<question-wrap-group>
- Should render as Header element
<ref-list>
- Should render as Header element
<sec>
- Should render as Header element
<statement>
- Should render as Header element
<supplement>
- Should render as Header element
<table-wrap-foot>
- Should render as Header element
<trans-abstract>
- Should render as Header element
<verse-group>
- Should render as Header element

What's the difference between `<label>` and `<title>`?

TLDR: Both <label> and <title> should be converted into Header.

Same disclaimer as above: I'm doing my best to grapple with inherent (and unfortunate) ambiguity.

According to the JATS spec:

<label> – Number and/or prefix word placed at the beginning of display elements (for example, equation, statement, figure).
<title> – Heading or title for a structural element (for example, <sec>, <app>, <boxed-text>).

Furthermore, in most (perhaps all?) cases, an element (e.g. <sec>) can contain both a <label> and <title> – and they must appear in that order:

<sec>
  <label>3.</label>
  <title>Conclusions</title>
</sec>

However, it's also common to only have one of either <label> or <title>:

<sec>
  <title>Conclusions</title>
</sec>

<statement>
  <label>Hypothesis 1</label>
  <p>Buyer preferences for companies are influenced...</p>
</statement>

So how to handle these `<label>` / `<title>` combinations?

Given a <sec> includes both a <label> and <title>

<sec>
  <label>3.</label>
  <title>Conclusions</title>
</sec>

Then collapse them into a single Header

# 3. Conclusions

Given a <statement> includes only a <label>

<statement>
  <label>Hypothesis 1</label>
  <p>Buyer preferences for companies are influenced...</p>
</statement>

Then convert the <label> into a Header

# Hypothesis 1

Buyer preferences for companies are influenced...

Given a <sec> includes only a <title>

<sec>
  <title>Conclusions</title>
</sec>

Then convert the <title> into a Header

# Conclusions
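The three combinations above can be summarised as a single rule. The following is a minimal sketch, assuming a hypothetical helper named `headingText` (it is not a pandoc function) and plain strings instead of pandoc inlines: whichever of <label> and <title> is present goes into the header, and when both are present they are joined with a space, label first.

```haskell
import Data.Maybe (catMaybes)

-- Hypothetical helper, not part of pandoc: combine an optional label and an
-- optional title into the text of one header. Nothing means "emit no header".
headingText :: Maybe String -> Maybe String -> Maybe String
headingText label title =
  case catMaybes [label, title] of
    []    -> Nothing
    parts -> Just (unwords parts)

main :: IO ()
main = do
  print (headingText (Just "3.") (Just "Conclusions"))  -- Just "3. Conclusions"
  print (headingText (Just "Hypothesis 1") Nothing)     -- Just "Hypothesis 1"
  print (headingText Nothing (Just "Conclusions"))      -- Just "Conclusions"
  print (headingText Nothing Nothing)                   -- Nothing
```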

Hope this helps clarify possible next steps!

kamoe commented 1 year ago

As far as I understand, the JATS reader has been written to comply with JATS Archiving and Interchange (the element <x>, present only in Archiving, is acknowledged in the isBlock function of the JATS reader). I suppose this was on purpose, since it is the most complete of the three (it has the most elements and the most options). Making Pandoc compliant with JATS Archiving makes it compliant with all three(?). But please correct this assumption if it is wrong.

The JATS reader currently allows two elements to display <title> and <label> children. These are <sec> and <caption>. This is implemented with two cases inside the parseBlock function. The case for "sec" creates a Header one level higher than the current level:

https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L180

whilst the case for "caption" creates a Header of level 6, whatever the current level in the document is:

https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L193-L197

I believe this just means, if subsection, create a Header one section higher; if caption, just create a small caption that is not too overwhelming for the context of the document(?).

So the real question is which one of the two strategies, the +1 level; or the fixed level 6 is appropriate (or if any other level is appropriate) for each of the 31 elements that contain <title>.

My suggestion is that any element that contains a <sec>, or is recursive (contains itself, or an element that contains it), should create a +1 level Header (and I think this is the case for most elements), and any element that only holds immediate content can get away with the level 6 strategy. But it might not be that trivial.
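The two strategies can be written down as a small lookup. This is a hedged sketch: `Strategy`, `strategyFor` and `headerLevel` are hypothetical names, and the only fixed-level element listed is <caption>, because that is the only one this thread identifies as using level 6. Which other elements, if any, should use a fixed level is exactly the open question.

```haskell
-- Hypothetical sketch, not pandoc code.

data Strategy
  = Nested      -- header one level below the current section level ("+1")
  | Fixed Int   -- header at a fixed level, whatever the current level is
  deriving (Eq, Show)

-- Assumption: only <caption> is fixed; every other element nests.
strategyFor :: String -> Strategy
strategyFor "caption" = Fixed 6
strategyFor _         = Nested

-- Header level for a titled element, given the current section level.
headerLevel :: Int -> String -> Int
headerLevel current name =
  case strategyFor name of
    Nested  -> current + 1
    Fixed n -> n

main :: IO ()
main = do
  print (headerLevel 1 "statement")  -- 2
  print (headerLevel 1 "caption")    -- 6
  print (headerLevel 3 "ack")        -- 4
```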

Then, not all elements that contain <label> contain <title>. The approach for the label-only elements might be different.

Last but not least, I think this has clearly become an epic concerning <title> and <label> across all elements, not only a specific bug concerning the <statement> element. For instance, addressing this as an epic would absorb issues like https://github.com/jgm/pandoc/issues/7168, https://github.com/jgm/pandoc/issues/8364, and https://github.com/jgm/pandoc/issues/8365.

coryschires commented 1 year ago

> As far as I understand the JATS reader has been written to comply with JATS Archiving and Interchange (the element <x>, present only in Archiving, is acknowledged in the isBlock function of the JATS reader), I suppose, on purpose, since it is the most complete of the three (it has the most elements, and the more options). Making Pandoc JATS Archiving compliant makes it compliant with all three(?).

There are 3 JATS tag sets outlined here: https://en.wikipedia.org/wiki/Journal_Article_Tag_Suite#Tag_sets.

Journal Archiving and Interchange (Green)
Journal Publishing (Blue)
Article Authoring (Orange)

Archiving and Interchange is the largest and most permissive. Publishing is a subset of Archiving and Authoring is a subset of Publishing. So you can think of it like: Archiving > Publishing > Authoring.

So, which version should Pandoc target or prefer?

Unfortunately, I don't think there's an obviously "right" answer to this question. As you point out, Archiving is the largest. But, fwiw, Publishing is the most commonly used (and I suspect by a wide margin).

Here's some additional context based on real-world use cases I have observed.

Journal Archiving and Interchange (Green)
- Often used to capture and share manuscripts between platforms. To be clear, a manuscript is a non-published article which could be headed for publication or, perhaps, will never be published (because, for example, it was rejected).
Journal Publishing (Blue)
- Used by publishers for lots of stuff.
- Vendors (e.g. SilverChair) often accept JATS as an input format which they then convert to other formats (e.g. HTML) using internal tooling. Similarly PMC accepts Publishing JATS (tho they may also accept Archiving, I'm not sure).
- Archiving services (e.g. Portico, CLOCKSS) accept Publishing JATS – though I am pretty sure they would also accept Archiving.
- This JATS is often published / made available alongside other formats (e.g. both PLOS ONE and Scholastica allow JATS download for published articles).
Article Authoring (Orange)
- I've never seen this used in the wild. That said, I don't work on authoring tools, so this could very well be due to the limitations of my perspective / work.

My opinion: If Pandoc only wants to target a single version of JATS, I would vote for Publishing. It's the most widely used and thus presumably the most useful. I suspect this is because creating JATS (especially full-text) is often very expensive, so publishers would never do this work for an article unless it were destined for publication (i.e. no one is making full-text JATS for a desk-rejected manuscript).

> So the real question is which one of the two strategies, the +1 level; or the fixed level 6 is appropriate (or if any other level is appropriate) for each of the 31 elements that contain <title>.

I think "the +1 level" is basically correct in essentially all cases.

IMO, <caption>, the counter example you provided, is an odd case because it's not really a header at all. It's a figure / table title. While it's certainly header-like, I'd argue – a bit pedantically, perhaps – that it's semantically different.

But, of course, Markdown doesn't have any notion of figure / table title, so we gotta do our best. For this reason, any solution / header-level is likely to feel a bit odd. H6 makes sense under the circumstances, and I don't have a better solution.

FWIW, in our tooling, we handle figure / title parts differently than the rest of the document. We do so because, afaik, Pandoc's AST doesn't leave enough space to support our use cases (which I think is a totally defensible position given the huge variety of use cases y'all support).

> I think this has clearly become an epic concerning <title> and <label> across all elements, not only a specific bug concerning the <statement> element.

Agreed. We're happy to help with code changes. But at this point, I am unfortunately not sure about the next steps.

kamoe commented 1 year ago

> So, which version should Pandoc target or prefer? Unfortunately, I don't think there's an obviously "right" answer to this question.

Why not? Pandoc is not a validation tool. If it supports more elements than it "should" for any given document, it is not going to produce an error. It will parse the document correctly anyway.

As far as I can gather from the JATS reader source code, all Pandoc does is consider the cases of selected elements, build Pandoc structures in the AST document for those elements, and then continue parsing each element's children recursively. If an element does not have a case defined, then its children are parsed, and execution continues down the line, until all that is left is plain text. So unless you specifically wipe an element off (as we are currently doing with <title> and <label>), no content will get ignored (it might not be prettily formatted, but it will be there in the final AST). That's it.

Therefore, supporting a specific JATS suite only means that enough significant elements proper to that suite have been included in the parsing cases, or in the helper functions. Now, as of the latest released version of Pandoc, the element <x>, which is only present in JATS Archiving and Interchange, and not in either of the two other suites, does indeed have a case defined in the isBlock function. That is why I gather that, currently, Pandoc supports, or at least tries to support, JATS Archiving and Interchange.

This decision makes sense to me because, following the logic of case coverage, if it supports the most complete one, it supports all three of the JATS suites.

> As you point out, Archiving is the largest. But, fwiw, Publishing is the most commonly used (and I suspect by a wide margin).

I am working towards expanding the JATS reader to include support for BITS (https://github.com/jgm/pandoc/issues/7740), which is an extension of JATS Archiving and Interchange. So even if the JATS reader did not support that now, when the BITS reader is ready, it will. And this does not mean it will not support the other two suites.

> > I think this has clearly become an epic concerning <title> and <label> across all elements, not only a specific bug concerning the <statement> element.
>
> Agreed. We're happy to help with code changes. But at this point, I am unfortunately not sure about the next steps.

Same, I'm not sure if we can just propose changes, or at what point any given proposal gets the go ahead.

jgm commented 1 year ago

If you're both agreed on what should change, maybe you could give an executive summary here? I'm not a user of JATS myself, so I'm a bit lost in the details!

kamoe commented 1 year ago

I believe the bottom line is, we need to write cases in the parseBlock function of the JATS reader, for the following elements:

<abstract>, <ack>, <answer>, <answer-set>, <app>, <app-group>, <author-comment>, <author-notes>, <back>, <bio>, <def-list>, <disp-quote>, <explanation>, <fn-group>, <glossary>, <kwd-group>, <list>, <list-item>, <notes>, <option>, <question>, <question-preamble>, <question-wrap-group>, <ref-list>, <sec>, <statement>, <supplement>, <table-wrap-foot>, <trans-abstract>, <verse-group>

A reasonable solution would be if we just treated all those elements the same way we treat <sec>, that is, create a subsection with a header that is one level higher than the current header:

https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L180

Because this solution addresses the issue for all elements that contain a <title>, we are not only addressing the original issue reported here, but also https://github.com/jgm/pandoc/issues/8364, and https://github.com/jgm/pandoc/issues/8365.

Do you agree, @coryschires?

coryschires commented 1 year ago

@kamoe Well put. I agree with your summary, and think we can / should move forward with those changes.

A few additional notes:

- Regarding the goal of targeting JATS Archiving and Interchange... No objections from me. While I target Publishing, I am yet to encounter an issue / conflict between Archiving and Publishing. So, overall, I'd say it's not a problem until it's a problem (which may be never).
- If we encounter cases where "the +1 level" solution is inappropriate, we can carve out a condition at that time. I am at least confident that "the +1 level" is correct for the majority of cases.

Thanks for your helpful insights! I'll talk to my team and see if we'd be able to contribute to this change.

jgm commented 1 year ago

Thanks. Currently we have

"title" -> return mempty -- processed by header

Replacing this with something that parses the inline contents of the title element, emits a header at currentlevel + 1, and (unlike sect) does not modify currentlevel, would probably get you most of the way there. The only issue is that we'd have to continue processing "title" specially when it occurs in the initial info element.

kamoe commented 1 year ago

1. I was going to point this out earlier but forgot: There is no <info> element in JATS. I suppose L330-332: https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L330-L332 are legacy from the DocBook reader. Could be cleaned up as:

headerText <- case filterChild (named "title") e of

2. I've always thought that it is best to deal with elements directly as they occur, not as children of other elements, as this creates the danger of either duplicating those children down the recursion line or, in an attempt to avoid duplication, their unintentional removal (which is precisely what has happened with titles). However, it seems to me that in the particular case of sec, dealing with children inside the parent case is unavoidable, as we want to create a single Header with both label and text together, i.e.:

https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L330-L335

then

https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L341

Which means the label text has been prepended to the headerText before the Header is built, which cannot be achieved if label and title are dealt with as independent cases. So we might need to leave the case for sec as it is.

3. But then, if we were to replace https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L185 by anything that produces actual content, we will get a duplicate title when it comes to parsing the internal contents of sec, in L338?: https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L338

Since getBlocks calls parseBlock, which would build a title.

For me, the ideal solution would be to take the processing of label and title, i.e. the lines below: https://github.com/jgm/pandoc/blob/16f28ef5e945f3be14e05afb7d91f8adca18e49a/src/Text/Pandoc/Readers/JATS.hs#L326-L335 out of the sect function and into the respective cases of title and label, replacing the mempty lines, and proceed as you suggest. But then we will have to give up on having Sections with one single Header with title and label on it.

kamoe commented 1 year ago

I just reviewed the mechanics of label and title in JATS when they co-occur, and it really seems like having them both on the same heading is imperative (e.g. "Section 8 Technical background", where "Section 8" is the label and "Technical background" is the title).

As such, I don't think the ideal solution of dealing with title and label individually outside of sect is actually a good one. Hence why I suggested simply writing the cases for the missing elements (https://github.com/jgm/pandoc/issues/8718#issuecomment-1531045526) and leaving sect as it is. It's not nice or compact, but it's less problematic, and it's guaranteed to fix the issue.

But let me know if I am missing something important here that could make the ideal alternative work.

coryschires commented 1 year ago

@kamoe @jgm Just a quick FYI. On the JATS Standing Committee call today, I was able to ask the group:

> If Pandoc only targets a single version of JATS, which version should it be?

They unanimously agreed that Pandoc should target JATS Archiving and Interchange. This is what @kamoe suggested above, so really just a +1 from the experts that our current plan is a good one.

noahmalmed commented 1 year ago

Thanks for all of the direction on this @kamoe! I'm going to take a crack at it.

noahmalmed commented 1 year ago

Got a fix up here: #8840 if anyone wants to take a look!

kamoe commented 1 year ago

@noahmalmed Great work, thank you! I have finished adding all my comments, and have also asked a question on the PR wrt what the best way to proceed is to have this merged. Happy to help in any way I can.

kamoe commented 1 year ago

@noahmalmed I'm afraid there is a problem with the current solution though. If you notice, it only solves the problem for <title>, but not for <label>. We are completely removing the <label> even if no <title> is present. I thought the rationale was to remove <label> if and only if a <title> was also present...

@coryschires What do you think?

noahmalmed commented 1 year ago

@kamoe Oh shoot, I guess I misunderstood. @jgm, was the idea to only suppress label when title was present or did we want to just fully suppress it?

kamoe commented 1 year ago

@noahmalmed Just to clarify, I think it might or might not make sense to leave as is, I am just pointing out that we might need to retain a lone <label> in certain circumstances.

I'm thinking more in terms of current JATS practice, like the examples given above by @coryschires originally when they opened the issue:

<statement>
  <label>Hypothesis 1</label>
  <p>We hypothesize the following</p>
</statement>

Would "Hypothesis 1" be something that we might absolutely need to keep as label? (knowing there is no title). Or is this something that will eventually be duplicated?
