Open castedo opened 2 years ago
I'm tempted to say: if you don't want level-1 headings in the abstract, don't use Markdown `#` in that context...
If you don't have control over that, another option is to use a Lua filter that converts level 1 headings in an abstract to something else.
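As a rough illustration of the filter idea, here is a minimal sketch written as a pandoc JSON filter in Python rather than Lua (pandoc accepts both kinds of filter). It assumes the JATS reader stores the abstract in the `abstract` metadata field as a list of blocks, which matches the `$abstract$` template variable mentioned in this thread. The names `shift_headers` and `fix_abstract` and the `shift` parameter are made up for this example, and only top-level headers in the abstract are handled.

```python
import json


def shift_headers(blocks, shift=1):
    """Return a new block list with every top-level Header demoted by `shift`.

    Levels are capped at 6, the deepest heading level HTML offers.
    """
    result = []
    for block in blocks:
        if block.get("t") == "Header":
            level, attr, inlines = block["c"]
            block = {"t": "Header", "c": [min(level + shift, 6), attr, inlines]}
        result.append(block)
    return result


def fix_abstract(doc, shift=1):
    """Demote headers found in the `abstract` metadata field, if there is one."""
    abstract = doc.get("meta", {}).get("abstract")
    if isinstance(abstract, dict) and abstract.get("t") == "MetaBlocks":
        abstract["c"] = shift_headers(abstract["c"], shift)
    return doc


# Small hand-written AST fragment (assumed shape) to try the functions on.
sample = {
    "meta": {
        "abstract": {
            "t": "MetaBlocks",
            "c": [
                {"t": "Header",
                 "c": [1, ["objective", [], []], [{"t": "Str", "c": "Objective"}]]},
                {"t": "Para", "c": [{"t": "Str", "c": "To"}]},
            ],
        }
    },
    "blocks": [],
}
print(json.dumps(fix_abstract(sample)["meta"]["abstract"]["c"][0]))
```

To use this as an actual filter, the script would read the JSON document from stdin and write the result to stdout, for example `json.dump(fix_abstract(json.load(sys.stdin)), sys.stdout)`, and be passed to pandoc with `--filter`.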
In hindsight, my repro steps are confusing in that I start with `start.md`. I should have focused the repro steps on starting with the `jats.xml`. As more of a side note, the `jats.xml` is generated quite well from the `start.md` I include. So pandoc works very well writing JATS XML structured abstracts, just not so well reading them.
So having `<sec>` elements in the `<abstract>` element in JATS XML is out of my control, so to speak.
I haven't learned how to write Lua filters, but that sounds like a reasonable approach if one wants to generate HTML and LaTeX from JATS XML.
In my particular situation I have a quick workaround, so I'm good for now. For the long term, I suspect I will want to upgrade my JATS -> HTML/LaTeX conversion from the Swiss Army knife that is pandoc to a more specialized knife that only cuts JATS. It's amazing that pandoc can convert so much to so much! But I bet it's inevitable I'll want to upgrade to a specialized JATS -> HTML/LaTeX solution soon.
To help clarify this issue, here is a summary. The attached JATS XML example is (roughly):
<?xml ... ?>
<!DOCTYPE ... >
<article ...>
  <front>
    <article-meta>
      <abstract>
        <sec id="objective">
          <title>Objective</title>
          <p>To examine the effectiveness of day hospital attendance</p>
        </sec>
        <sec id="design">
          <title>Design</title>
          <p>Systematic review of 12 controlled clinical trials</p>
        </sec>
        <sec id="subjects">
          <title>Subjects</title>
          <p>2867 elderly people.</p>
        </sec>
      </abstract>
      ...
    </article-meta>
    ...
  </front>
  ...
</article>
which pandoc converts to (roughly):
<html ...>
  ...
  <body>
    <header id="title-block-header">
      <h1 class="title">JATS an abstract</h1>
      <div class="abstract">
        <div class="abstract-title">Abstract</div>
        <h1 id="objective">Objective</h1>
        <p>To examine the effectiveness of day hospital attendance</p>
        <h1 id="design">Design</h1>
        <p>Systematic review of 12 controlled clinical trials</p>
        <h1 id="subjects">Subjects</h1>
        <p>2867 elderly people.</p>
      </div>
    </header>
  </body>
</html>
where the pandoc template variable `$abstract$` gets the value:
<h1 id="objective">Objective</h1>
<p>To examine the effectiveness of day hospital attendance</p>
<h1 id="design">Design</h1>
<p>Systematic review of 12 controlled clinical trials</p>
<h1 id="subjects">Subjects</h1>
<p>2867 elderly people.</p>
So the issue here is that pandoc is converting JATS `<article ...><front><article-meta><abstract><sec><title>` to HTML `<h1>`. This is essentially never going to be the HTML that somebody wants for an abstract that is embedded inside a full-text document. There should only be one `<h1>` for the document, and it should not be inside the abstract.
The root cause here is that `front//abstract/sec` elements are using the same function as `body//abstract/sec` elements, and I see why the outcome should be different.
A solution to this could be to write a customized treatment for `front//abstract` elements, changing the default recursive line below, inside the `getAbstract` function:
to a behaviour that processes the inner `<sec>`s without adding a header at level current+1 (which is the behaviour for `<sec>`s inside `<body>` that is currently applied in `parseBlock` and, by transitivity, in `getBlocks`). This could be achieved by an analogous "front" function, e.g. a `getFrontBlocks` function, that does not add headers at current level+1.
Sounds like a promising idea. Thanks for thinking it out! However, to be honest, I barely understand the code. I'm not very fluent in Haskell.
A net result that seems like a big improvement is something like:
<article ...>
  <front>
    <article-meta>
      <abstract>
        <sec id="objective">
          <title>Objective</title>
          <p>To examine the effectiveness of day hospital attendance</p>
        </sec>
getting converted to
<html ...>
  ...
  <body>
    <header id="title-block-header">
      ...
      <div class="abstract">
        <div class="abstract-title">Abstract</div>
        <div class="abstract-subtitle-1" id="objective">Objective</div>
        <p>To examine the effectiveness of day hospital attendance</p>
But if it's easier to output `<h2>` rather than `<div class="abstract-subtitle-1">`, that is certainly better than outputting `<h1>`.
An easier fix would be to add a function to the abstract processing that just converts the Header elements to something more appropriate. [EDIT: of course, this can be done already using a filter.]
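A filter doing that conversion could look roughly like the sketch below, written as a pandoc JSON filter in Python and not as a change to pandoc itself. It replaces each top-level Header in the `abstract` metadata field with a Div whose class encodes the old heading level, along the lines of the `abstract-subtitle-1` output proposed above. The names `header_to_div` and `rewrite_abstract` and the `class_prefix` parameter are hypothetical, and the exact HTML pandoc writes for such a Div has not been checked here.

```python
import json


def header_to_div(block, class_prefix="abstract-subtitle-"):
    """Turn a Header block into a Div with a level-specific class.

    The Header's id, classes and key-value attributes are kept, and its
    inline content is wrapped in a Plain block. Other blocks pass through.
    """
    if block.get("t") != "Header":
        return block
    level, (ident, classes, attrs), inlines = block["c"]
    new_attr = [ident, classes + [f"{class_prefix}{level}"], attrs]
    return {"t": "Div", "c": [new_attr, [{"t": "Plain", "c": inlines}]]}


def rewrite_abstract(doc):
    """Apply header_to_div to the top-level blocks of the `abstract` field."""
    abstract = doc.get("meta", {}).get("abstract")
    if isinstance(abstract, dict) and abstract.get("t") == "MetaBlocks":
        abstract["c"] = [header_to_div(b) for b in abstract["c"]]
    return doc


# Hand-written AST fragment (assumed shape) for a quick check.
sample = {
    "meta": {
        "abstract": {
            "t": "MetaBlocks",
            "c": [
                {"t": "Header",
                 "c": [1, ["objective", [], []], [{"t": "Str", "c": "Objective"}]]},
                {"t": "Para", "c": [{"t": "Str", "c": "To"}]},
            ],
        }
    },
    "blocks": [],
}
print(json.dumps(rewrite_abstract(sample)["meta"]["abstract"]["c"][0]))
```

Run as a real filter, the script would wrap `rewrite_abstract` around `json.load(sys.stdin)` and dump the result to stdout, then be passed to pandoc via `--filter`. Styling of the resulting classes is left to CSS.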
FWIW, my really easy fix is to just not use headers in abstracts. :sweat_smile:
So as an author I do this instead of authoring section headers:
\textbf{AUDIENCE}: Developers and early adopters of tools and services for research communication.
\textbf{STAGE}: Edition 2 planned. Feedback welcome.
which after `LaTeX -> JATS -> HTML` ends up not looking too bad:
https://perm.pub/H5NOlCVM9P5Vv4LbeuwJsaME8kM
but it is not very semantic.
Actually, an even easier solution would be to wrap the `getBlocks` line in manipulations of the header level, setting it to an agreed value, in much the same way it is (or used to be) done in the treatment for `sec` here:
So the `getAbstract` function would look like:
getAbstract :: PandocMonad m => Element -> JATS m ()
getAbstract e =
  case filterElement (named "abstract") e of
    Just s -> do
      oldN <- gets jatsSectionLevel
      modify $ \st -> st{ jatsSectionLevel = 6 } -- or whatever level we agree on for front headers
      blks <- getBlocks s
      modify $ \st -> st{ jatsSectionLevel = oldN }
      addMeta "abstract" blks
    Nothing -> pure ()
To be honest, this level of JATS processing is probably beyond the scope of pandoc. I imagine that at some point there is a level of JATS-specific semantics for which pandoc is no longer the right tool for the job. So I'm labeling this as an enhancement.
Nonetheless, I report the limitation here with pandoc 2.18.
REPO STEPS With source.md
GOT jats.xml.txt got.tex.txt got.html.txt
EXPECTED
The abstract to NOT be the same conversion of JATS XML `<sec>`, `<title>`, `<p>` elements that is done in the body. Rather, it should be something semantic for the abstract section.
For instance, the HTML generated for the abstract is
which isn't really right because `Objective` is not an h1-level heading. Although CSS is powerful enough to hack around this, it would be more appropriate to output something like:
In the case of LaTeX, the current output is:
My guess is there is a way to hack around this in LaTeX, but I'm not as knowledgeable with LaTeX as with HTML/CSS.
Currently the default output looks pretty bad for JATS structured abstracts in both default HTML and LaTeX.